
SYBASE REPLICATION SERVER PERFORMANCE AND TUNING

Understanding and Achieving Optimal Performance with Sybase Replication Server

ver 2.0.1


Table of Contents

Author's Note
Introduction
    Document Scope
    Major Changes in this Document
Overview and Review
    Replication System Components
    RSSD or Embedded RSSD (eRSSD)
    Replication Server Internal Processing
    Analyzing Replication System Performance
Primary Dataserver/Database
    Dataserver Configuration Parameters
    Primary Database Transaction Log
    Application/Database Design
Replication Agent Processing
    Secondary Truncation Point Management
    Rep Agent LTL Generation
    Replication Agent Communications
    Replication Agent Tuning
    Replication Agent Troubleshooting
Replication Server General Tuning
    Replication Server/RSSD Hosting
    RS Generic Tuning
    RSSD Generic Tuning
    STS Tuning
    RSM/SMS Monitoring
    RS Monitor Counters
    Impact on Replication
    RS M&C Analysis Repository
    RS_Ticket
Inbound Processing
    RepAgent User (Executor)
    SQM Processing
    SQT Processing
    Distributor (DIST) Processing
    Minimal Column Replication
Outbound Queue Processing
    DSI SQM Processing
    DSI SQT Processing
    DSI Transaction Grouping
    DSIEXEC Function String Generation
    DSIEXEC Command Batching
    DSIEXEC Execution
    DSIEXEC Execution Monitor Counters
    DSI Post-Execution Processing
    End-to-End Summary
Replicate Dataserver/Database
    Maintenance User Performance Monitoring
    Warm Standby, MSA and the Need for RepDefs
    Query Related Causes
    Triggers & Stored Procedures
    Concurrency Issues
Procedure Replication
    Procedure vs. Table Replication
    Procedure Replication & Performance
    Procedure Transaction Control
    Procedures & Grouped Transactions
    Procedures with "Select/Into"
Replication Routes
    Routing Architectures
    Routing Internals
    Routing Performance Advantages
    Routing Performance Tuning
Parallel DSI Performance
    Need for Parallel DSI
    Parallel DSI Internals
    Serialization Methods
    Transaction Execution Sequence
    Large Transaction Processing
    Maximizing Performance with Parallel DSI's
    Tuning Parallel DSI's with Monitor Counters
Text/Image Replication
    Text/Image Datatype Support
    RS Implementation & Internals
    Performance Implications
Asynchronous Request Functions
    Purpose
    Implementation & Internals
    Performance Implications
Multiple DSI's
    Concepts & Terminology
    Performance Benefits
    Implementation
    Business Cases
Integration with EAI
    Replication vs. Messaging
    Integrating Replication & Messaging
    Performance Benefits of Integration
    Messaging Conclusion


Author’s Note

Thinking is hard work – “Silver Bullets” are much easier. Several years ago, when Replication Server 11.0 was fairly new, Replication Server Engineering (RSE) collaborated on a paper that was a help to us all. Since that time, Replication Server has gone through several releases and Replication Server Engineering has been too busy keeping up with the advances in Adaptive Server Enterprise and the future of Replication Server to maintain the document. However, the requests for a paper such as this have been a frequent occurrence, both internally as well as from customers. Hopefully, this paper will satisfy those requests. But as the above comment suggests, reading this paper will require extensive thinking (and considerable time). Anyone hoping for a “silver bullet” does not belong in the IT industry.

This paper was written for and addresses the functionality in Replication Server 12.6 and 15.0 with Adaptive Server Enterprise 12.5.2 through 15.0.1 (Rep Agent and MDA Tables). As the Replication Server product continues to be developed and improved, it is likely that later improvements to the product may supersede the recommendations contained in this paper.

It is assumed that the reader is familiar with Replication Server terminology, internal processing and in general the contents of the Replication Server System Administration Guide. In addition, basic Adaptive Server Enterprise performance and tuning knowledge is considered critical to the success of any Replication System’s performance analysis.

This document could not have been written without the considerable contributions of the Replication Server engineering team, Technical Support Engineers, and the collective Replication Server community of consultants, educators, etc., who are always willing to share their knowledge. Thank you.

Document Version: 2.0.1

January 7, 2007


Introduction

"Just How Fast Is It?" This question gets asked constantly. Unfortunately, there are no standard benchmarks such as TPC-C for replication technologies, and RSE has neither the bandwidth nor the resources to do benchmarking. Consequently, the stock RSE reply used to be 5MB/min (or 300MB/hr) based on their limited testing on development machines (small ones at that). However, Replication Server has been clocked at 2.4GB/hr sustained in a 1.2TB database and more than 40GB has been replicated in a single day into the same 1.2TB database (RS 12.0 and ASE 11.9.3 on Compaq Alpha GS140's for the curious). Additionally, some customers have claimed that by using multiple DSI's, they have achieved 10,000,000 transactions an hour!! Although this sounds unrealistic, a monitored benchmark in 1995 using Replication Server 10.5 achieved 4,000,000 transactions (each with 10 write operations) a day from the source replicating to three destinations (each with only 5 DSI's) for a total delivery of 12,000,000 transactions per day (containing 120,000,000 write operations). Lately, RS 12.6 has been able to sustain ~3,000 rows/sec on a dual 3.0 GHz P4 XEON with internal SCSI disks. As usual, your results may vary. Significantly. It all depends. And every other trite caveat muttered by a tuning guru/educator/consultant. Of course, your expectations also need to be realistic.

Implementers need to be realistic as well. Product management recently got a call from a customer asking if Replication Server could replicate 20GB of data in 15 minutes. The reality is that this is likely not achievable even using raw file I/O streaming commands such as the Unix dd command – let alone via a process that needs to inspect the data values and decide on subscription rules.

Replication Server is a highly configurable and highly tunable product. However, that places considerable responsibility on system designers and implementers to design and implement an efficient data movement strategy – and on operations staff to monitor, tune and adjust the implementation as necessary.

The goal of this paper is to educate the reader so that they understand why they may be seeing the performance they are, and to suggest possible avenues to explore with the goal of improved performance – without resorting to the old tried-and-true trial-and-error stumble-fumble. Because performance and tuning is so dependent on the situation, it is doubtful that attempting to read this paper in a single sitting will be beneficial. Those familiar with Replication Server may want to skip to the specific detail sections that are applicable to their situation.

Document Scope

Before we begin, however, it is best to lay some ground rules about what to expect or not to expect from this paper. Focusing on the latter:

• This paper will not discuss database server performance and tuning (although it frequently is the cause of poor replication performance) except as required for replication processing.

• This paper will not discuss non-ASE RepAgent performance (perhaps it will in a future version) except where such statements can be made generically about RepAgents.

• This paper will not discuss Replication Server Manager.
• This paper will not discuss how to "benchmark" a replicated system.
• This paper will not discuss Replication Server system administration.

Now that we know what we are going to skip, what we will cover:

• This paper will discuss all of the components in a replication system and how each impacts performance.
• This paper will discuss the internal processing of the Replication Server, ASE's Replication Agent and the corresponding tuning parameters that are specific for performance.

It is expected that the reader is already familiar with Replication Server internal processing and basic replication terminology as described in the product manuals. This paper focuses heavily on Replication Server in an Adaptive Server Enterprise environment.

In the future, it is expected that this paper will be expanded to cover several topics only lightly addressed in this version or not addressed at all. In the past, this list mostly focused on broader topics such as routing and heterogeneous replication. Routing has since been added, while heterogeneous replication is now documented in the Replication Server documentation. As a result, future topics will likely be new features added to existing functionality – much like the discussions of DSI partitioning (new in 12.5) and DSI commit control (new in 12.6) that have been added to Parallel DSI's.


Major Changes in this Document

Because many people have read earlier versions of this document, the following sections list the topics added to the respective sections, allowing those readers to skip directly to the applicable sections for the updated information. An attempt was made to red-line the changed sections, including minor changes not noted in these lists. However, this document is produced using MS Word - which provides extremely rudimentary, inconsistent (and sometimes not persistent) and unreliable red-lining capabilities (it also crashes frequently during spell checking and hates numerical list formats... one wonders how Microsoft produces their own documentation with such unreliable tools). As a result, red-lining will not be used to denote changes.

Updates 1.6 to 1.9

The following additions were made to this document in v1.9 as compared to v1.6:

• Batch processing – Added NT informal benchmark with 750-1,000 rows/second
• Batch processing – Added trick to show how to replicate the SQL statement itself instead of the rows.
• Batch processing – Added discussion about ignore_dupe_key and CLR records with impact on RS
• Rep Agent processing – Added description of sp_help_rep_agent dbname, "scan" with example to clarify output of start/end/current markers and log recs scanned.
• Monitors & Counters – Added information about join to rs_databases and recommendation to increase RSSD size, add view to span counter tables, etc.
• Rep Agent User Thread – Expanded section to include processing & diagram
• SQM Thread – Added diagram to illustrate message queues
• DIST Thread – Expanded discussion on SRE, TD & MD
• Parallel DSI – Expanded discussion on transaction execution sequence to cover "disappearing updates" more thoroughly.
• Routing – Added section.
• RS & EAI – Added section.

Updates 1.9 to 2.0

The following additions were made to this document in v2.0 as compared to v1.9:

• RS Overview – Added description of embedded RSSD
• RS Internals – Discussion on SMP feature and internal threading
• Application Design – Impact of "Chained Mode" on RepAgent throughput and RS processing
• Application Design – Further emphasized the impact of high-impact SQL statements and the fact that the latency is driven by the replicate DBMS vs. RS itself, including a benchmark from a financial trading system.
• Rep Agent Tuning – Added discussion on sp_sysmon repagent output as well as using MDA tables.
• RS General Tuning – Discussion on SMP feature and impact on configuration parameters such as num_mutexes, etc.
• RS General Tuning – Added discussion about rs_ticket
• RS General Tuning – Added 12.6 and 15.0 counters to each section with samples
• RS General Tuning – Discussion about embedded RSSD & tuning
• Routes – Added 12.6 and 15.0 counters and discussion about load balancing using multiple routes in multi-database configurations.
• Parallel DSI – Updated for commit control
• Parallel DSI – Added discussion about MDA-based monitor tables to detect contention, SQL tracking, and RS performance metrics
• Replicate Dataserver/Database – Removed somewhat outdated section on Historical Server and added new material on monitoring with MDA tables and in particular a lot of details on using the WaitEvents and the monOpenObjectActivity/monSysStatement tables. Because of the depth of detail, this not only replaces the section on the legacy Historical Server, but also replaces the section on the Replicate DBMS resources.
• Procedure Replication – Added discussion on using procedures to emulate dynamic SQL (fully prepared statements) and performance gains as a result at the replicate database.
• Text Replication – Added discussion about changes in ASE 15.0.1 that allow the use of a global unique nonclustered index on the text pointer instead of the mass TIPSA update when marking tables with text for replication.


Overview and Review

Where Do We Start? Unfortunately, this is the same question that is asked by someone faced with the task of finding and resolving throughput performance problems in a distributed system. The last words of that sentence hold the key…it's a distributed system. That means that there are lots of pieces and parts that contribute to Replication Server performance – most of which are outside of the Replication Server. After the system has been in operation, there are several RS commands that will help isolate where to begin. However, if you are just designing the system and wish to take performance into consideration during the design phase (always a must for scalable systems), then the easiest place to begin is the beginning. Accordingly, this paper will attempt to trace a bit of data being replicated through the system. Along the way, the various threads, processes, etc. will be described to help the reader understand what is happening (or should happen?) at each stage of data movement. After getting the data to the replicate site, a number of topics will be discussed in greater detail. These topics include text/image replication, parallel DSI's, etc. A quick review of the components in a replication system and the internal processing within Replication Server is provided in the next sections.
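For reference, a minimal isql session against the Replication Server using a few of the commands alluded to above might look like the following: admin who_is_down lists threads that are down or suspended; admin who, sqm, admin who, sqt and admin who, dsi report per-thread stable queue, SQT cache and DSI status; and admin disk_space reports stable device usage. This is only a starting checklist, not a complete diagnostic procedure:

admin who_is_down
go
admin who, sqm
go
admin who, sqt
go
admin who, dsi
go
admin disk_space
go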

Replication System Components

The components in a basic replication system are illustrated below. For clarity, the same abbreviations used in the product manuals as well as educational materials are used. The only additions over the pictures in the product manuals are the inclusion of SMS – in particular, Replication Server Manager (RSM) – and the inclusion of the host for the RS/RSSD.

Figure 1 – Components of a Simple Replication System

Of course, the above is extremely simple – the basic single direction primary to replicate distributed system, one example of which is the typical Warm-Standby configuration.

Whether for performance reasons or due to architectural requirements, often the system design involves more than one RS. A quick illustration is included below:


Figure 2 – Components of a Replication System Involving More Than One RS

The above is still fairly basic. Today, some customers have progressed to multi-level tree-like structures or virtual networks exploiting high-speed bandwidth backbones to form information buses.

RSSD or Embedded RSSD (eRSSD)

Those familiar with RS from the past have always been aware that the RS required an ASE engine for managing the RSSD. Starting with version 12.6, DBA's now have a choice of using the older ASE-based RSSD implementation or the new embedded RSSD. The eRSSD is an ASA based implementation that offers the following benefits:

• Easier to manage – many of the DBA tasks associated with managing the DBMS for the RSSD have been built in to the RS. This includes:
    o RS will automatically start and stop the eRSSD DBMS.
    o The eRSSD will automatically grow as space is required - a useful feature when doing extensive monitoring using monitor counters.
    o The eRSSD transaction log is automatically managed - eliminating RS crashes due to log suspend, or the dangerous practice of 'truncate log on checkpoint'.
• Reduced impact on smaller single or dual cpu implementations – ASE as a DBMS is tuned to consume every resource it can – and even when not busy, ASE will "spin" looking for work. Consequently, ASE as a RSSD platform can lead to a "heavy" cpu and memory footprint in smaller implementations – robbing memory or cpu resources from the RS itself.
• With RS 15.0, the added capability of routing with an embedded RSSD removes any architectural advantage of using ASE.
• Since an ASA database is bi-endian, migrating RS between different platforms is much simpler than the cross-platform dump/load (XPDL) procedure for ASE (although manual steps may be required in either situation).
• Benchmarks using an eRSSD vs. an RSSD have shown no difference in performance impact. While theoretical design and architectures would allow an ASE system to outscale an ASA-based system, the RS's primary user activity against the RSSD does not reach levels that would distinguish the two.

As a result, the only reason that might tip a DBA towards using ASE for the RSSD in a new installation using RS 15 is simply familiarity. One other difference is that tools and components shipped with ASE - such as the ASE Sybase Central Plug-in - allow DBAs to connect to the ASE RSSD to view objects and data. This is especially useful when wanting to reverse engineer RSSD procedures or quickly view data in one of the tables. The similar Sybase Central ASA plug-in is not shipped with Replication Server. One way of obtaining the same tools is to simply download the SQL Anywhere Developer's Edition, which, as of this writing, is free.

Replication Server Internal Processing

When hearing the term "internal processing", most Replication Server administrators immediately picture the internal threads. While understanding the internal threads is an important fundamental concept, it is strictly the starting point for understanding how Sybase Replication Server processes transactions. Unfortunately, many Replication Server administrators stop there, and as a result never really understand how Replication Server is processing their workload. Consequently, this leaves the administrator ill-equipped to resolve issues and in particular to analyze performance bottlenecks within the distributed system. Details about what is happening within each thread as data is replicated will be discussed in later chapters.

Replication Server Threads

There are several different diagrams that depict the Replication Server internal processing threads. Most of these are extremely similar and only differ in the relationships between the SQM, SQT and dAIO threads. For the sake of this paper, we will be using the following diagram, which is slightly more accurate than those documented in the Replication Server Administration Guide:

Figure 3 – Replication Server Internal Processing Flow Diagram

Replicated transactions flow through the system as follows:

1. Replication Agent forwards logged changes scanned from the transaction log to the Replication Server.
2. Replication Agent User thread functions as a connection manager for the Replication Agent and passes the changes to the SQM. Additionally, it filters and normalizes the replicated transactions according to the replication definitions.
3. The Stable Queue Manager (SQM) writes the logged changes to disk via the operating system's asynchronous I/O routines. The SQM notifies the Asynchronous I/O daemon (dAIO) that it has scheduled an I/O. The dAIO polls the O/S for completion and notifies the SQM that the I/O completed. Once written to disk, the Replication Agent can safely move the secondary truncation point forward (based on the scan_batch_size setting).
4. Transactions from source systems are stored in the inbound queue until a copy has been distributed to all subscribers (outbound queue).
5. The Stable Queue Transaction (SQT) thread requests the next disk block using SQM logic (SQMR) and sorts the transactions into commit order using the 4 lists Open, Closed, Read, and Truncate. Again, the read request is done via async i/o by the SQT's SQM read logic and the SQT is notified by the dAIO when the read has completed.
6. Once the commit record for a transaction has been seen, the transaction is put in the closed list and the SQT alerts the Distributor thread that a transaction is available. The Distributor reads the transaction and determines who is subscribing to it, whether subscription migration is necessary, etc.
7. Once all of the subscribers have been identified, the Distributor thread forwards the transaction to the SQM for the outbound queue for the destination connections. This point in the process serves as the boundary between the inbound connection processing and the outbound connection processing.
8. Similar to the inbound queue, the SQM writes to the queue using the async i/o interface and continues working. The dAIO will notify the SQM when the write has completed.
9. Transactions are stored in the outbound queue until delivered to the destination.
10. The DSI Scheduler uses the SQM library functions (SQMR) to retrieve transactions from the outbound queue, then uses SQT library functions to sort them into commit order (in case of multiple source systems) and determines the delivery strategy (batching, grouping, parallelism, etc.).
11. Once the delivery strategy is determined, the DSI Scheduler then passes the transaction to a DSI Executor.
12. The DSI Executor translates the replicated transaction functions into the destination command language (i.e. Transact SQL) and applies the transaction to the replicated database.

Again, the only differences here vs. the diagrams in the product manuals are the inclusion of the System Table Services (STS), Asynchronous I/O daemon (dAIO), SQT/SQM and queue data flow, and the lack of an SQT thread reading from the outbound queue (instead, the DSI-S is illustrated making SQMR/SQT library calls). While the differences are slight, they are illustrated here for future discussion. Keeping these differences in mind, the reader is referred to the Replication Server System Administration Guide for details of internal processing for replication systems involving routing or Warm Standby.

Replication Server SMP & Internal Threading

In the past, Replication Server was a single process using internal threads for task execution along with kernel threads for asynchronous I/O. Beginning with version 12.5, an SMP version of RS exploiting native OS threads was available via an EBF. Each of the main threads discussed above was implemented as a full native thread, which could run on multiple processors. The SMP capability can be enabled or disabled by configuring the Replication Server. By itself, even without enabling SMP, the native threading improved RS throughput. Version 12.6 improved this by reducing the internal contention of the initial 12.5 implementation – consequently DBA's should consider upgrading to version 12.6 prior to attempting SMP. RS SMP capabilities and their impact on performance are discussed further later in this paper.
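As a hedged sketch (smp_enable is the server-level configuration parameter used for this feature; confirm the name and behavior against the configuration guide for your RS version), enabling SMP is a single configure command issued from isql against the Replication Server:

configure replication server set smp_enable to 'on'
go

The change typically requires a restart of the Replication Server before it takes effect.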

However, one new aspect of this from an internals perspective is that shared resources now require locks or mutexes. In most multi-threaded applications, there are resources – typically memory structures – that are shared among the different threads. For example, in RS, the SQT cache is shared between the SQT thread and an SQT client such as a Distributor thread (this shared cache will be important to understanding the hand-off between DSI-S and DSI-EXEC threads later). To coordinate access to such shared resources (so that one thread does not delete a structure while another is using it – or one thread does not read while another has not finished writing, picking up corrupted values), threads are required to "lock" the resource for their exclusive use – typically by grabbing the mutex that controls access to the resource. In RS 12.5 and earlier non-SMP environments, since the threads were internal to RS and execution could be controlled by the OpenServer scheduler, conflicting access to shared resources could often be avoided simply due to the fact that only one thread would be executing at a time. In RS 12.6 – with or without SMP enabled – the native threading implementation allows the thread execution to be controlled by the OS – consequently mutexes had to be added to several shared resources.

In RS 12.6 and higher, you may sometimes see a state of “Locking Resource” when issuing an admin who command. Grabbing a mutex really does not take but a few milliseconds – unless someone else has it already, at which point the requesting thread is blocked and has to wait. The state of “Locking Resource” corresponds more to this condition – the thread in question is attempting to grab exclusive access to a shared resource and is waiting on another thread to release the mutex. Because mutex allocation is so quick, it is likely that when you see this, RS is undergoing a significant state change – for example switching the active in a Warm Standby.

Inter-Thread Messaging

Additionally, inter-thread communication is not accomplished via a strict synchronous API call. Instead, each thread simply writes a message into one of the target thread's OpenServer message queues (standard OpenServer in-memory message structures for communicating between OpenServer threads) specific to the message type. Once the target thread has processed each message, it can use standard callback routines or put a response message back into a message queue for the sending thread. This resembles the following:

Figure 4 – Replication Server Inter-Thread Communications

Those familiar with multi-threaded programming or OpenServer programming will recognize this as a common technique for communication between threads – especially when multiple threads are trying to communicate with the same destination thread. Accordingly, callbacks are used primarily between threads in which one thread spawned the other and the child thread needs to communicate to the parent thread. An example of this in Replication Server is the DIST and SQT threads. The SQT thread for any primary database is started by the DIST thread. Consequently, in addition to using message queues, the SQT and DIST threads can communicate using Callback routines.

Note that the message queues are not really tied to a specific thread - but rather to a specific message type. As a result, a single thread may be putting/retrieving messages from multiple message queues. Consequently, it is possible to have more message queues than threads, although the current design for Replication Server doesn't require such. By now, those familiar with many of the Replication Server configuration parameters will have realized the relationship between several fairly crucial configuration parameters: num_threads, num_msgqueues and num_msgs (especially why num_msgs could be a large multiple of num_msgqueues). Since this section was strictly intended to give you a background in Replication Server internals, the specifics of this relationship will be discussed later in the section on Replication Server tuning.
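For illustration only – the values below are arbitrary and appropriate sizing is covered in the tuning section later – these are ordinary Replication Server configuration parameters set with the configure replication server command (and, like most server-level parameters, they take effect only after a restart):

configure replication server set num_threads to '75'
go
configure replication server set num_msgqueues to '200'
go
configure replication server set num_msgs to '4000'
go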

OQID Processing

One of the more central concepts behind Replication Server recovery is the OQID – Origin Queue Identifier. The OQID is used for duplicate and loss detection as well as determining where to restart applying transactions during recovery. The OQID is generated by the Replication Agent when scanning the transaction log from the source system. Because the OQID contains log-specific information, the OQID format is dependent upon the source system. For Sybase ASE, the OQID is a 36-byte binary value composed of the following elements:

Byte     Contents
1-2      Database generation id (from dbcc gettrunc())
3-8      Log page timestamp
9-14     Log page rowid (rid)
15-20    Log page rid for the oldest transaction
21-28    Datetime for oldest transaction
29-30    Used by RepAgent to delete orphaned transactions
31-32    Unused
33-34    Appended by TD for uniqueness
35-36    Appended by MD for uniqueness

Through the use of the database generation id, log page timestamp and log record row id (rid), ASE guarantees that the OQID is always increasing sequentially. As a result, any time the RS detects an OQID lower than the last one, it can somewhat safely assume that it is a duplicate. Similarly at the replicate, when the DSI compares the OQID in the rs_lastcommit table with the current one in the active segment, it can detect if the transaction has already been applied.
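As an illustrative sketch only, the byte layout above can be examined directly from the rs_lastcommit table in a replicate database, whose origin_qid column holds the last OQID applied for each origin; the origin value 101 below is hypothetical:

select origin,
    db_generation    = substring(origin_qid, 1, 2),
    log_page_ts      = substring(origin_qid, 3, 6),
    log_rid          = substring(origin_qid, 9, 6),
    oldest_xact_rid  = substring(origin_qid, 15, 6),
    oldest_xact_time = substring(origin_qid, 21, 8)
from rs_lastcommit
where origin = 101
go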


Why would there be duplicates?? Simply because the Replication Server isn’t updating the RSSD or the rs_lastcommit table with every replicated row. Instead, it is updating every so often after a batch of transactions has been applied. Should the system be halted mid-batch and then restarted, it is possible that the first several have already been applied. At the replicate, a similar situation occurs in that the Replication Server begins by looking at the oldest active segment in the queue – which may contain transactions already applied.

Note that the oldest open transaction position is also part of the ASE OQID. This is deliberate. Since the Replication Agent could be scanning past the primary truncation point and up to the end of the log, the oldest open transaction position is necessary for recovery. As discussed later, the ASE Rep Agent does not actually ever read the secondary truncation point. Consequently, if the replication system is shut down, the Replication Agent may have to restart at the point of the oldest open transaction and rescan to ensure that nothing is missed.

For heterogeneous systems, the database generation (bytes 1-2) and the RS managed bytes (33-36) are the same, however the other components depend on what may be available to the replication agent to construct the OQID. This may include system transaction id’s or other system generated information that uniquely identifies each transaction to the Replication Agent.

An important aspect of the OQID is the fact that each replicated row from a source system is associated with only one OQID and vice versa. This is key to not only identifying duplicates for recovery after a failure (i.e. network outage), but also in replication routing. From this aspect, the OQID ensures that only a single copy of a message is delivered in the event that the routing topology changes. Those familiar with creating intermediate replication routes and concept of logical network topology provided by the intermediate routing capability will recognize the benefit of this behavior.

The danger is that some people have attempted to use the OQID or origin commit time in the rs_lastcommit table for timing. This is extremely inaccurate. First, the origin commit time comes from the timestamp in the commit record (a specific record in the transaction log) on the primary. This time is derived from the dataserver's clock, which is synched with the system clock about once per minute. There can be drift obviously, but not more than a minute as it is re-synched each minute. The dest_commit time in the rs_lastcommit table, on the other hand, comes from the getdate() function call in rs_update_lastcommit. The getdate() function is a direct poll of the system clock on the replicate. The resulting difference between the two could be quite large in one sense or even negative if the replicate's clock was slow. In any case, since transactions are grouped when delivered via RS (a topic for later), the rs_lastcommit commit time is for the last command in the batch – and not necessarily the command you issued for your test. Additionally, as we will see later, if the last command was a long running procedure, it may appear to be worse than it is. On the other hand, much like network packeting, the Replication Agent and Replication Server both have deliberate delays built in when only a small number of records are received. This 'pause' is built in so that subsequent transactions can be batched into the buffer for similar processing. Those familiar with TCP programming will recognize this buffering as similar to the delay that is disabled by enabling TCP_NO_DELAY as well as other O/S parameters such as tcp_deferred_ack_interval on Sun Solaris.

The best mechanism for determining latency is to simply run a batch of 1,000 normal business transactions (which can be simulated with atomic inserts spread across the hot tables) into the primary and monitor the end time at the primary and at the replicate. For large sets of transactions, obviously a stop watch is not even necessary. If the Replication Server is keeping the system current to the point that a stop watch would be necessary, then you don't have a latency problem. If, however, it finishes at the primary in 1 minute and at the replicate in 5 minutes – then you have a problem – maybe....
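A minimal sketch of such a test follows; the table name latency_test and the row count are purely illustrative, and the table must of course be marked for replication (or reside in a Warm Standby database) for the test to mean anything:

-- run at the primary: 1,000 small atomic inserts
create table latency_test (row_id int primary key, primary_time datetime)
go
declare @i int
select @i = 1
while @i <= 1000
begin
    insert latency_test (row_id, primary_time) values (@i, getdate())
    select @i = @i + 1
end
go
-- at the replicate: poll until all 1,000 rows have arrived, then compare the
-- wall-clock completion time here with the completion time at the primary
select count(*) as rows_arrived, max(primary_time) as last_primary_insert
from latency_test
go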

Analyzing Replication System Performance

Having set the stage, the rest of this document will be divided into sections detailing how these components work in relation to possible performance issues. The major sections will be:

• Primary Dataserver/Database
• Replication Agent Processing
• Replication Server and RSSD General Tuning
• Inbound Processing
• Outbound Queue Processing
• Replicate Dataserver/Database

After these sections have been covered in some detail, this document will then cover several special topics related to DSI processing in more detail. This includes:

• Procedure Replication
• Replication Routes
• Parallel DSI Performance
• Text/Image Replication
• Asynchronous Request Functions
• Multiple DSI's
• Integration with EAI


Primary Dataserver/Database

It is Not Possible to Tune a Bad Design. The above comment is the ninth principle of the "Principles of OLTP Processing" as stated by Nancy Mullen of Andersen Consulting (now Accenture?) in her paper OLTP Program Design in the OLTP Processing Handbook (McGraw-Hill). A truer statement has never been written. Not only can replication not fix a bad design, but in most cases a bad design will also cause replication performance to suffer. In many cases when replication performance is bad, we tend to focus quickly at the replicate. While it is true that many replication performance problems can be resolved there, the primary database often also plays a significant role. In fact, implementing database replication or other forms of distributing database information (messaging, synchronization, etc.) will quickly point to significant flaws in the primary database design or implementation, including:

• Poor transaction management, particularly with stored procedures, batch processes.
• Single threaded batch processes. While they may "work", they are not scalable.
• High-impact SQL statements - such as a single update or delete statement that affects a large number of rows (>10,000).
• Inappropriate design for a distributed environment (heavy reliance on sequential or pseudo keys)
• Improper implementation of relational concepts (i.e. lack of primary keys, duplicate rows, etc.)

Note that all of these cause problems in a distributed environment – whether using Replication Server or MQSeries messaging. However, the proper design of a database system for distributed environments is beyond the scope of this paper. In this section, we will begin with basic configuration issues and then move into some of the more problematic design issues that affect replication performance.

Dataserver Configuration Parameters

While Sybase has striven (with some success) to make replication transparent to the application, it is not transparent to the database server. In addition to the Replication Agent Thread (even though significantly better than the older LTM’s as far as impact on the dataserver), replication can impact system administration in many ways. One of those ways is proper tuning of the database engine’s system configuration settings. Several settings that would not normally be associated with replication, nonetheless, have a direct impact on the performance of the Replication Agent or in processing transactions within the Replication Server.

Procedure Cache Sizing

A common misconception is that procedure cache is strictly used for caching procedure query plans. However, in recent years, this has changed. The reason is that in most large production systems, the procedure cache was grossly oversized, consequently underutilized, and contributed to the lack of resources for data cache. For example, in a system with 2GB of memory dedicated to the database engine, the default of 20% often meant that ~400MB of memory was being reserved for procedure cache. Often, the real procedure cache used by stored procedure plans is less than 10MB. ASE engineers began tapping into this resource by caching subquery results, sort buffers, etc. in procedure cache. When the Replication Agent thread was internalized within the ASE engine (ASE 11.5), it was no different. It also used procedure cache. Later releases of ASE (from ASE 12.0) have moved this requirement from procedure cache to additional memory grabbed at startup, similar to additional network memory. Consequently, if using ASE 12.5, this may not be as great a problem as with ASE 11.9.2 or earlier.

The Replication Agent uses memory for several critical functions:

Schema Cache - Caching for database object structures, such as table, column names, text/image replication states, used in the construction of LTL.

Transaction Cache - Caching LTL statements pending transfer to the Replication Server

As a result, system administrators who have tuned the procedure cache to the minimal levels prior to implementing replication may need to increase it slightly to accommodate Replication Agent usage if using an earlier release of ASE. You can see how much memory a Replication Agent is using via the 9204 trace flag (additional information on enabling/disabling Replication Agent trace flags is located in the Replication Agent section).

sp_config_rep_agent <db_name>, "trace_log_file", "<filepathname>"
sp_config_rep_agent <db_name>, "traceon", "9204"
-- monitor for a few minutes
sp_config_rep_agent <db_name>, "traceoff", "9204"


Generally speaking, the Replication Agent's memory requirements will be less than a normal server's metadata cache requirements for system objects (sysobjects, syscolumns, etc.). A rule of thumb when sizing a new system for replication might be to use the metadata cache requirements as a starting point.

Metadata Cache

The metadata cache itself is important to replication performance. As will be discussed later, as the Replication Agent reads a row from the transaction log, it needs access to the object’s metadata structures. If forced to read this from disk, the Replication Agent processing will be slowed while waiting for the disk I/O to complete. Careful monitoring of the metadata cache via sp_sysmon during periods of peak performance will allow system administrators to size the metadata cache configurations appropriately.
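A hedged sketch of this monitoring follows (the configuration values shown are illustrative, not recommendations). The relevant counters appear in the "Metadata Cache Management" section of the sp_sysmon report; if object or index descriptors show reuse during peak periods, the corresponding pools can be raised:

sp_sysmon "00:05:00"
go
sp_configure "number of open objects", 5000
go
sp_configure "number of open indexes", 5000
go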

User Log Cache (ULC)

User (or Private) Log Cache was implemented in Sybase SQL Server 11.0 as a means of reducing transaction log semaphore contention and the number of times that the same log page was written to disk. In theory, a properly sized ULC would mean that only when a transaction was committed would the records be written to the physical transaction log. One aspect of this that could have had a large impact on the performance of Replication Server was that this would mean that a single transaction's log records would be contiguous on disk vs. interspersed with other users' transactions. This would significantly reduce the amount of sorting that the SQT thread would have to do within the Replication Server.

However, in order to ensure low latency and due to an Operating System I/O flushing problem, a decision was made in the design of SQL Server 11.x that if the OSTAT_REPLICATED flag was on, the ULC would be flushed much more frequently than normal. In fact, in some cases, the system behaves as if it did not have any ULC. As one would suspect, this can lead to higher transaction log contention as well as negating the potential benefit to the SQT thread. Over the years, Operating Systems have matured considerably, eliminating the primary cause and hence the need for this. In ASE 12.5, this ULC flush was removed, but as of this writing not enough statistics are available to tell how much of a positive impact this has on throughput by reducing the SQT workload. One reason is that it is extremely rare that the SQT workload is the performance bottleneck.

Primary Database Transaction Log

As you would assume, the primary transaction log plays an integral role in replication performance, particularly the speed at which the Replication Agent can read and forward transactions to the Replication Server.

Physical Location

The physical location of the transaction log plays a part in both database performance and replication performance. The faster the device, the quicker the Replication Agent will be able to scan the transaction log on startup, during recovery, and during processing when physical i/o is required. Some installations have opted to use Solid State Disks (SSD's) as transaction log devices to reduce user transaction times, etc. While such devices would help the Replication Agent, if resources are limited, a good RAID-based log device will be sufficient, freeing the SSD to be used as a stable device or for other general server performance requirements (such as tempdb).

Named Cache Usage

Along with log I/O sizing, binding the transaction log to a named cache can have significant performance benefits. The reason stems from the fact that the Replication Agent cannot read a log page until it has been flushed to disk. While this does happen immediately after the page is full due to recovery reasons, if a named cache is available, the probability is much higher that the Replication Agent can read the log from memory vs. disk. If forced to read from disk, the Replication Agent performance may drop to as low as 1GB/hr.

A word of caution. While it may be tempting to simply allocate a small 4K pool in an existing cache, the best configuration is a separate dedicated log cache with all but 1MB allocated to 4K buffer pools. For example, a 50MB dedicated log cache would have 49MB of 4K buffers and 1MB of 2K buffers. The reason is that if the named cache is for mixed use (log and data), more than likely other buffer pools larger than 4K have been established. In the Adaptive Server Enterprise Monitor Historical Server User's Guide, a little-known fact is stated: "Regardless of how many buffer pools are configured in a named data cache, Adaptive Server only uses two of them. It uses the 2K buffer pool and the pool configured with the largest-sized buffers." While the intention may have been that the largest-sized buffers were used, experience monitoring production systems suggests that in some cases it is instead the buffer pool with the largest buffer space, while in others the server appears to use different pools almost exclusively for different periods of time. Unfortunately, some DBA's simply assume that any 4KB I/O's must be the transaction log, when it could be query activity – counters available through sp_sysmon do not differentiate log I/O from data pages. Rather than trying to second-guess this, it is much simpler to restrict any named cache to only 2 sizes of buffer pools and use a dedicated log cache for this purpose.

In cases where the RepAgent was lagging, enabling a separate log cache has given customers an immediate 100% improvement in Replication Agent throughput, as long as the RepAgent stayed within the log cache region.
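For reference, a minimal sketch of such a dedicated log cache follows. The cache name (log_cache), database name (primedb) and sizes are hypothetical, and it assumes a 2K server page size; verify the exact syntax, restart requirements, and the single-user requirement for binding syslogs against your ASE version's documentation.

-- Create a dedicated, log-only named cache (size is illustrative; may require a restart
-- on older ASE versions where cache creation is static)
exec sp_cacheconfig 'log_cache', '50M', 'logonly'
go
-- Move all but 1MB into a 4K buffer pool; the remaining 1MB stays in the 2K pool
exec sp_poolconfig 'log_cache', '49M', '4K'
go
-- Bind the primary database transaction log (syslogs) to the cache
-- (binding syslogs typically requires the database to be in single-user mode)
use primedb
go
exec sp_bindcache 'log_cache', 'primedb', 'syslogs'
go
-- Match the log I/O size to the 4K pool
exec sp_logiosize '4'
go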

Application/Database Design

While the above configuration settings can help reduce performance degradation, undoubtedly the best way to improve replication performance from the primary database perspective is the application or primary database design itself.

Chained Mode Transactions

In chained mode, all data retrieval and modification commands (delete, insert, open, fetch, select, and update) implicitly begin a transaction. The biggest impact on RS is from the implicit transactions that result from select statements – which in most applications account for 75-80% of all activity in a DBMS. Simple transactions that only involve queries vs. DML operations result in empty transactions, which are committed as usual. Some might think that the User Log Cache would filter these empty transactions from ever reaching the transaction log; however, since the transactions are committed vs. rolled back, these empty transactions are instead flushed to the transaction log. Besides the obvious negative impact on application performance, they have a negative impact on replication as well, as these empty transactions are forwarded to the Replication Server.

Earlier versions of Replication Server would filter these empty transactions at the DSI thread due to the way transaction grouping works. Newer versions of Replication Server have reduced the impact by removing empty transactions earlier – those from chained transactions as well as system transactions such as reorgs. In ASE 12.5.2, the replication agent has been improved to eliminate the empty transactions from system transactions, however, user actions that result in empty transactions will still result in empty begin/commit pairs sent to the RS. As a result, an application that uses chained mode will degrade Replication Agent throughput as well as increase the processing requirements for Replication Server.
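The effect is easy to see with a short sketch (table and column names borrowed from pubs2; exact behavior may vary slightly by ASE version):

use pubs2
go
-- In chained mode, even a read-only batch implicitly opens a transaction
set chained on
go
select title from titles where price > $20
commit tran     -- commits an "empty" transaction: no DML occurred, yet a
                -- begin/commit pair is still flushed to the log and forwarded
go
-- In unchained (default) mode the select does not open an implicit transaction
set chained off
go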

Multiple Physical Databases

One of the most frequent complaints is that the Replication Agent is not reading the transaction log fast enough, prompting calls for the ability to have more than one Replication Agent per log or a multi-threaded Replication Agent vs. the current threading model. Although sometimes this can be alleviated by properly tuning the Replication Agent thread, adjusting the above configuration settings, etc., there is a point where the Replication Agent is simply not able to keep up with the logged activity. A classic case of this can be witnessed during large bcp operations (100,000 or more rows) in which the overhead of constructing LTL for each row is significant enough to cause the Replication Agent to begin to lag behind. With the exception of bulk operations, whenever normal OLTP processing causes the Replication Agent to lag behind, the most frequent cause is the failure on the part of the database designers to consider splitting the logical database into two or more physical databases based on logical data groups.

Consider, for example, the mythical pubs2 application. Purportedly, it is a database meant to track the sales of books to stores from a warehouse. Let's assume that 80% of the transactions are store orders. That means the other 20% of the transactions are administering the lists of authors, books, book prices, etc. If maintained in the same database, this extra 20% of the transactions could be just enough to cause a single Replication Agent to lag behind the transaction logging. And yet, what would be lost by separating the database into two physical databases – one containing the authors, books, stores and other fairly static information, while the other functions strictly as the sales order processing database? The answer is not much. While some would say that it would involve cross-database write operations, the real answer is: not really. Appropriately designed, new authors, books and even stores would be entered into the system outside the scope of the transaction recording book sales. Cross-database referential integrity would be required (for which a trigger vs. declarative integrity may be more appropriate), but even this does not pose a recovery issue except to academics. The real crux of the matter is: is it more important to have a record of a sale to a store in the dependent database even if the parent store record is lost due to recovery, or is it more important to enforce referential integrity at all points and force recovery of both systems?? Obviously, the former is better.

As a result, it makes sense to separate a logical database into several physical databases for the following types of data groupings:

• Application object metadata such as menu lists, control states, etc.
• Application driven security implementations (screen navigation permissions, etc.)
• Static information such as tangible business objects including part lists, suppliers, etc.
• Business event data such as sales records, shipment tracking events, etc.


• One-up/sequential key tables used to generate sequential numbers

Not only does this naturally lend itself to the beginnings of shareable data segments reusable by many applications, it also increases the degree of parallelism on the inbound side of Replication Server processing.

The last item might catch many people by surprise and immediately generate cautions about cross-database transactions. First of all, under any recovery scenario, either the correct next value can be determined by scanning the real data, or the gap of missing rows can be determined from the key table. This last point is important from a different perspective. Now, consider replication. By placing the one-up key tables in a separate database, they effectively have a dedicated Replication Agent – and a simple path through the Replication Server. As a result, one-up/sequential key tables will have considerably less latency than the main data tables. Consequently, during a Warm Standby failure, not only is it less likely that any of these transactions were stranded, but the number of real transactions stranded may be determined with more accuracy – and the associated key sequences preserved.
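A minimal sketch of the idea, using hypothetical database and table names (keydb, one_up_keys, orders), might look like this:

-- Hypothetical database dedicated to one-up/sequential key tables
use keydb
go
create table one_up_keys (
    table_name varchar(30) not null primary key,
    next_key   int         not null
)
go
-- From the order-processing database: reserve the next key, then use it
declare @next_key int
begin tran
    update keydb..one_up_keys
       set next_key = next_key + 1
     where table_name = 'orders'
    select @next_key = next_key
      from keydb..one_up_keys
     where table_name = 'orders'
commit tran
-- @next_key is then used for the insert into the main sales database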

In addition, in some cases splitting a database can be highly recommended for other reasons. Consider the common problem of databases containing large text or image objects. As will be illustrated later, text/image or other types of BLOBs can significantly slow Rep Agent performance due to having to also scan the text chains – a slow process in any event. It is probably advisable to put such tables in a separate database with a view in the original for application transparency purposes. The reasons for this are:

• Enable multiple Replication Agents to work in parallel – in effect, dedicating one to reading text data
• Enable a separate physical connection at the replicate to write the data – improving overall throughput, as non-textual data is not delayed while text or image data is processed by the DSI thread.
• Improve overall application/database recoverability.

The first two are obvious solutions to replication performance degradation as a result of text processing. The last is not so obvious. However, consider the following:

• Text/Image data is typically static. Once inserted, it is rarely updated and the most common write activity post-insert will be a delete operation performed during archival.

• To avoid transaction log issues with text/image, most applications will use minimally logged functions such as writetext (or the CT-Library equivalent ct_send_data() function) to insert the text.

As an example, consider the types of data that you may be storing in a text or image column. Some financial institutions store loan applicant credit reports as text datatypes (although not recommended). Other organizations will frequently store customer emails, digitized applications containing signatures, or other infrequently accessed reference data.

So how does a separate database improve recoverability? First, anytime a minimally logged function is executed in a database, the ability to perform transaction log dumps is voided. Consequently, databases containing text/image data often must be backed up using full database dumps. For any large database, this will require significant time to perform – depending on the quantity and speed of backup devices. By separating the text/image data, the primary data related to business processing can support transaction log dumps allowing up to the minute recovery as well as be brought online faster after a system shutdown.
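A sketch of this separation follows, with hypothetical names (blobdb, maindb, customer_emails); it assumes the application reads the data through the view rather than directly:

-- The text data lives in its own database, which can be dumped on its own schedule
use blobdb
go
create table customer_emails (
    email_id   int  not null primary key,
    email_body text null
)
go
-- The original database keeps a view so application code is unchanged
use maindb
go
create view customer_emails
as
    select email_id, email_body
      from blobdb..customer_emails
go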

Avoid Unnecessary BLOBs

The handling of BLOB (text/image) data is becoming more of a problem today as application developers faced with storing XML messages in the database often choose to store the entire message as a BLOB datatype (image for Sybase if using native XML indexing). In most cases, storing structured data in a BLOB datatype is actually orders of magnitude less efficient for the application. For instance, consider the "credit report" instance alluded to earlier. If a person's credit report is stored as a single text datatype, the application must then perform the parsing to determine such items as the credit score, the number of open charge accounts, the number of delinquent payments, etc. In addition, annotations about a specific charge are difficult to record. For example, if applying for a mortgage, an applicant may be required to explain late payments to a specific credit account. Stored as a text datatype, it would be difficult to link the applicant's rebuttal (which would be a good use of text) with the specific account. Additionally, it can detract from the business's ability to perform business analysis functions critical to profitability. For example, a common requirement may be to determine the number of credit accounts and balances with any reported late payments for customers who are late in paying their current bill. This might allow a bank to reduce its risk of exposure either dynamically or avoid it altogether by refusing credit to someone whose profile would suggest a greater chance of defaulting on the loan.

The point of this discussion is not to discourage storing XML documents when necessary – in fact, storing the credit report as an entire entity might be needful, particularly if exchanging it with other entities. However, the tendency of some is to think of the RDBMS as a big bit bucket in which to store all of their data as "objects" in XML format, without recognizing the futility of doing so.

Similarly, XML is mainly an application-layer interchange format. While serving an extremely useful purpose in providing the means to communicate with other systems, it can seriously degrade overall application performance if XML messages are stored as a single text datatype. For example, if a cargo airplane's schedule and load manifest were stored in XML format as a text datatype, the business's routing/scheduling and in-transit visibility functions would be extremely hampered. Questions such as whether ground facility capacity had been exceeded, re-routing of shipments due to delays, or even the location of specific shipments would require the XML document to be parsed. While doable – and text indexing/XML indexing may assist in some efforts (i.e. finding shipments) – such operations often require the retrieval of a large number of data values and subsequent parsing to find the desired information. Consider the query "What scheduled flights or delayed flights are scheduled to arrive in the next 1 hour?"

Transaction Processing

After the physical database design itself, the next largest contributor is how the application processes transactions. An inefficient application not only increases the I/O requirements of the primary database, it also can significantly degrade replication performance. Several of the more common inefficiencies are discussed below.

Avoid Repeated Row Re-Writes

One of the more common problems brought about by forms-based computing is that the same row of data may be inserted and then repeatedly updated by the same user during the same session. A classic scenario is filling out a loan application or other multi-part application process. A second common scenario is one in which fields in the "record" are filled out by database triggers, including user auditing information (last_update_user), order totals, etc. While some of this is unavoidable to ensure business requirements are met, it may add extra work to the replication process. Consider the following mortgage application scenario:

1. User inserts basic loan applicant name, address information.
2. As the user transitions to the next screen for property info, the info is saved to the database.
3. User adds the property information (stored in the same database table).
4. As the user transitions to the next screen, the property information is saved to the database.
5. User adds dependent information (stored in the same table in denormalized form).
6. User hits save before entering credit info (not stored in the same table).

Just considering the above scenario, the following database write operations would be initiated by the application:

insert loan_application (name, address)
update loan_application (property info)
update loan_application (dependent info)

Now, consider the actual I/O costs if the database table had a trigger that recorded the last user and datetime that the record was last updated.

insert loan_application (name, address)
update loan_application (lastuser, lastdate)
update loan_application (property info)
update loan_application (lastuser, lastdate)
update loan_application (dependent info)
update loan_application (lastuser, lastdate)

As a result, instead of a single record, the Replication Agent must process 6 records – each of which will incur the same LTL translation, Replication Server normalization/distribution/subscription processing, etc. On top of which, consider what happens at the replicate (if triggers are not turned off for the connection) – each replicated statement also fires the local trigger, doubling the number of operations:

insert loan_application (name, address)
update loan_application (lastuser, lastdate)
update loan_application (lastuser, lastdate)
update loan_application (lastuser, lastdate)
update loan_application (property info)
update loan_application (lastuser, lastdate)
update loan_application (lastuser, lastdate)
update loan_application (lastuser, lastdate)
update loan_application (dependent info)
update loan_application (lastuser, lastdate)
update loan_application (lastuser, lastdate)
update loan_application (lastuser, lastdate)

Some may question the reality of such an example. It is real. While remaining unnamed, one of Sybase’s mortgage banking customers had a table containing 65 columns requiring 8-10 application screens before completely filled out.


After each screen, rather than filling out a structure/object in memory, each screen saved the data to the database. During normal database processing, this led to an extremely high amount of contention within the table made worse by the continual page splitting to accommodate the increasing row size. Replication was enabled in a Warm-Standby configuration for availability purposes. Although successful, you can guess the performance implications within Replication Server from such a design.
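As noted above, much of the pain at the replicate disappears if the replicated updates do not re-fire the trigger there. The usual way to do that is the dsi_keep_triggers connection setting, sketched below in RCL; the connection name is hypothetical, and the setting already defaults to 'off' for Warm Standby connections.

-- Run against the Replication Server (RCL)
suspend connection to REP_DS.loandb
go
alter connection to REP_DS.loandb
    set dsi_keep_triggers to 'off'
go
resume connection to REP_DS.loandb
go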

Understanding Batch Processing

Most typical batch processes involve one of the following types of scenarios:

• Bulkcopy (bcp) of data from a flat file into a production table. This is more common than it should be as bcp-ing data is inherently problem-prone.

• Bulk SQL statement via insert/select or massive update or delete statement.
• A single or multiple stream of individual atomic SQL statements affecting one row each. This is extremely rare and usually only present in extremely high OLTP systems where contention avoidance is paramount.

The last one typically is not a problem for replicated systems; however, the first two are – and it has nothing to do with Replication Server. The simple fact of the matter is that any batch SQL statement logs each row individually in the transaction log. Consequently, any distributed system is left with the unenviable task of moving the individual statements en masse (and frequently as one large transaction).

So, what’s the problem with this? The problem is the dismal performance of executing atomic SQL statements vs. bulk SQL statements. Consider what happens for each SQL statement as it hits ASE:

• SQL statement is parsed by the language processor
• SQL statement is normalized and optimized
• SQL is executed
• Task is put to sleep pending lock acquisition and logical or physical I/O
• Task is put back on the runnable queue when I/O returns
• Task commits (writes commit record to transaction log)
• Task is put to sleep pending log write
• Task sends return status to client

When this much overhead is executed for every row affected in a batch process, the process slows to a crawl. This can be seen in the following graph, which compares a straight bcp in, a bcp in using a batch size of 100, an insert/select statement, and atomic inserts grouped in batches of 100 – in an unreplicated system.

[Figure 5 chart: Batch Insert Speeds – elapsed seconds (0–800) vs. rows inserted (0–250,000) for bcp in, bcp -b100, insert/select, and 100 grouped inserts]

Figure 5 – Non-replicated Batch Insert Speeds on single CPU/NT


The above test was run on a small NT system, however, the relative difference holds. Notice that the results are fairly linear and show a marked difference between the grouped atomic inserts and any of the bulk statements (a factor of 700%).

So why is this important? One of the biggest causes of latency within a replicated environment is bulk SQL operations during batch processing – in particular, high-impact update and delete statements. In these cases, a single update or delete operation could easily affect hundreds of thousands of rows. If you think about what was mentioned earlier, the primary ASE can execute the batch SQL along the performance lines indicated above – easily completing 250,000 rows in less than 2 minutes. Note that in the cases of the bcp or the single large insert/select, the parse, compile and optimize steps are either eliminated or only executed once. The problem is that all that is in the transaction log is the 250,000 row images – not the SQL statement that caused them. As a result, the replicate system unfortunately has to follow the atomic SQL statement route – and suffers mightily as it attempts to execute 250,000 individual inserts. Using the above as an indication, since RS is sending individual inserts, the best it could hope for would be 12 minutes of execution instead of 1.5 – however, even this is not attainable, as it is unlikely that RS could group 100 inserts into a single batch (as we will see later, it is limited to 50 statements per batch). The problem is that a typical batch process may contain dozens to hundreds of such bulk SQL statements – each one compounding the problem.

To see the impact of this in real life, a recent test with a common financial trading package that had a single delete of ~800,000 rows showed the following statistics (over several executions):

Component                                 Rows/Min     Latency
Primary ASE (single delete stmt)          800,000      N/A
Rep Agent -> RS (inbound queue)           120,000      7-12 min
Inbound queue -> outbound queue           180,000      5-7 min
DSI -> replicate ASE                       15,000      53 min

It is extremely important to realize that it is not the Replication Server that cannot achieve the throughput – rather, it is the inability of the target dataserver to process each statement quickly enough that causes the latency. This leads to the first key concept, which is indisputable yet widely disbelieved, as so many are quick to blame RS for the latency:

Key Concept #1: Replication Server with a single DSI/single transaction will be limited in its ability to achieve any real throughput by the replicate data server’s (DBMS) performance. Beyond that point, Parallel DSI’s and smaller transactions must be used to avoid latency.

It was interesting to note that while the financial package used a single delete statement to remove the rows, it then re-populated the table using inserts of 1,000 rows at a time as atomic transactions. At this point, with parallel DSI's, RS was able to execute the same volume of inserts and achieve the same throughput. Had the delete (above) not been clogging the system, there would have been near-zero latency for the inserts.

To further illustrate that this is not just a Replication Server issue, consider the typical messaging implementation: a message table is populated within ASE (similar to the transaction log), the message agent (such as TIBCO’s ADB) polls the messages from this table (similar to the RepAgent), the message bus stores the messages to disk (if durable messaging is used), and finally the message system applies the data as SQL statements to the destination system. If the messaging system treats each transaction as a singular message to maintain transactional consistency, it would have the same problem as RS - slow execution by the target server. Only if transactional consistency is ignored and the messages applied in parallel could the problem be overcome.
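For illustration, a hedged RCL sketch of Key Concept #1 in practice – enabling Parallel DSI's on the replicate connection. The connection name is hypothetical, and the individual parameters are discussed later in the Parallel DSI tuning section; verify names and defaults against your RS version.

suspend connection to REP_DS.tradedb
go
-- Shorthand that turns on a default parallel DSI configuration
alter connection to REP_DS.tradedb set parallel_dsi to 'on'
go
-- ...or set the individual knobs explicitly
alter connection to REP_DS.tradedb set dsi_num_threads to '10'
go
alter connection to REP_DS.tradedb set dsi_num_large_xact_threads to '2'
go
resume connection to REP_DS.tradedb
go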

Batch Process/Bulkcopy Concurrency

In some cases, the lack of concurrency at the primary translates directly into replication performance problems at the replicate. Consider, for example, the ever-common bulkcopy problem. "Net gouge" for years has stated that during slow bcp, the bcp utility translates the rows of data into individual insert statements. Consequently, people find it surprising that Replication Server has difficulty keeping up. In the first place, the premise is false. While slow bcp is an order of magnitude slower than fast bcp, it is still a bulk operation and consequently does not validate user-defined datatypes, declarative referential integrity, or check constraints, nor does it fire triggers. In fact, the only difference between "slow" bcp and "fast" bcp is that the individual inserted rows are logged for "slow" bcp, whereas in "fast" bcp only the space allocations are logged. As a result, it is of course still several orders of magnitude faster than the individual insert statements that Replication Server will use at the replicate. This is clearly illustrated above in the insert batch test (figure 5), as the bcp in this case was a "slow" bcp – hence the comparable performance of the insert/select (which would log each row as well).

Typical Batch Scenario

Now, consider the scenario of a nightly batch load of three tables. If bcp’d sequentially using slow bcp, it may take 1-2 hours to load the data. Unfortunately, when replication is implemented, the batch process at the replicate requires 8-10 hours to complete, exceeding the time requirements and possibly encroaching on the business day. Checking the replicated database during this time shows extremely little CPU or I/O utilization and the maintenance user process busy only a fraction of the time. All the normal “things” are tried and even parallel DSI’s are implemented – all to no avail. Customer decides that Replication Server just can’t keep up.

The reality of the above scenario is that several problems contributed to the poor performance:

• The bcp probably did not use batching (-b option), so the data was loaded in a single transaction. As a result, the Replication Server could only ever use a single DSI, no matter how many were configured, as it had to apply it as a single transaction.

• Further, it would be held in the inbound queue until the commit record was seen by the SQT thread – as a large transaction, this may incur multiple scans of the inbound queue to recreate the transaction records due to filling the SQT cache.

• Lack of a batch size in the bcp (-b option) more than likely drove Replication Server to use large transaction threads – while this may have reduced the overall latency in one area by not having to wait for the DSI to see the commit record, it also meant that Replication Server could only use the small number of threads reserved for large transactions.

• Replication Agent probably was not tuned (batching and ltl_batch_size) as will be discussed in the next section.

• Even if bcp batching were enabled, by sequentially loading the tables, concurrent DSI threads would suffer a high probability of contention, especially on heap tables or indexes – due to working on a single table. If attempting to use parallel DSI’s, this will force the use of the less efficient default serialization method of “wait_for_commit”.

Some of the above will be addressed in the section specific to Parallel DSI tuning, however, it should be easy to see how the Replication Server lagged behind. It also illustrates a very key concept:

Key Concept #2: The key to understanding Replication Server performance is understanding how the entire Replication System is processing your transaction.

Batch Scenario with Parallelism

Now, consider what would likely happen if the following scenario was followed for the three tables:

• All three tables were bcp'd concurrently using a batch size of 100.
• Replication Server was tuned to recognize 1,000 statements as a large transaction vs. 100 (see the configuration sketch after this list).
• Replication Agent was tuned appropriately.
• DOL/RLL locking at the replicate database.
• DSI serialization was set to "wait_for_start" (see Parallel DSI tuning section).
• Optionally, tables partitioned (although not necessary for performance gains – if partitioned, DOL/RLL is a must).
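A sketch of the Replication Server side of this scenario follows; the connection name is hypothetical, and each bcp session itself would simply add -b 100 to its command line. Verify the parameter names and values against your RS version.

suspend connection to REP_DS.salesdb
go
-- Treat only transactions of 1,000+ statements as "large" (the default is 100)
alter connection to REP_DS.salesdb set dsi_large_xact_size to '1000'
go
-- Serialization method referenced above; see the Parallel DSI tuning section
alter connection to REP_DS.salesdb set dsi_serialization_method to 'wait_for_start'
go
resume connection to REP_DS.salesdb
go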

Would the SQT cache size fill? Probably not. Would the Parallel DSI’s be used/effective? Most assuredly. Would Replication Server keep up? It probably would still lag, but not as much. At the primary, it now may take only 2 hours to load the data (arguably less if not batching) and 3 hours at the replicate. In fact, as noted earlier in the financial trading system example, an insert of ~800,000 rows in 1,000 row transactions executed using 10 parallel DSI’s completed at the replicate in the same amount of time as it took to execute at the primary - any latency would be simply due to the RS processing overhead.

The same scenario is evident in purge operations. Typically, a single purge script begins by deleting masses of records using SQL joins to determine which rows can be removed. The problem, of course, is that from a replication perspective this is identical to a bcp operation – a large transaction with no concurrency. An alternative approach, in which a delete list is generated and then used to cursor through the main tables using concurrent processes, may be more recoverable, cause fewer concurrency problems at the primary, and improve replication throughput. Consider the following benchmark results from a 50,000 row insert into one table from a different table (mimicking a typical insert from a staging table to a production table):

50,000 Row Bulk Insert Between Two Tables

Method                                               Time (sec)
Single SQL statement (insert/select)                      1
10 threads processing 1 row at a time                    57
10 threads processing 100 ranged rows at a time*           5
10 threads processing 250 ranged rows at a time*           1

By ranged rows (*), the system predefined 10 ranges of rows (i.e. 1-5000, 5001-10000, 10001-15000, etc.). As each thread initialized, it was assigned a specific range. It then performed the same insert/select, but specified a rowcount of 100 or 250 as noted above. Ignoring the replication aspects, the above benchmark easily demonstrates a couple of key batch processing hallmarks:

1. It is possible to achieve the same performance as large bulk statements by running parallel processes using smaller bulk statements on predefined ranges

2. Atomic statement processing is slow

This leads to a second key concept:

Key Concept #3: The optimal primary transaction profile for replication is concurrent users updating/inserting/deleting small numbers of rows per transaction spread throughout different tables.

That does not mean low volume! It can be extremely high volume. It just means it is better from a replication standpoint for 10 processes to delete 1,000 rows each in batches of 100 than for a single process to delete 100,000 rows in a single transaction. Accordingly, the best way to improve replication performance of large batch operations is to alter the batch operation to use concurrent smaller transactions vs. a single large transaction.
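A sketch of what one such concurrent worker might look like in T-SQL follows; the table, column and procedure names are hypothetical, and it assumes the rows can be pre-partitioned on a numeric key so that each session gets its own range.

create proc move_range
    @low_key  int,        -- first key assigned to this worker
    @high_key int,        -- last key assigned to this worker
    @chunk    int = 250   -- rows per committed transaction
as
begin
    declare @from int, @to int
    select @from = @low_key
    while @from <= @high_key
    begin
        select @to = @from + @chunk - 1
        if @to > @high_key
            select @to = @high_key
        begin tran
            insert into prod_table
            select *
              from staging_table
             where key_col between @from and @to
        commit tran
        select @from = @to + 1
    end
end
go
-- Ten concurrent sessions, each given its own range, e.g.:
--   exec move_range 1, 5000
--   exec move_range 5001, 10000
--   ...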

An interesting test (some results were described above) was done on a dual-processor (850MHz P3 standard, not XEON) NT workstation with ASE 12.5 and RS 12.5 running on the same host machine. Several batch inserts of 25,000-100,000 rows were conducted from one database on the ASE engine to another using a Warm Standby implementation. By using 10 processes to perform the inserts in 250-row transactions in pre-defined ranges, RS was still able to reliably achieve 750-1,000 rows per second total throughput (and since ASE was configured for 2 engines, this machine was sorely over-utilized). This was all accomplished with 10 parallel threads in RS with dsi_serialization_method set to 'isolation_level_3'.

Replicating SQL for Batch Processing

The fundamental problem in batch processing is that a single SQL statement at the primary is translated into thousands of rows at the replicate – each row requiring RS resources for processing and then incurring the typical parse, optimize and sleep-pending-I/O delays at the replicate dataserver. For updates and deletes, users of ASE 12.5 and RS 12.5 can take advantage of a feature introduced with ASE 12.0 that allows the actual replication of a SQL statement. Consider the following code fragment:

if exists (select 1 from sysobjects
           where name = "replicated_sql" and type = "U" and uid = user_id())
    drop table replicated_sql
go
create table replicated_sql (
    sql_statement_id numeric(20,0) identity,
    sql_string       varchar(1800) null,
    begin_time       datetime default getdate() not null,
    commit_time      datetime default getdate() not null
)
go
create unique clustered index rep_sql_idx on replicated_sql (sql_statement_id)
go
create trigger replicated_sql_ins_trig on replicated_sql for insert as
begin
    declare @sqlstring varchar(1800)
    select @sqlstring = sql_string from inserted
    set replication off
    execute(@sqlstring)
    set replication on
end
go
exec sp_setreptable replicated_sql, true
go
if exists (select 1 from sysobjects
           where name = "sp_replicate_sql" and type = "P" and uid = user_id())
    drop proc sp_replicate_sql
go
create proc sp_replicate_sql
    @sql_string varchar(1800)
as
begin
    declare @began_tran     tinyint,
            @triggers_state tinyint,
            @proc_name      varchar(60)

    select @proc_name = object_name(@@procid)

    -- check for tran state. If already in tran, set a save point so we are well-behaved
    if @@trancount = 0
    begin
        select @began_tran = 1
        begin transaction rep_sql
    end
    else
    begin
        select @began_tran = 0
        save transaction rep_sql
    end

    -- check for trigger state. For NT, byte 6 of @@options & 0x02 = 2 is on
    -- in unix, the bytes may be swapped
    if (convert(int, substring(@@options, 6, 1)) & 0x02 = 0)
    begin
        select @triggers_state = 0
        -- since triggers are off, we'd better check if we can turn them on
        if proc_role('replication_role') = 0
        begin
            raiserror 30000 "%1!: You must have replication role to execute this procedure at the replicate", @proc_name
            if @began_tran = 1
                rollback tran
            return(-1)
        end
        set triggers on
    end
    else
    begin
        select @triggers_state = 1
    end

    -- okay, now we can do the insert
    insert into replicated_sql (sql_string) values (@sql_string)
    if @@error != 0 or @@rowcount = 0
    begin
        rollback tran rep_sql
        raiserror 30001 "%1!: Insert failed. Transaction rolled back", @proc_name
        if @triggers_state = 0
            set triggers off
        return(-1)
    end
    else if @began_tran = 1
        commit tran

    if @triggers_state = 0
        set triggers off
    return (0)
end
go
exec sp_setrepproc 'sp_replicate_sql', 'function'
go

Then use the following replication definitions (this example is for a Warm Standby between two copies of pubs2 with a logical connection of WSTBY.pubs2):

create replication definition replicated_sql_repdef
    with primary at WSTBY.pubs2
    with all tables named replicated_sql (
        sql_statement_id identity,
        sql_string       varchar(1800)
    )
    primary key (sql_statement_id)
    send standby replication definition columns
go
create function replication definition sp_replicate_sql
    with primary at WSTBY.pubs2
    deliver as sp_replicate_sql (
        @sql_string varchar(1800)
    )
    send standby all parameters
go

Now, if you really want to amaze your friends, simply execute something like the following:

exec sp_replicate_sql "insert into publishers values ('9990', 'Sybase, Inc.', 'Dublin', 'CA')"

The trick is in the execute() call in the trigger and the trigger-state handling in the stored procedure. Starting in ASE 12.0, Sybase provided a capability to execute dynamically constructed SQL statements using the execute() function. However, if the execute() function is placed directly in a replicated procedure, the Rep Agent stack traces and fails (a nasty recovery issue for a production database); if the execute() function is in a trigger, the Rep Agent behaves fine. Accordingly, we simply insert the desired SQL statement into a table. Of course, this also provides us a way to audit the execution of batch SQL and compare commit times for latency purposes (even replicated SQL statements could run for a long time).

Now then, the only problem is that with Warm Standby, triggers are turned off by default via the dsi_keep_triggers setting (and it is probably off for most other normal replication implementations as well). Rather than enabling triggers for the entire session and causing performance problems during the day, we borrow a trick: dsi_keep_triggers simply issues the 'set triggers off' command. And rather than indiscriminately turning the triggers off and back on at the beginning and end of the procedure, we employ trick #2 – @@options. @@options is an undocumented global variable that stores session settings – such as 'set arith_abort on', etc. Since it is a binary value, you need to consider the byte order on your host; however, it now becomes a simple matter to replicate a proc that turns on triggers, inserts a SQL string into a table (which in turn triggers the execution of the string), and then returns triggers to the original setting and exits.

By the way, why replicate both the table and the proc? Well, the answer is it allows you to replicate truncate table or SQL deletes against the table when it begins getting unwieldy.

As stated, this is a neat trick for handling updates and deletes. Inserts, particularly bcp’s are not able to use this for the simple fact that the source data needs to exist at the replicate already. However, if batch feeds are bcp’d into staging databases on both systems (which should be done in WS situations), the bulk insert into the production database using ‘insert into … select…’ can be replicated in this fashion as well. Additionally, while it has been stated that this is limited to the 12.5 versions of the products, it will in fact work with any 12.x version, but the SQL statement would be limited to 255 characters due to the varchar(255) limitation prior to ASE 12.5 and RS 12.5.

Batch Processing & Ignore_dupe_key

Some of the more interesting problems arise when programmers make logical assumptions – and, without fully understanding the internal workings of ASE, implement an easy workaround. Consider the following code snippet that might be used when moving rows from a staging database to the production system:

create proc load_prod_table
    @batch_size int = 250…
as
begin
    declare @done_loading tinyint
    select @done_loading = 0
    set rowcount @batch_size
    while @done_loading = 0
    begin
        insert into prod_table…
        select from staging_table
        if @@rowcount = 0
            select @done_loading = 1
        delete staging_table
    end
end

This appears to be fairly harmless, and assuming that the proc is NOT replicated, it would appear to be a normal implementation. However, two things are wrong with it:

• The assumption is that the same rows selected for insert will be the same rows deleted. Remember, if worker threads are involved, this may not be the case, particularly with partitioned tables. As a result, the delete could affect other rows than those inserted.

• The assumption is that the insert only READ ‘rowcount’ rows from the source data. This is perhaps the biggest failure that affects performance.


Why is the last bullet so important? Remember that setting rowcount affects the final result set – it does not limit any subqueries, etc. Hence 'select sum(x) from y group by a' will return 'rowcount' rows despite the fact it may have to scan millions of rows to generate the sums. Accordingly, it may require ASE to scan hundreds or thousands of rows to generate 'rowcount' unique rows for a table in which ignore_dup_key is set for the primary key index.

So, why is this a problem? Let's assume that we have a batch of 100,000 records in which 50% of them are duplicates (every other row) or already exist in the target table. Assuming rowcount is set to 250, the insert would have to scan 500 rows in order to generate 250 unique ones to be inserted. However, the delete would only remove 250 of them. As a result, on the second pass through the loop, the insert would scan the 250 rows it had already scanned and then an additional 500 rows to get 250 unique ones that it could insert. And the delete would remove 250. On the third pass, the insert would scan 500 rows already processed plus 500 new rows. And so forth. Essentially, even though 100,000 rows with 50% unique and a batch size of 250 would suggest a fairly smooth 200 iterations through the loop, by the last iteration the insert would be scanning 49,750 rows already scanned plus the final 500 (with 250 unique).

A reproduction of this problem (for the confused or interested) is as below:

use tempdb
go
if exists (select 1 from sysobjects
           where name = "test_table" and type = "U" and uid = user_id())
    drop table test_table
go
create table test_table (
    col_1 int         not null,
    col_2 varchar(40) null
)
go
create unique nonclustered index test_table_idx
    on test_table (col_1) with ignore_dup_key
go
if exists (select 1 from sysobjects
           where name = "test_table_staging" and type = "U" and uid = user_id())
    drop table test_table_staging
go
create table test_table_staging (
    col_1 int         not null,
    col_2 varchar(40) null
)
go
insert into test_table_staging values (1,"expected batch=1")
insert into test_table_staging values (2,"expected batch=1")
insert into test_table_staging values (3,"expected batch=1")
insert into test_table_staging values (3,"expected batch=1")
insert into test_table_staging values (4,"expected batch=1")
insert into test_table_staging values (5,"expected batch=2")
insert into test_table_staging values (6,"expected batch=2")
insert into test_table_staging values (7,"expected batch=2")
insert into test_table_staging values (7,"expected batch=2")
insert into test_table_staging values (8,"expected batch=2")
insert into test_table_staging values (9,"expected batch=3")
insert into test_table_staging values (10,"expected batch=3")
insert into test_table_staging values (11,"expected batch=3")
insert into test_table_staging values (11,"expected batch=3")
insert into test_table_staging values (12,"expected batch=3")
go
if exists (select 1 from sysobjects
           where name = "insert_test_table" and type = "P" and uid = user_id())
    drop proc insert_test_table
go
CREATE PROC insert_test_table
    @batchsize INT = 5
AS
BEGIN
    DECLARE @cnt INT, @myloop int, @err INT, @del int   -- @del added to track deletes
    SELECT @cnt = -1, @err = 0, @myloop = 1
    SET ROWCOUNT @batchsize
    WHILE @cnt != 0
    BEGIN
        select "Loop ----------- ", @myloop
        INSERT test_table (col_1, col_2)
        SELECT col_1, col_2 + " ==> actual batch=" + convert(varchar(3), @myloop)
          FROM test_table_staging
        SELECT @cnt = @@ROWCOUNT, @err = @@ERROR
        set rowcount 0
        select "test_table:"
        select * from test_table            -- added to show what is inserted to this point
        select "Rowcount = ", @cnt
        set rowcount @batchsize
        DELETE test_table_staging
        select @del = @@rowcount            -- capture the delete count reported below
        set rowcount 0
        select "test_table_staging:"
        select * from test_table_staging    -- added to show what is left
        select "Delete Rowcount = ", @del
        set rowcount @batchsize
        select @myloop = @myloop + 1
    END
    RETURN 0
END
go

Consider the following sample execution – since the default is set to 5, executing the procedure without any parameter value should result in a ROWCOUNT limit of 5 rows:

use tempdb
go
select * from test_table_staging
go
exec insert_test_table
go
select * from test_table
go

The output from this as executed is:

col_1       col_2
----------- ----------------------------------------
          1 expected batch=1
          2 expected batch=1
          3 expected batch=1
          3 expected batch=1
          4 expected batch=1
          5 expected batch=2
          6 expected batch=2
          7 expected batch=2
          7 expected batch=2
          8 expected batch=2
          9 expected batch=3
         10 expected batch=3
         11 expected batch=3
         11 expected batch=3
         12 expected batch=3
(15 rows affected)

The above is the output from the first select statement, showing the original 15 rows containing 3 duplicates (3, 7, and 11). Note rows 5, 9, and 10 in particular and their expected batch values. Now, consider the procedure execution – loop iteration 1 is contained below:

Loop -----------            1
Duplicate key was ignored.
test_table:
col_1       col_2
----------- ----------------------------------------
          1 expected batch=1 ==> actual batch=1
          2 expected batch=1 ==> actual batch=1
          3 expected batch=1 ==> actual batch=1
          4 expected batch=1 ==> actual batch=1
          5 expected batch=2 ==> actual batch=1
(5 rows affected)
Rowcount = 5
test_table_staging:
col_1       col_2
----------- ----------------------------------------
          5 expected batch=2
          6 expected batch=2
          7 expected batch=2
          7 expected batch=2
          8 expected batch=2
          9 expected batch=3
         10 expected batch=3
         11 expected batch=3
         11 expected batch=3
         12 expected batch=3
(10 rows affected)
Delete Rowcount = 5

Note what occurred. Because of the duplicate row for row_id 3, the subquery select in the insert statement had to read 6 rows – consequently, row_id 5 was actually inserted as part of the first batch. However, because the delete is an independent statement, it simply deletes the first 5 rows, which contain the duplicate, leaving row_id 5 in the list. Now, consider what happens with loop iteration #2:

Loop -----------            2
Duplicate key was ignored.
test_table:
col_1       col_2
----------- ----------------------------------------
          1 expected batch=1 ==> actual batch=1
          2 expected batch=1 ==> actual batch=1
          3 expected batch=1 ==> actual batch=1
          4 expected batch=1 ==> actual batch=1
          5 expected batch=2 ==> actual batch=1
          6 expected batch=2 ==> actual batch=2
          7 expected batch=2 ==> actual batch=2
          8 expected batch=2 ==> actual batch=2
          9 expected batch=3 ==> actual batch=2
         10 expected batch=3 ==> actual batch=2
(10 rows affected)
Rowcount = 5
test_table_staging:
col_1       col_2
----------- ----------------------------------------
          9 expected batch=3
         10 expected batch=3
         11 expected batch=3
         11 expected batch=3
         12 expected batch=3
(5 rows affected)
Delete Rowcount = 5

Again, notice what occurred. Because row_id 5 is repeated and because of the duplicate for row_id 7, the insert scans 7 rows to achieve the rowcount of 5. Of course, the delete only removes the next five, leaving rows 9 & 10 still in the staging table. Finally, we come to the last loop iterations:

Loop -----------            3
Duplicate key was ignored.
test_table:
col_1       col_2
----------- ----------------------------------------
          1 expected batch=1 ==> actual batch=1
          2 expected batch=1 ==> actual batch=1
          3 expected batch=1 ==> actual batch=1
          4 expected batch=1 ==> actual batch=1
          5 expected batch=2 ==> actual batch=1
          6 expected batch=2 ==> actual batch=2
          7 expected batch=2 ==> actual batch=2
          8 expected batch=2 ==> actual batch=2
          9 expected batch=3 ==> actual batch=2
         10 expected batch=3 ==> actual batch=2
         11 expected batch=3 ==> actual batch=3
         12 expected batch=3 ==> actual batch=3
(12 rows affected)
Rowcount = 2
test_table_staging:
col_1       col_2
----------- ----------------------------------------
(0 rows affected)
Delete Rowcount = 5

Loop -----------            4
test_table:
col_1       col_2
----------- ----------------------------------------
          1 expected batch=1 ==> actual batch=1
          2 expected batch=1 ==> actual batch=1
          3 expected batch=1 ==> actual batch=1
          4 expected batch=1 ==> actual batch=1
          5 expected batch=2 ==> actual batch=1
          6 expected batch=2 ==> actual batch=2
          7 expected batch=2 ==> actual batch=2
          8 expected batch=2 ==> actual batch=2
          9 expected batch=3 ==> actual batch=2
         10 expected batch=3 ==> actual batch=2
         11 expected batch=3 ==> actual batch=3
         12 expected batch=3 ==> actual batch=3
(12 rows affected)
Rowcount = 0
(return status = 0)

col_1       col_2
----------- ----------------------------------------
          1 expected batch=1 ==> actual batch=1
          2 expected batch=1 ==> actual batch=1
          3 expected batch=1 ==> actual batch=1
          4 expected batch=1 ==> actual batch=1
          5 expected batch=2 ==> actual batch=1
          6 expected batch=2 ==> actual batch=2
          7 expected batch=2 ==> actual batch=2
          8 expected batch=2 ==> actual batch=2
          9 expected batch=3 ==> actual batch=2
         10 expected batch=3 ==> actual batch=2
         11 expected batch=3 ==> actual batch=3
         12 expected batch=3 ==> actual batch=3
(12 rows affected)

Because of the implementation, each duplicate compounds the problem, causing subsequent batches to begin with duplicates. So what’s the problem?? A couple of points are key to understanding what is happening:

• When a duplicate is encountered, the server uses a Compensation Log Record (CLR) to undo a previous log record – in this case, the duplicate insert.

• “SET ROWCOUNT” affects the number of rows affected by the statement vs. the rows processed by subquery or other individual parts of the statement. Consequently an insert limited by SET ROWCOUNT to 5 rows may have to read 6 or more rows if a duplicate is present.

• The implementation does not check to ensure that the rows inserted are the rows being deleted. Consequently, some rows could be “dropped” without even being inserted.

Now then, since the Rep Agent can be fully caught up, it replicates records for uncommitted transactions as well as committed ones. In this case, as soon as each log page is flushed, the Rep Agent can read it. Since the log page contains the duplicate rows for those being inserted (remember, bulk SQL first logs the affected rows and THEN applies them), it also reads the CLR records – which is needful. By this point you can determine that the following is occurring (assuming, again, the earlier 100,000-row batch with 50% duplicates processed in 250-row iterations):

• Each loop iteration causes an additional 250 duplicate insert rows to be replicated, along with 250 CLR records, over the previous iteration

• By the last iteration, RS receives ~49,750 duplicate insert records, 49,750 CLR records plus 250 duplicate inserts from the last batch along with the 250 CLR records and then (last but not least) the 250 actually inserted rows.

This is all in one transaction. Across all 200 iterations, RS must then remove the duplicate inserts that the CLR records point to. Consequently, this seemingly innocent 100,000 row insert of 50,000 new rows results in an astounding 4,925,250 total CLR records (250+500+750+…+49,250+49,500) and an equal number of duplicate inserts, for a whopping total of 9,850,500 unnecessary records on top of the 50,000 rows really wanted. Can you guess the impact on:

• Your transaction log at the primary system (remember, all those CLRs and inserts are logged)!!!
• The Replication Server performance as it also removes all the duplicates!!!

Oh, yes, this actually did happen at a major bank, and may have happened at at least one other site that we are aware of. The point of this discussion is that even though the SQL to remove the duplicates from the staging table appeared to be a slower design than the quick "band-aid" of ignore_dup_key, in reality, given the data quality, it turns out to be a tremendous performance boost. Sometimes, band-aids don't stick.
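For completeness, a sketch of that "slower looking" design – removing the duplicates from the staging table up front – using the table names from the reproduction above. It assumes the duplicate staging rows are fully identical, as in the repro data.

-- Drop staging rows whose key already exists in the target table, and collapse
-- identical duplicates within the staging table itself
select distinct col_1, col_2
  into #staging_clean
  from test_table_staging s
 where not exists (select 1
                     from test_table t
                    where t.col_1 = s.col_1)
go
truncate table test_table_staging
go
insert test_table_staging
select col_1, col_2 from #staging_clean
go
drop table #staging_clean
go
-- The batched insert loop can now run without ignore_dup_key ever firing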


Replication Agent Processing

Why is the Replication Agent so slow??? Frequently, comments are made that the ASE Rep Agent is not able to keep up with logging in the ASE. For most normal user processing, a properly tuned Rep Agent on a properly tuned transaction log/system will have no trouble keeping up. This is especially true if the bulk of the transactions originate from GUI-based user screens, since such applications naturally tend to have an order of magnitude more reads than writes. However, for systems with large direct electronic feeds or sustained bulk loading, Replication Agent performance is crucial. At this writing, a complete replication system based on Replication Server 12.0 is capable of maintaining over 2GB/hr from a single database in ASE 11.9.3 using normal RAID devices (vs. SSD's). In a different type of test, the ASE 12.5.2 RepAgent thread on a single-cpu NT machine is capable of sending >3,000 updates/second to Replication Server 12.6. Note that there are many factors that contribute to RepAgent performance – cpu load from other users, network capabilities, etc. Readers should not expect to achieve the same results if their system is notoriously cpu or network bound (for example).

In this section we will be examining how the Replication Agent works – and in particular, two bottlenecks quite easily overcome by adjusting configuration parameters. As mentioned earlier, since this paper does not yet address many of the aspects of heterogeneous replication, this section should be read in the context of the ASE Replication Agent thread. However, the discussions on Log Transfer Language and the general Rep Agent communications are common to all replication agents as all are based on the replication agent protocol supported by Sybase.

Secondary Truncation Point Management

Everyone knows that the ASE Replication Agent maintains the ASE secondary truncation point; however, there are a lot of misconceptions about the secondary truncation point and the Replication Agent, including:

• The Replication Agent looks for the secondary truncation point at startup and begins re-reading the transaction log from that point.

• The Replication Agent cannot read past the primary truncation point.
• "Zero-ing the LTM" resets the secondary truncation point back to the beginning of the transaction log.

As you might guess, these are not necessarily accurate. In reality, there is a lot more communication and control from the Replication Server in this process than is generally realized.

Replication Agent Communication Sequence

The sequence of events during communication between the Replication Agent and the Replication Server is more along the lines of:

1. The Replication Agent logs in to the Replication Server and requests to “connect” the source database (via the “connect source” command) and provides a requested LTL version. Replication Server responds with the negotiated LTL version and upgrade information.

2. The Rep Agent asks the Replication Server who the maintenance user is for that database. The Replication Server looks the maintenance user up in the rs_maintusers table in the RSSD database and replies to the Rep Agent.

3. The Rep Agent asks the Replication Server where the secondary truncation point should be. The Replication Server looks up the locater in the rs_locaters table in the RSSD database and replies to the Rep Agent.

4. The Rep Agent starts scanning from the location provided by the Replication Server.
5. The Replication Agent scans a configurable number (scan_batch_size) of log records.
6. After reaching scan_batch_size log records, the Replication Agent requests a new secondary truncation point for the transaction log. When this request is received, the Replication Server responds with the cached locater, which contains the log page containing the oldest open transaction received from the Replication Agent. In addition, the Replication Server writes this cached locater to the rs_locaters table in the RSSD.

7. The Rep Agent moves the secondary truncation point to the log page containing the oldest open transaction received by Replication Server.

8. Repeat step 5.


An interaction diagram for this might look like the following:

[Figure 6 diagram: interaction sequence between the RepAgent, the Replication Server, the RSSD and the replicate DS.DB – ct_connect(ra_user, ra_pwd) validated against rs_users; "connect source lti ds.db 300 [mode]" answered from rs_sites with the negotiated LTL version; "get maintenance user for ds.db" answered from rs_maintusers; "get truncation site.db" answered from (and subsequently written back to) rs_locaters; then log_scan() begins, with LTL flowing to the Replication Server and SQL applied to the replicate.]

Figure 6 – Replication Interaction Diagram for Rep Agent to RSSD

The key elements to get out of this are fairly simple:

• Keep the RSSD as close as possible to the RS.

• Every scan_batch_size rows, the Rep Agent stops forwarding rows to move the secondary truncation point.

• The secondary truncation point is set to the oldest open transaction received by Replication Server – which may be the same as the oldest transaction in ASE (syslogshold) or it may be an earlier transaction as the Rep Agent has not yet read the commit record from the transaction log.

Regarding the first, if you notice, most of the time that the Rep Agent asks the RS for something, the RS has to check with the RSSD – or update the RSSD (i.e. the locater). So, don’t put the RSSD too far (network-wise) from the RS. The best place is on the same box, with the primary network listener for the RSSD ASE being the TCP loopback address (127.0.0.1).

Replication Agent Scanning

The second can be overcome with a willingness to absorb more log utilization. The default scan_batch_size is 1,000 records. As anyone who has read the transaction log will tell you, 1,000 log records happen pretty quickly. The result is that the Rep Agent is frequently moving the secondary truncation point. Benchmarks have shown that raising scan_batch_size can increase replication throughput significantly. For example, at an early Replication Server customer, setting it to 20,000 improved overall RS throughput by 30%. Of course, the tradeoff to this is that the secondary truncation point stays at a single location in the log longer – which translates to a higher degree of space used in the transaction log. In addition, database recovery time as well as replication agent recovery time will be lengthened, as the portion of the transaction log that will be rescanned at database server and replication agent startup will be longer.

In contrast to the last paragraph, some have reported better performance with a lower scan batch size – particularly in Warm Standby situations. While not definite, there is considerable thought within Sybase that this has the same impact as exec_cmds_per_timeslice in that it "throttles" the RepAgent back and allows other threads more access time. As the other threads are then better able to keep up, there is less contention for the inbound queue (SQM reads are not delaying SQM writes). While decreasing the RepAgent workload is one way to solve the problem, a better solution would be to improve the DSI or other throughput to allow it to keep up without throttling back the RepAgent.


Rep Agent LTL Generation

The protocol used by sources to communicate with the Replication Server is called Log Transfer Language (LTL). Any agent that wishes to replicate data via Replication Server must use this protocol, much the same way that RS must use SQL to send transactions to ASE. Fortunately, this is a very simple protocol with very few commands. The basic commands are listed in the table below.

LTL Command           Subcommand                    Function
connect source                                      Request to connect a source database to the replication system in order to start forwarding transactions.
get maintenance user                                Request to retrieve the maintenance user name to filter transactions applied by the replication system.
get truncation                                      Request to retrieve a log pointer to the last transaction received by the Replication Server.
distribute            begin transaction             Used to distribute begin transaction statements.
                      commit/rollback transaction   Used to distribute commit/rollback statements.
                      applied                       Used to distribute insert/update/delete SQL statements.
                      execute                       Used to distribute both replicated procedures as well as request functions.
                      sqlddl append                 Used to distribute DDL to Warm Standby systems.
                      dump                          Used to distribute the dump database/transaction log SQL commands.
                      purge                         Used during recovery to notify Replication Server that previously uncommitted transactions have been rolled back.

A sample of what LTL looks like is as follows: distribute @origin_time='Apr 15 1988 10:23:23.001PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000001, @tran_id=0x000000000000000000000001 begin transaction 'Full LTL Test'

-- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.002PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000002, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_insert yielding after @intcol=1,@smallintcol=1,@tinyintcol=1,@rsaddresscol=1,@decimalcol=.12, @numericcol=2.1,@identitycol=1,@floatcol=3.2,@realcol=2.3,@charcol='first insert',@varcharcol='first insert',@text_col=hastext always_rep,@moneycol=$1.56, @smallmoneycol=$0.56, @datetimecol='4-15-1988 10:23:23.001PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM', @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899, @imagecol=hastext rep_if_changed,@bitcol=1

-- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.003PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000003, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append first last changed with log textlen=30 @text_col=~.!!?This is the text column value.

-- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.004PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000004, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append first changed with log textlen=119 @imagecol=~/!"!gx"3DUfw@4ª»ÌÝîÿðÿ@îO@Ý@y@f9($&8~'ui)*7^Cv18*bhP+|p{`"]?>,D *@4ª

-- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.005PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000005, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append @imagecol=~/!!7Ufw@4ª»ÌÝîÿðÿ@îO@Ý@y@f

-- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.006PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000006,


@tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append last @imagecol=~/!!Bîÿðÿ@îO@Ý@y@f9($&8~'ui)*7^Cv18*bh

-- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.007PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000007, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_update yielding before @intcol=1,@smallintcol=1,@tinyintcol=1,@rsaddresscol=1,@decimalcol=.12,@numericcol=2.1,@identitycol=1,@floatcol=3.2,@realcol=2.3,@charcol='first insert', @varcharcol='first insert',@text_col=notrep always_rep, @moneycol=$1.56,@smallmoneycol=$0.56,@datetimecol='Apr 15 1988 10:23:23.002PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM', @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899, @imagecol=notrep rep_if_changed, @bitcol=1 after @intcol=1, @smallintcol=1, @tinyintcol=1, @rsaddresscol=1, @decimalcol=.12, @numericcol=2.1, @identitycol=1, @floatcol=3.2, @realcol=2.3, @charcol='updated first insert', @varcharcol='first insert', @text_col=notrep always_rep, @moneycol=$1.56, @smallmoneycol=$0.56, @datetimecol='Apr 15 1988 10:23:23.002PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM', @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899, @imagecol=notrep rep_if_changed, @bitcol=0

Although it looks complicated, the above is fairly simple – all of the above are distribute commands for a part of a transaction comprised of multiple SQL statements. The basic syntax for a distribute command for a DML operation is as follows:

distribute <commit time> <OQID> <tran id> applied <table>.<function> yielding [before <col name>=<value> [, <col name>=<value>, …]] [after <col name>=<value> [, <col name>=<value>, …]]

As you could guess, the distribute command will make up most of the communication between the Rep Agent and the Rep Server. Looking closely at what is being sent, you will notice several things:

• The appropriate replicated function (rs_update, rs_insert, etc.) is part of the LTL (highlighted above)
• The column names are part of the LTL

The latter is not always the case, as some heterogeneous Replication Agents can cheat and not send the column names (assuming the Replication Definition was defined with the columns in the same order, or through a technique called “structured tokens”). Although currently beyond the scope of this paper, this is achieved by the Replication Agent directly accessing the RSSD to determine replication definition column ordering. This improves Replication Agent performance by reducing the size of the LTL to be transmitted and allowing the Replication Agent to drop columns not included in the replication definition. This information, once retrieved, can be cached for subsequent records. Currently, the ASE Replication Agent does not support this interface. However, in general, the LTL distribute command illustrated above does leave us with another key concept:

Key Concept #4: Ignoring subscription migration, the appropriate replication function rs_insert, rs_update, etc., for a DML operation is determined by the replication agent from the transaction log. The DIST/SRE determines which functions are sent according to migration rules, while the DSI determines the SQL language commands for that function.

Having determined what the Replication Agent is going to send to the Replication Server, the obvious question is how does it get to that point? The answer is based on two separate processes – the normal ASE Transaction Log Service (XLS) and the Rep Agent. The process is similar to the following:

1. (XLS) The XLS receives a log record to be written from the ASE engine.

2. (XLS) The XLS checks the object catalog to see if the logged object’s OSTAT_REPLICATED bit is set.

3. (XLS) If not, the XLS simply skips to writing the log record. If it is set, then the XLS checks to see if the DML logged event is nested inside a stored procedure that is also replicated.

4. (XLS) If so, the XLS simply skips to writing the log record. If not, then the XLS sets the log record’s LSTAT_REPLICATE flag bit.

5. (XLS) The XLS writes the record to the transaction log.

6. (RA) Some arbitrary time later, the Rep Agent reads the log record.

7. (RA) The Rep Agent checks to see if the log record’s LSTAT_REPLICATE bit is set.

8. (RA) If so, the Rep Agent proceeds to LTL generation. If not, the Rep Agent determines if the log record is a “special log record” such as begin/commit pairs, dump records, etc.

9. (RA) If not, the Rep Agent can simply skip to the next record. If it was, the Rep Agent proceeds with constructing LTL.


10. (RA) The Rep Agent checks to see if the operation was an update. If so, it also reads the next record to construct the before/after images.

11. (RA) The Rep Agent checks to see if the logged row was a text chain allocation. If so, it reads the text chain to find the TIPSA. This TIPSA is then used to find the data row for the text modification. The data row for writetext is then constructed in LTL. Then the text chain is read and constructed into LTL chunks of text/image append functions.

12. (RA) LTL generation begins. The Rep Agent checks its own schema cache (part of proc cache) to see if the logged object’s metadata is in cache. If not, it reads the object’s metadata from the system tables (syscolumns).

13. (RA) The Rep Agent constructs the LTL statement for the logged operation.

14. (RA) If the ‘batch_ltl’ parameter is false (default), the Rep Agent passes the LTL row to the Rep Server using the distribute command. If ‘batch_ltl’ is true, the Rep Agent waits until the LTL buffer is full prior to sending the records to the Rep Server.
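The OSTAT_REPLICATED bit checked in step 2 is only set for objects that have been marked for replication. As a minimal sketch (the table name is taken from the pubs2 examples used later in this section; marking is normally done as part of setting up replication definitions or a Warm Standby), the standard marking procedure looks like:

use pubs2
go
-- mark the titles table for replication; the XLS will now set LSTAT_REPLICATE
-- on log records for this table so the Rep Agent picks them up
exec sp_setreptable titles, 'true'
go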

This process is illustrated below. The two services are shown side-by-side due to the fact that they are independent threads within the ASE engine and execute in parallel on different log regions. This latter is due to the fact that the Rep Agent can only read flushed log pages (flushed to disk); consequently, it will always be working on a different log page than the XLS service.

[Figure: side-by-side flowcharts of the ASE XLS service (receive log record → check OSTAT_REPLICATED → check for nesting in a replicated stored procedure → set LSTAT_REPLICATE → write record to the transaction log) and Rep Agent processing (read next log record → check LSTAT_REPLICATE / special records such as begin/commit or schema changes → read before/after images for updates → locate the data row and text chain for writetext (rs_datarow_for_writetext) → check the schema cache or read metadata from syscolumns → construct LTL → send to Replication Server or wait for the LTL buffer to fill when batching is on).]

Figure 7 - ASE XLS and Replication Agent Execution Flow

The following list summarizes the key elements of how this affects replication performance and tuning.

• The Replication Agent has a schema cache to maintain object metadata (schema cache) for constructing LTL, as well as a cache for tracking transactions (transaction cache). As a result, more procedure cache may be necessary on systems with a lot of activity on large numbers of tables. In addition, the system metadata cache should be monitored carefully to ensure that physical reads to system tables are not necessary.

• LTL batching can significantly improve Rep Agent processing as it can scan more records prior to sending the rows to the Rep Server (effectively a synch point in Rep Agent processing).

• Replicating text/image columns can slow down Rep Agent processing of the log due to reading the text/image chain.

• Marking objects for replication that are not distributed (i.e. for which no subscriptions or Warm Standby exists) has a negative impact on Rep Agent performance as it must perform LTL generation needlessly. In addition, these “extra” rows will consume space in the inbound stable queue and valuable CPU time for the distributor thread.

• Procedure replication can improve Rep Agent throughput by reducing the number of rows for which LTL generation is required. For example, if a procedure modifies 1,000 rows, replicating the table will require 1,000 LTL statements to be generated (and compared in the distributor thread). By replicating the procedure only a single LTL statement will need to be generated and processed by Replication Server.

Key Concept #5 – In addition to Rep Agent tuning, the best way to improve Rep Agent performance is to minimize its workload. This can be achieved by not replicating text/image columns where not necessary and ensuring only objects for which subscriptions exist are marked for replication. In addition, replicating procedures for large impact transactions could improve performance significantly.

The last sentence may not make sense yet. However, a replicated procedure only requires a single row for the Replication Agent to process no matter how many rows are affected by it. How this is achieved as well as the benefits and drawbacks are discussed in the Procedure Replication section.
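As a minimal sketch of the difference (the procedure name is hypothetical and the procedure is assumed to already exist and be safe to re-execute at the replicate), marking the procedure rather than the table means one logged execution – and therefore one LTL statement – instead of one per modified row:

-- mark a stored procedure as a replicated function
exec sp_setrepproc upd_prices, 'function'
go
-- a single execution such as this generates one LTL execute statement,
-- regardless of how many rows the procedure modifies at the primary
exec upd_prices
go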

Note that in the above list, nowhere does it say that enabling replication slows down the primary by resorting to all deferred updates vs. in-place updates. The reason is that this was always a myth. While an update will generate two log records, for the before and after images respectively, the actual modification can be a normal update vs. a deferred one. Unfortunately, the existence of the two log records has led many to mistakenly assume that replication reverts to deferred updates.

Replication Agent Communications

The Rep Agent connects to the Replication Server in “PASSTHRU” mode. A common question is “What does it mean by passthru mode?” The answer lies in how the server responds to packets. In passthru mode, a client can send multiple packets to the server without having to wait for the receiver to process them fully. However, they do have to synchronize periodically for the client to receive error messages and statuses. A way to think of it is that the client can simply start sending packets to the server and, as soon as it receives packet acknowledgement from the TDS network listener, it can send the next packet. Asynchronously, the server can begin parsing the message. When the client is done, it sends an End-Of-Message (EOM) packet that tells the server to process the message and respond with status information. By contrast, typical client connections to Adaptive Server Enterprise are not passthru connections; consequently, the ASE server processes the commands immediately on receipt and passes the status information back to the client.

This technique provides the Rep Agent/Rep Server communication with a couple of benefits:

• The Rep Agent doesn’t have to worry if the LTL command spans multiple packets.
• The destination server can begin parsing the messages (but not executing) as received, achieving greater parallelism between the two processes.

If the Rep Agent configuration batch_ltl is true, the Rep Agent will batch LTL to optimize network bandwidth (although the TDS packet size is not configurable prior to ASE 12.5). If not, as each LTL row is created, it is sent to the Rep Server. In either case, the messages are sent via passthru mode to the Rep Server. Every 2K, the Rep Agent synchs with the Rep Server by sending an EOM (at an even command boundary – an EOM cannot be placed in the middle of an LTL command).

Replication Agent Tuning

Prior to ASE 12.5, the Replication Agent thread embedded inside ASE could not be tuned much. As this was a frequent cause of criticism, ASE engineering added several new configuration parameters to the replication agent. Some of these new parameters as well as other pre-existing parameters are listed below:

Parameter (Default) ASE Explanation

batch ltl Default: True Suggest: True (verify)

11.5* Specifies whether RepAgent sends LTL commands to Replication Server in batches or one command at a time. When set to "true", the commands are sent in batches. The default is "false" according to the manuals, however, in practice, most current ASE’s default this to “true”.


connect database Default: [dbname] Suggest: [dbname]

11.5 Specifies the name of the temporary database RepAgent uses when connecting to Replication Server in recovery mode. This is the database name RepAgent uses for the connect source command; it is normally the primary database.

connect dataserver Default: [dsname] Suggest: [dsname]

11.5 Specifies the name of the data server RepAgent uses when connecting to Replication Server in recovery mode. This is the data server name RepAgent uses for the connect source command; it is normally the data server for the primary database.

data limits filter mode Default: stop or off Suggest: truncate

12.5 Specifies how RepAgent handles log records containing new, wider columns and parameters, or larger column and parameter counts, before attempting to send them to Replication Server.
    off – RepAgent allows all log records to pass through.
    stop – RepAgent shuts down if it encounters log records containing wide data.
    skip – RepAgent skips log records containing wide data and posts a message to the error log.
    truncate – RepAgent truncates wide data to the maximum the Replication Server can handle.
Warning! Sybase recommends that you do not use the "data_limits_filter_mode, off" setting with Replication Server version 12.1 or earlier as this may cause RepAgent to skip or truncate wide data, or to stop. The default value of data limits filter mode depends on the Replication Server version number. For Replication Server versions 12.1 and earlier, the default value is "stop." For Replication Server versions 12.5 and later, the default value is "off."

fade_timeout Default: 30

11.5* Specifies the amount of time after the Rep Agent has reached the end of the transaction log and no activity has occurred before the Rep Agent will fade out its connection to the Replication Server. This command is still supported as of ASE 12.5.2 although not reported when executing sp_config_rep_agent to get a list of configuration parameters and their values.

ha failover Default: true Suggest: true

12.0 Specifies whether, when Sybase Failover has been installed, RepAgent automatically starts after server failover. The default is "true."

msg confidentiality Default: false Suggest: false

12.0 Specifies whether to encrypt all messages sent to Replication Server. This option requires the Replication Server Advanced Security option as well as the Security option for ASE to enable SSL-based data encryption.

msg integrity Default: false Suggest: false

12.0 Specifies whether all messages exchanged with Replication Server should be checked for tampering. This option requires the Replication Server Advanced Security option as well as the Security option for ASE to enable SSL-based data integrity.

msg origin check Default: false Suggest: false

12.0 Specifies whether to check the source of each message received from Replication Server.

msg out-of-sequence check Default: false Suggest: false

12.0 Specifies whether to check the sequence of messages received from Replication Server.


msg replay detection Default: false Suggest: false

12.0 Specifies whether messages received from Replication Server should be checked to make sure they have not been intercepted and replayed.

mutual authentication Default: false Suggest: false

12.0 Specifies whether RepAgent should require mutual authentication checks when connecting to Replication Server. This option is not implemented.

priority Default: 5 Suggest: 4

12.5 The thread execution priority for the Replication Agent thread within the ASE engine. Accepted values are 4-6 with the default being 5.

retry_time_out Default: 60

11.5* Specifies the number of seconds RepAgent sleeps before attempting to reconnect to Replication Server after a retryable error or when Replication Server is down. The default is 60 seconds.

rs servername 11.5* The name of the Replication Server to which RepAgent connects and transfers log transactions. This is stored in the sysattributes table.

rs username 11.5* The new or existing user name that RepAgent thread uses to connect to Replication Server. This is stored in the sysattributes table.

rs password 11.5* The new or existing password that RepAgent uses to connect to Replication Server. This is stored in encrypted form in the sysattributes table. If network-based security is enabled and you want to establish unified login, you must specify NULL for repserver_password when enabling RepAgent at the database.

scan_batch_size Default: 1000 Suggest: 10,000+ for high volume systems only

11.5* Specifies the maximum number of log records to send to Replication Server in each batch. When the maximum number of records is met, RepAgent asks Replication Server for a new secondary truncation point. The default is 1000 records. This should not be adjusted for low volume systems.

scan_time_out Default:15 Suggest: 5

11.5* Specifies the number of seconds that RepAgent sleeps once it has scanned and processed all records in the transaction log and Replication Server has not yet acknowledged previously sent records by sending a new secondary truncation point. RepAgent again queries Replication Server for a secondary truncation point after scan_timeout seconds. The default is 15 seconds. RepAgent continues to query Replication Server until Replication Server acknowledges previously sent records either by sending a new secondary truncation point or extending the transaction log. If Replication Server has acknowledged all records and no new transaction records have arrived at the log, RepAgent sleeps until the transaction log is extended.

schema_cache_growth_factor Default: 1 Suggest: 1-3

12.5 Controls the duration of time table or stored procedure schema can reside in the RepAgent schema cache before expiring. Larger values mean a longer duration and require more memory. Range is 1 to 10. This is a factor, so setting it to ‘2’ doubles the size of the schema cache.

Security mechanism 12.0 Specifies the network-based security mechanism RepAgent uses to connect to Replication Server.


send_buffer_size Default: 2K Suggest: 8-16K

12.5 Determines both the size of the internal buffer used to buffer LTL as well as the packet size used to send the data to the Replication Server. Accepted values are: 2K, 4K, 8K, or 16K (case insensitive), with the default of 2K. Larger send buffer sizes will reduce network traffic, as fewer sends are required. Note that this is not tied to the ASE server page size.

send maint xacts to replicate Default: false Suggest: false (don’t change)

11.5* Specifies whether RepAgent should send records from the maintenance user to the Replication Server for distribution to subscribing sites. The default is "false."

send structured oqids Default: false Suggest: true

12.5 Specifies whether the Replication Agent will send queue IDs (OQIDs) to the Replication Server as structured tokens or as binary strings (the default). Since every LTL command contains the oqid, this has the ability to significantly reduce network traffic. Valid values are true/false, default is false.

send_warm_standby_xacts Default: false for most, true for Warm Standby

11.5* Specifies whether RepAgent sends information about maintenance users, schema, and system transactions to the warm standby database. This option should be used only with the RepAgent for the currently active database in a warm standby application. The default is "false."

short ltl keywords Default: false** Suggest: false** ( true)**

12.5 Similar to "send structured oqids", this specifies whether the Replication Agent will use abbreviated LTL keywords to reduce network traffic. LTL keywords are commands, subcommands, etc. The default value is "false."

skip ltl errors Default: false Suggest: false

11.5 Specifies whether RepAgent ignores errors in LTL commands. This option is normally used in recovery mode. When set to "true," RepAgent logs and then skips errors returned by the Replication Server for distribute commands. When set to "false," RepAgent shuts down when these errors occur. The default is "false."

skip unsupported features Default: false Suggest: false

11.5 Instructs RepAgent to skip log records for Adaptive Server features unsupported by the Replication Server. This option is normally used if Replication Server is a lower version than Adaptive Server. The default is "false."

trace flags Default: 0

11.5* This is a bitmask of the RepAgent traceflags that are enabled. The valid traceflags are in the range 9201-9220 (not all values are valid).

trace log file Default: null Suggest: [filename as needed]

11.5* Specifies the full path to the file used for output of the Replication Agent trace activity.

Traceoff 11.5* Disables Replication Agent tracing activity.

Traceon 11.5* Enables Replication Agent tracing activity. Could severely degrade Rep Agent performance due to file I/O.

unified login Default: false Suggest: false

12.0 When a network-based security system is enabled, specifies whether RepAgent seeks to connect to other servers with a security credential or password. The default is "false."

* Some parameters above are noted as having been first implemented in ASE 11.5. This is due to the fact that ASE 11.5 was the first ASE with the Rep Agent Thread internalized. Prior to ASE 11.5, an external Log Transfer Manager (LTM) was used – it had similar parameters for those above, but sometimes used different names.

** In ASE 12.5.0.1, the short_ltl_keywords parameter seemed to operate in the reverse – setting ltl_short_keywords to ‘true’ resulted in the opposite of what was expected. See example later. However, this may be ‘fixed’ in a later EBF – if so, whether using this parameter or not, corrective action may be required.


In the above tables, several of the configuration parameters that will have the most impact on performance have been highlighted. A separate discussion of each is not included here, as a suggested configuration setting is mentioned for each of the above. While your optimal configuration may differ, these are a good starting point. In addition, a couple of the new parameters take a bit more explanation and are detailed in the following paragraphs.

Scan_Batch_Size

As mentioned in the description, in high volume environments, setting scan_batch_size higher can have a noticeable improvement on Replication Agent throughput. The reason should be clear from the description – the RepAgent stops scanning to request a secondary truncation point less often. However, in very low volume environments, this setting should be left at the default or possibly decreased. The reason is that when the RepAgent reaches the end of the log portion it was scanning, it checks to see if the log has been extended. If so, it simply starts scanning again – while not starting over, it does so without requesting a secondary truncation point if the scan_batch_size has not been reached. Consequently, if the system is experiencing “trickle” transactions which always extend the log, but are a low enough volume that it would take hours or days to reach the scan_batch_size, the secondary truncation point may not move during that time period – significantly impacting log space.

For example, one customer had a number of larger OLTP systems and the usual collection of lesser volume systems. In an attempt to adopt “standard configurations” (always a hazardous task), they had adopted a scan_batch_size setting of 20,000 as it did benefit the larger systems. However, in one of the lesser systems, the transaction log started filling and could not be truncated. It turned out that the system only had about 140 transactions per hour – which would take about 48 days to reach the 20,000 batch size at which point the secondary truncation point would finally be moved. Ouch!! Consequently, while adjusting scan_batch_size (and other settings) to drastically higher values may help in high-volume situations, take care in assuming that these settings can be adopted as “standard configurations” and applied unilaterally.
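As a minimal sketch (the database name is hypothetical, the parameter name is given as listed in the tuning table above, and depending on the ASE version the RepAgent may need to be restarted for the change to take effect), raising the setting for a single high-volume database looks like:

-- raise the scan batch size for one busy database only - not as a blanket standard
exec sp_config_rep_agent trading_db, 'scan batch size', '10000'
go
-- running the procedure with just the database name reports the current settings
exec sp_config_rep_agent trading_db
go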

Rep Agent Priority

Beyond a doubt, the most frequently asked-for feature for the ASE Replication Agent thread was the ability to increase its priority. As of ASE 12.5, this is possible. Within ASE, there are 8 priority levels with the lower levels having the highest execution priority (similar to operating system priorities). These levels are:

Level   Priority   Priority Class            Processes
0       Kernel                               Kernel
1       Reserved
2       Reserved
3       Highest                              Rep Agent highest in 12.5
4       High       EC1 Execution Class
5       Medium     EC2 Execution Class       Default for all users/processes
6       Low        EC3 Execution Class
7       Idle       CPU Maintenance Tasks     Housekeeper

As illustrated above, priorities 3-6 are the only ones associated with user tasks with 4-6 corresponding to the Logical Process Manager’s EC1-EC3 Execution Classes. Although attempted by many, the LPM EC Execution Classes did not apply to the Replication Agent Threads (nor any other system threads). As a result, until ASE 12.5, there was no way to control a Replication Agent’s priority.

What if more than one database is being replicated? How are the cpus distributed to avoid cpu contention, with one engine attempting to service multiple Rep Agents running at the “highest” priority level of 3? At start-up, the RepAgent is affinity-bound to a specific ASE engine; if multiple engines are available, each RepAgent being started will be bound to the next available engine. For example: if max online engines = 4, the first RepAgent will be bound to engine 0 and the second RepAgent will be bound to engine 1. Subsequent Replication Agents are then bound in order to the engines. The RepAgent is then placed at the specified priority on the runnable queue of the affinitied engine. If ASE is unable to affinity bind the RepAgent process to any available engine, ASE error 9206 is raised.

Although a setting of “3” allows a Replication Agent thread to be scheduled more often than user threads, care should be taken to avoid monopolizing a cpu. The best approach for an OLTP system is to set the priority initially to 4 and see how far the Rep Agent lags (after getting caught up in the first place). Then, only if necessary, bump the priority up to 3. If user processes begin to suffer, then additional cpus and engines may have to be added to the primary to avoid Rep Agent lag while maintaining performance. There is a word of caution about this – you may not see any improvement in performance by raising the execution priority in current ASE releases, as the main bottleneck isn't the ASE cpu time, but rather the ASE internal scheduling for network access and the RS ability to process the inbound data to the queue fast enough. Consequently, changing the priority will only have a positive effect when the ASE engine cpu time is being monopolized by user queries. This can be determined by monitoring monProcessWaits for the RepAgent spid/kpid. If a significant amount of time is spent waiting on the cpu (WaitEventIDs 214 & 215), increasing the priority of the RepAgent may help. If not, increasing the priority will do little as the actual cause is elsewhere.
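A minimal sketch of that check, assuming the MDA tables are installed and the RepAgent’s spid has already been identified (for example from sp_who); the spid value and database name below are hypothetical:

-- how much time is the RepAgent spending waiting for an engine (WaitEventIDs 214/215)?
select SPID, KPID, WaitEventID, Waits, WaitTime
  from master..monProcessWaits
 where SPID = 42                        -- hypothetical RepAgent spid
   and WaitEventID in (214, 215)
go
-- only if engine waits dominate, consider raising the priority
exec sp_config_rep_agent pubs2, 'priority', '4'
go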

Send_buffer_size

As noted above, the send_buffer_size parameter really affects three things:

1. The size of the internal buffer used to hold LTL until sent to the Replication Server

2. The amount of LTL sent each time

3. The packet size used to communicate with the Replication Server

The last has been an extremely frequent request – to be able to control the size of the packets the Replication Agent uses – similar to the db_packet_size DSI tuning parameter. It should be noted that the earlier LTM’s already had an internal buffer of 16K; however, when the Replication Agent was internalized in ASE 11.5, this buffer was reduced to 2K – more than likely to reduce the latency during low to mid volume situations. Consequently, before the packet size could be adjusted, the internal buffer also had to be adjusted. By allowing the user to specify the size of the internal buffer/packet size, optimal network utilization can be achieved.

While the 2K setting at first glance may seem the logical choice, for high volume systems, it may not be the optimal setting. The transport layer limits the TCP packet size to the maximum network interface frame size to avoid fragmentation. In terms of effort, significant work is involved in preparing data for transfer. The process of dividing data into multiple packets for transfer, managing the TCP/IP layers and handling network interrupts requires significant CPU involvement. The more data is segmented into packets, the more CPU resources are needed. As a result, the maximum frame size supported by the networking link layer has an impact on CPU utilization. TCP/IP typically penalizes systems that transmit a large number of small packets.

Additionally, within the Replication Server, the processing of the Replication Agent user thread and SQM is nearly synchronous for recovery reasons. The Replication Server does not acknowledge that the data from the Replication Agent has been received until it has been written to disk. As a result, even without the scan_batch_size, there is an implicit sync point every 2K of data from servers previous to ASE 12.5. If a new segment needs to be allocated, this could involve an update to the RSSD to record the new space allocation. As a result, by increasing the send_buffer_size, the number of sync points is decreased and overall network efficiency improved. To aid in this, ASE 12.0.0.7+ and 12.5.0.3+ added several new sysmon counters. These counters are described in much more detail in the "Replication Agent Troubleshooting: Using sp_sysmon" section below.
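A minimal sketch of raising the buffer/packet size (the parameter name is given as listed in the tuning table and the database name is hypothetical; the restart is an assumption – check whether your ASE version requires it):

exec sp_config_rep_agent pubs2, 'send buffer size', '16K'
go
-- restart the RepAgent so the new buffer/packet size is used
exec sp_stop_rep_agent pubs2
exec sp_start_rep_agent pubs2
go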

Structured Tokens

Heterogeneous Replication Agents have had the capability for a while to send the Replication Server structured tokens and shortened key words. Structured tokens are a mechanism for dramatically reducing the network traffic caused by replication, specifically by reducing the amount of overhead in the LTL protocol and compressing the data values. In the full structured token implementation, this is achieved in a number of ways, including using shortened LTL key words, structured tokens for data values, etc. As of ASE 12.5, some of these capabilities have been introduced in the Replication Agent thread internal to ASE. These two new parameters, send_structured_oqids and short_ltl_keywords, focus strictly on reducing the overhead of the LTL protocol and do not attempt to reduce the actual column values themselves. For example, using short LTL keywords, the “distribute” command is represented by the token “_ds”. While a savings of 7 bytes for one command may not appear that great, the average LTL distribute command would be shortened by a total of 20 bytes.
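Before looking at the effect on the wire, here is a minimal sketch of enabling both options (parameter names as listed in the tuning table above; the database name is hypothetical and a RepAgent restart is assumed to be required for them to take effect):

exec sp_config_rep_agent pubs2, 'send structured oqids', 'true'
exec sp_config_rep_agent pubs2, 'short ltl keywords', 'true'
go
exec sp_stop_rep_agent pubs2
exec sp_start_rep_agent pubs2
go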

For example, let’s say we want to add this white paper to the list of titles in pubs2 (ignoring the author referential integrity to keep things simple). We would use the following SQL statements:

begin tran add_book
insert into publishers values ('9990', 'Sybase, Inc.', 'Dublin', 'CA')
insert into titles (title_id, title, type, pub_id, price, advance, total_sales, notes,
                    pubdate, contract)
values ('PC9900', 'Replication Server Performance & Tuning', 'popular_comp', '9990',
        0.00,   -- free to all good Sybase customers
        0.00,   -- contrary to belief, we didn't get paid extra
        100,    -- make up a number for number of times downloaded
        'This what happens on sabbaticals taken by geeks - and why Sybase still offers them',
        'November 1, 2000',
        0)      -- we wish - make us an offer
commit tran


Tracing the LTL under normal replication (see below), we get the following LTL stream: REPAGENT(4): [2002/09/08 17:55:12.23] The LTL packet sent is of length 1097. REPAGENT(4): [2002/09/08 17:55:12.23] _ds 1 ~*620020908 17:55:32:543,4 0x000000000000445800000c40000300000c400003000092810127681300000000,6 0x000000000000445800034348494e4f4f4b7075627332 _bg tran ~")add_book for ~"#sa _ds 4 0x000000000000445800000c40000400000c400003000092810127681300000000,6 0x000000000000445800034348494e4f4f4b7075627332 _ap owner =~"$dbo ~"+publishers.~!*rs_insert _yd _af ~$'pub_id=~"%%9990,~$)pub_name=~"-Sybase, Inc.,~$%%city=~"'Dublin,~$&state=~"#CA _ds 4 0x000000000000445800000c40000500000c40000 REPAGENT(4): [2002/09/08 17:55:12.23] 3000092810127681300000000,6 0x000000000000445800034348494e4f4f4b7075627332 _ap owner =~"$dbo ~"'titles.~!*rs_insert _yd _af ~$)title_id=~"'PC9900,~$&title=~"HReplication Server Performance & Tuning,~$%%type=~"-popular_comp,~$'pub_id=~"%%9990,~$&price=~(($0.0000,~$(advance=~(($0.0000,~$,total_sales=100 ,~$&notes=~#"3This what happens on sabbaticals taken by geeks - and why Sybase still offers them,~$(pubdate=~*620001101 00:00:00:000,~$)contract=0 _ds 1 ~*620020908 17:55:32:543,4 REPAGENT(4): [2002/09/08 17:55:12.23] 0x000000000000445800000c40000700000c400003000092810127681300000000,6 0x000000000000445800034348494e4f4f4b7075627332 _cm tran

Turning on both short_ltl_keywords and structured oqids, we get the following: REPAGENT(4): [2002/09/08 17:55:46.24] The LTL packet sent is of length 958. REPAGENT(4): [2002/09/08 17:55:46.24] distribute 1 ~*620020908 17:55:45:543,4 ~,A[000000000000]DX[00000c]@[00]'[00000c]@[00]'[0000928101]'wO[00000000],6 ~,7[000000000000]DX[00]'CHINOOKpubs2 begin transaction ~")add_book for ~"#sa distribute 4 ~,A[000000000000]DX[00000c]@[00]([00000c]@[00]'[0000928101]'wO[00000000],6 ~,7[000000000000]DX[00]'CHINOOKpubs2 applied owner =~"$dbo ~"+publishers.~!*rs_insert yielding after ~$'pub_id=~"%%9990,~$)pub_name=~"-Sybase, Inc.,~$%%city=~"'Dublin,~$&state=~"#CA distribute 4 ~,A[0000] REPAGENT(4): [2002/09/08 17:55:46.24] [00000000]DX[00000c]@[00])[00000c]@[00]'[0000928101]'wO[00000000],6 ~,7[000000000000]DX[00]'CHINOOKpubs2 applied owner =~"$dbo ~"'titles.~!*rs_insert yielding after ~$)title_id=~"'PC9900,~$&title=~"HReplication Server Performance & Tuning,~$%%type=~"-popular_comp,~$'pub_id=~"%%9990,~$&price=~(($0.0000,~$(advance=~(($0.0000,~$,total_sales=100 ,~$&notes=~#"3This what happens on sabbaticals taken by geeks - and why Sybase still offers them,~$(pubdate=~*620001101 00:00:00:000,~$)con REPAGENT(4): [2002/09/08 17:55:46.24] tract=0 distribute 1 ~*620020908 17:55:45:543,4 ~,A[000000000000]DX[00000c]@[00]+[00000c]@[00]'[0000928101]'wO[00000000],6 ~,7[000000000000]DX[00]'CHINOOKpubs2 commit transaction

** A couple of comments – this is ASE 12.5 LTL (version 300). Some examples in this document use older LTL versions and were traced from the EXEC module; consequently, they may look slightly different.

As you can see from the first example, with short_ltl_keywords set to ‘false’, the LTL command verbs are replaced with what look almost like abbreviations. As mentioned in the table, the ‘false’ setting appears to be backwards for short_ltl_keywords, as setting it to ‘true’ along with structured oqids results in the second sequence. Note that the column names, datatype tokens, length tokens and data values remain untouched in both streams. The LAN replication agent used for heterogeneous replication is capable of stripping out the column names as it reads the column order from the replication definition and formats the columns in the stream accordingly.

Schema Cache Growth Factor

As mentioned earlier, the Rep Agent contains 2 caches - a schema cache and a transaction cache. The transaction cache is used to store open transactions. The other cache (the topic of this section) basically caches components from sysobjects and syscolumns. In 11.x it used to be made up from proc cache; however, as of 12.0, it uses its own memory outside of the main ASE pool. Each cache item essentially is a row from sysobjects and associated child rows from syscolumns in a hash tree. Accordingly, it follows an LRU/MRU chain much like any cache in ASE - consequently, more frequently hit tables will be in cache while those hit infrequently will get aged out. When the rep agent reads a DML before/after image from the log it first checks this cache. If not found, then it has to do a lookup in sysobjects and syscolumns (hopefully in metadata cache and not physical i/o - a hash table lookup in schema cache is quicker than a logical i/o in metadata cache).

The schema cache can "grow" in one of two ways - (A) either a large number of objects are replicated and the transaction distribution is fairly even across all objects (rare - most transactions only impact <10 tables), and/or (B) the structure of tables/columns are being modified. You can watch the growth with RA trace 9208 - if it stays consistent then you are fine. Customers with the most issues similar to (A) are those replicating a lot of procs as you can have a lot of procs modifying a small number of tables. Customers with the most issues similar to (B) are those that tend to change the DDL to tables/procs frequently. The reason is that RA needs to send the correct version of the schema at the point that the DML happened. As a result, if you insert a row, add another column, insert another row, RA needs to send the appropriate info for each - i.e. don't send the new column for the old row, nor ignore it for the new one. As a result, the schema cache may grow (somewhat).


The RepAgent config "schema cache growth factor" is a factor - not a percentage - consequently it is extremely sensitive. In other words, setting it to 2 doubles the size of the cache, while 3 triples the size. Depending on the hardware platform, other processing on the box, etc., anything above 3 may not be recommended. Hence, unless you have over 100 objects being replicated per database, setting this above 1 is probably useless.
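A minimal sketch of adjusting it (the database name is hypothetical; the parameter name is as listed in the tuning table, and growth can then be watched with trace flag 9208 as described in the next section):

-- double the schema cache only if many replicated objects (or frequent DDL) are in play
exec sp_config_rep_agent pubs2, 'schema cache growth factor', '2'
go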

Replication Agent Troubleshooting

There are several commands for troubleshooting the Rep Agent. At a basic level, sp_help_rep_agent can help track where in the log and how much of the log the Rep Agent is processing. However, for performance related issues, sp_sysmon ‘RepAgent’ or the MDA based monProcessWaits table are the best bets.

RepAgent Trace Flags

However, for tougher problems, several trace flags exist.

Trace Flag   Trace Output
9201         Traces LTL generated and sent to RS
9202         Traces the secondary truncation point position
9203         Traces the log scan
9204         Traces memory usage
9208         Traces schema cache growth factor

Output from the trace flags is to the specified output file. The trace flags and output file are specified using the normal sp_config_rep_agent procedure as in the following:

sp_config_rep_agent <db_name>, 'trace_log_file', '<filepathname>'
sp_config_rep_agent <db_name>, 'traceon', '9204'
-- monitor for a few minutes
sp_config_rep_agent <db_name>, 'traceoff', '9204'

However, tracing the Rep Agent has a considerable performance impact as the Rep Agent must also write to the file. In the case of LTL tracing (9201), this can be considerable. As a result, Rep Agent trace flags should only be used when absolutely necessary. For NT, note that you will need to escape the backslashes in the file path by doubling them, as in:

exec sp_config_rep_agent pubs2, 'trace_log_file', 'c:\\ltl_verify.log'

Determining RepAgent Latency

Another useful command when troubleshooting the Replication Agent is the sp_help_rep_agent procedure call. Of particular interest are the columns that report the transaction log endpoints and the Rep Agent position – “start marker”, “end marker”, and “current marker”. The problem is that these are reported as logical pages on the virtual device(s). This can lead to frequent accusations that the Replication Agent is always many GB behind. Remember, the logical page ids are assigned in device fragment order. Consider the following example database creation script (assume a 2K page server):

create database sample_db on data_dev_01=4000 log on log_dev_01=250
go
alter database sample_db on data_dev_02=4000 log on log_dev_01=2000
go

This would more than likely result in a sysusages similar to (dbid for sample_db=6):

Dbid   Segmap   Lstart    Size      Vstart …
6      3        0         2048000   (…)
6      4        2048000   128800
6      3        2176800   2048000
6      4        4224800   1024000

Executing sp_help_rep_agent sample_db could yield the following marker positions:


Start Marker   End Marker   Current Marker
2148111        4229842      2166042

Those quicker with the calculator than familiar with the structure of the log would erroneously conclude that the Rep Agent is running ~4GB behind (4229842-2166042=2063800; 2063800/512=4031MB) – a good trick when the transaction log is only slightly bigger than 2GB. In reality, the Replication Agent is only 31MB behind ( (4229842-4224800)+(2176800-2166042)=5042+10758=15800; 15800/512=31 ) - assuming that the end marker points to the final page of the log.

One of the most misunderstood aspects of sp_help_rep_agent is the “scan” output - the markers (as listed above) as well as the “log records scanned”. For the first part, once the XLS wakes up the Rep Agent from a sleeping state, the start and end markers are set to the current log positions. The Rep Agent commences scanning from that point. As it nears the end marker, it requests an update - and may get a new end marker position. The “log records scanned” works similarly but on a more predictable basis. If you remember, one of the Rep Agent configuration settings is “scan batch size”, which has a default value of 1,000. At the default value, monitoring this value can be extremely confusing. However, setting “scan batch size” to a more reasonable value of 10,000 or 25,000 clears it up. What the “scan” section of sp_help_rep_agent is reporting in the “log records scanned” is the number of records scanned towards the “scan batch size”. Once the “scan batch size” number of records is reached, the counter is reset. This is what causes the confusion - particularly when just using the default, as the Rep Agent is capable of scanning 1,000 records a second from the transaction log. Some administrators have attempted to run sp_help_rep_agent every second and were extremely surprised to see little or no change in the “log records scanned” (or even a drop). The reason is that the Rep Agent was working on subsequent scan batches. Consider the following output from a sample scan:

start marker   end marker   current marker   log recs scanned   recs/sec   scan cnt   tot recs

(133278,22) (134923,20) (133493,3) 3587 0 1 3587

(133278,22) (134923,20) (133594,15) 4807 1220 1 4807

(133278,22) (134923,20) (133681,14) 5841 1034 1 5841

(133278,22) (134923,20) (133765,11) 6849 1008 1 6849

(133278,22) (134923,20) (133849,5) 7856 1007 1 7856

(133278,22) (134923,20) (133931,9) 8810 954 1 8810

(133278,22) (134923,20) (134037,6) 10083 1273 1 10083

(133278,22) (134923,20) (134116,15) 11038 955 1 11038

(133278,22) (134923,20) (134201,15) 12048 1010 1 12048

(133278,22) (134923,20) (134294,7) 13163 1115 1 13163

(133278,22) (134923,20) (134378,3) 14171 1008 1 14171

(133278,22) (134923,20) (134471,19) 15286 1115 1 15286

(133278,22) (134923,20) (134562,8) 16375 1089 1 16375

(133278,22) (134923,20) (134658,5) 17516 1141 1 17516

(133278,22) (134923,20) (134726,20) 18341 825 1 18341

(133278,22) (134923,20) (134824,0) 19509 1168 1 19509

(133278,22) (134923,20) (134902,5) 20437 928 1 20437

(134923,20) (137410,2) (135000,9) 21605 1168 1 21605

(134923,20) (137410,2) (135084,5) 22613 1008 1 22613

(134923,20) (137410,2) (135169,0) 23621 1008 1 23621

(134923,20) (137410,2) (135266,5) 24790 1169 1 24790

(134923,20) (137410,2) (135371,23) 1061 1271 2 26061

(134923,20) (137410,2) (135447,23) 1963 902 2 26963

(134923,20) (137410,2) (135549,13) 3184 1221 2 28184

(134923,20) (137410,2) (135642,7) 4300 1116 2 29300

(134923,20) (137410,2) (135725,0) 5283 983 2 30283


(134923,20) (137410,2) (135815,12) 6371 1088 2 31371

(134923,20) (137410,2) (135904,23) 7433 1062 2 32433

(134923,20) (137410,2) (135985,11) 8389 956 2 33389

(134923,20) (137410,2) (136091,11) 9663 1274 2 34663

(134923,20) (137410,2) (136188,15) 10832 1169 2 35832

(134923,20) (137410,2) (136277,23) 11894 1062 2 36894

(134923,20) (137410,2) (136370,17) 13009 1115 2 38009

(134923,20) (137410,2) (136451,5) 13965 956 2 38965

(134923,20) (137410,2) (136566,0) 15346 1381 2 40346

(134923,20) (137410,2) (136636,16) 16195 849 2 41195

(134923,20) (137410,2) (136712,18) 17098 903 2 42098

(134923,20) (137410,2) (136803,8) 18187 1089 2 43187

(134923,20) (137410,2) (136895,11) 19275 1088 2 44275

(134923,20) (137410,2) (136980,9) 20284 1009 2 45284

(134923,20) (137410,2) (137068,17) 21346 1062 2 46346

(134923,20) (137410,2) (137161,11) 22463 1117 2 47463

(134923,20) (137410,2) (137244,9) 23447 984 2 48447

(134923,20) (137410,2) (137339,7) 24588 1141 2 49588

(137410,2) (137416,9) (137416,9) 513 925 3 50513

Before asking about where the 3 right-most columns are in your sp_help_rep_agent output, the above output is from a modified version of sp_help_rep_agent. The above output was taken from an NT stress test of a 50,000 row update done by 10 parallel tasks. The Rep Agent was configured for a ‘scan batch size’ of 25,000. As you can see from the first highlighted section, as the current marker approached the end marker, the end marker was updated. The second highlighted area illustrates the ‘scan batch size’ rollover effect on ‘log recs scanned’.

Adding more fun to the problem of determining the latency is the fact that the transaction log is a circular log; consequently, it is possible for the markers to have wrapped around. The logic for calculating the latency is:

1. Determine the distance from the current marker to the end of its segment in sysusages.

2. Add the space for all other log segments between the current segment and the segment containing the end marker.

3. Add the distance of the end marker within the end marker’s segment.
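A minimal sketch of that calculation against sysusages is shown below. It assumes a 2K page server (512 pages per MB), that the log has not wrapped between the two markers, that log fragments are the ones with the log segment bit (4) set in segmap, and that the marker page numbers are plugged in by hand from sp_help_rep_agent output:

declare @dbid int, @current_marker int, @end_marker int
select @dbid           = db_id('sample_db'),
       @current_marker = 2166042,        -- current marker from sp_help_rep_agent
       @end_marker     = 4229842         -- end marker from sp_help_rep_agent

select repagent_lag_MB =
       sum(case
             -- both markers fall within the same log fragment
             when @current_marker between lstart and lstart + size - 1
              and @end_marker     between lstart and lstart + size - 1
               then @end_marker - @current_marker
             -- remainder of the fragment containing the current marker
             when @current_marker between lstart and lstart + size - 1
               then (lstart + size) - @current_marker
             -- offset of the end marker within its own fragment
             when @end_marker between lstart and lstart + size - 1
               then @end_marker - lstart
             -- whole log fragments lying between the two markers
             when lstart > @current_marker and lstart + size - 1 < @end_marker
               then size
             else 0
           end) / 512.0                   -- 512 two-K pages per MB
  from master..sysusages
 where dbid = @dbid
   and segmap & 4 = 4                     -- log segment fragments only

Using the sysusages and marker values from the example above, this returns roughly 31MB – the same figure computed by hand.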

Unfortunately, there isn't a built-in function that returns the last log page. If there are not any other open transactions, perhaps the easiest way is to begin a transaction, update some row – and then check in master..syslogshold. Otherwise, one way to find the last log page is to use dbcc log as in:

use pubs2
go
begin tran mytran
rollback tran
go
dbcc traceon(3604)
go
-- dbid=5, obj=0, page=0, row=0, recs = last one, all recs, header only
dbcc log(5, 0, 0, 0, -1, -1, 1)
go

DBCC execution completed. If DBCC printed error messages, contact a user with System Administrator (SA) role.
LOG SCAN DEFINITION:
    Database id : 5
    Backward scan: starting at end of log maximum of 1 log records.

Page 50: Sybase Replication Server Performance & Tuningdocshare01.docshare.tips/files/6386/63864922.pdf · SYBASE REPLICATION SERVER PERFORMANCE AND TUNING Understanding and Achieving Optimal

Final v2.0.1

44

LOG RECORDS: ENDXACT (13582,14) sessionid=13582,13 attcnt=1 rno=14 op=30 padlen=0 sessionid=13582,13 len=28 odc_stat=0x0000 (0x0000) loh_status: 0x0 (0x00000000) endstat=ABORT time=Oct 22 2004 11:26:39:166AM xstat=0x0 [] Total number of log records 1 DBCC execution completed. If DBCC printed error messages, contact a user with System Administrator (SA) role. Normal Termination Output completed (0 sec consumed).…

Here the first number in parentheses is the current log page (and row) – note that the sessionid points to the log page and row containing the transaction's begin record (since this was an empty transaction, it immediately precedes the end record).

Another alternative, which instead measures the time since the secondary truncation point was last updated, uses the following query:

-- executed from the current database
select db_name(dbid), stp_lag = datediff(mi, starttime, getdate())
from master..syslogshold
where name = '$replication_truncation_point'
  and dbid = db_id()

This tells how far behind in minutes the Replication Server is (kind of). The problem is that it can be highly inaccurate. Remember, the STP points to the page containing the oldest open transaction that the Replication Server has processed. If a user began a transaction and went to lunch, the STP won't move until that transaction is committed. Unfortunately, this may give the impression that the Replication Agent is lagging when in reality the current marker may be very near the end of the transaction log. The second reason this can be inaccurate is a matter of interpretation. Simply because the gap between the STP and the current oldest open transaction is 30 minutes does not mean that the Rep Agent will take 30 minutes to scan that much of the log – consider the case where the Rep Agent is down, or a low-volume system. Hence the suggestion made earlier to either invoke your own transaction (and add a where clause of 'spid = @@spid' to the above) or just grab the latest entry and hope that it isn't a user gone to lunch.
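A minimal sketch of the "invoke your own transaction" idea follows. It is only reliable when no other transactions are open in the database, and the scratch table (rep_lag_marker) and its column are hypothetical – any small, otherwise idle table in the primary database will do:

-- open a short transaction so this spid shows up in master..syslogshold
-- (only meaningful if no older transaction is open in this database)
begin tran lag_probe
update rep_lag_marker set touch_dt = getdate()
go
-- compare the replication truncation point row to our own transaction,
-- which approximates the current end of the log
select stp_page = stp.page, end_of_log_page = mine.page,
       stp_lag_minutes = datediff(mi, stp.starttime, mine.starttime)
from master..syslogshold stp, master..syslogshold mine
where stp.name = '$replication_truncation_point'
  and stp.dbid = db_id()
  and mine.dbid = db_id()
  and mine.spid = @@spid
go
rollback tran
go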

Using sp_sysmon

Most DBA’s are familiar with sp_sysmon – until the advent of the MDA monitoring tables in 12.5.0.3, this procedure was the staple for most database monitoring efforts (unfortunately so, as Historical Server provided more useful information and yet was rarely implemented). A little known fact is that while the default output for sp_sysmon does not include RepAgent performance statistics, executing the procedure and specifically asking for the “repagent” report does provide more detailed information than what is available via sp_help_rep_agent. The syntax is:

-- sample the server for a 1 minute period and then output the repagent report
exec sp_sysmon "00:01:00", "repagent"

While the output is described in chapter 5 of the Replication Server Administration Guide, some of the main points of interest are repeated below (header lines repeated for clarity):

                                  per sec      per xact       count  % of total
                             ------------  ------------  ----------  ----------
  Log Scan Summary
    Log Records Scanned               n/a           n/a      206739         n/a
    Log Records Processed             n/a           n/a      105369         n/a

The log scan summary section is a good indicator of how much work the RepAgent is doing – and how much information is being sent to the Replication Server. The difference between 'Log Records Scanned' and 'Log Records Processed' is fairly obvious – 'Processed' records were converted into LTL and sent to the RS. In the example above, ~50% of the log records scanned were sent to the RS.

                                  per sec      per xact       count  % of total
                             ------------  ------------  ----------  ----------
  Log Scan Activity
    Updates                           n/a           n/a      101317         n/a
    Inserts                           n/a           n/a          19         n/a
    Deletes                           n/a           n/a           0         n/a
    Store Procedures                  n/a           n/a           0         n/a
    DDL Log Records                   n/a           n/a           0         n/a
    Writetext Log Records             n/a           n/a           0         n/a
    Text/Image Log Records            n/a           n/a           0         n/a
    CLRs                              n/a           n/a           0         n/a

The Log Scan Activity contains some useful information if you think something is occurring out of the norm. While the first four are fairly obvious (updates, inserts, deletes and proc execs replicated), the last three bear some attention. 'DDL Log Records' refers to DDL statements that were replicated – generally this should be zero, with only minor lifts in a Warm Standby when DDL changes are made – hence we exclude this from concern. 'Writetext Log Records' shows how many writetext operations are being replicated. 'Text/Image Log Records' is similar but slightly different in that it displays how many row images are processed (we need to confirm whether this is rs_datarow_for_writetext or the actual number of text rows). If you see a large number of text rows being replicated, you may want to investigate whether a text/image column was inappropriately marked or left at "always_replicate" vs. "replicate_if_changed". CLRs refer to Compensation Log Records – and clearly point to a design problem with indexes using ignore_dup_row or ignore_dup_key (discussed earlier in the Primary Database section on Batch Processing & ignore_dup_key). More detail about which tables were updated/inserted/deleted can be obtained from the MDA monitoring tables in 12.5.0.3+ – specifically the monOpenObjectActivity table, which has the following definition:

-- ASE 15.0.1 definition
create table monOpenObjectActivity (
    DBID              int,
    ObjectID          int,
    IndexID           int,
    DBName            varchar(30)  NULL,
    ObjectName        varchar(30)  NULL,
    LogicalReads      int          NULL,
    PhysicalReads     int          NULL,
    APFReads          int          NULL,
    PagesRead         int          NULL,
    PhysicalWrites    int          NULL,
    PagesWritten      int          NULL,
    RowsInserted      int          NULL,
    RowsDeleted       int          NULL,
    RowsUpdated       int          NULL,
    Operations        int          NULL,
    LockRequests      int          NULL,
    LockWaits         int          NULL,
    OptSelectCount    int          NULL,
    LastOptSelectDate datetime     NULL,
    UsedCount         int          NULL,
    LastUsedDate      datetime     NULL
) materialized at "$monOpenObjectActivity"
go

Since the above is at the index level, though, you will need to restrict the query to IndexID 0 or 1 to avoid picking up rows inserted into the index nodes (which, of course, are not replicated).
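As a minimal sketch (assuming the MDA proxy tables are installed in master and substituting your own database name for the placeholder), a query along these lines shows per-table DML activity:

-- per-table DML counts in the target database; IndexID 0/1 restricts the
-- counts to data pages so index-node inserts are excluded
select DBName, ObjectName, RowsInserted, RowsUpdated, RowsDeleted,
       TotalDML = RowsInserted + RowsUpdated + RowsDeleted
from master..monOpenObjectActivity
where DBID = db_id('<tgt_db_name>')
  and IndexID in (0, 1)
order by 6 desc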

Some may have noticed in the Scan Activity that ~4,000 of the log records sent to the RS were not DML statements (105,369 processed – 101,336 DML = 4,033) – most of these are transaction records, as seen in the next section.

                                  per sec      per xact       count  % of total
                             ------------  ------------  ----------  ----------
  Transaction Activity
    Opened                            n/a           n/a        2015         n/a
    Commited                          n/a           n/a        2016         n/a
    Aborted                           n/a           n/a           0         n/a
    Prepared                          n/a           n/a           0         n/a
    Maintenance User                  n/a           n/a           0         n/a

Here are the missing 4,000 records – since each transaction is a begin/commit pair, 2015+2016=4031 records were sent to the Replication Server. Most of the above statistics should be fairly obvious, except 'Prepared' – which refers to two-phase commit (2PC) 'prepare transaction' records that are part of the commit coordination phase. 'Maintenance User' refers, of course, to maintenance-user applied transactions that are in turn re-replicated. Normally, this should be zero, but if a logical Warm Standby is also the target of a different replication source, then the primary database in the logical pair is responsible for re-replicating the data to the standby database. The transaction flow is SourceDB → RS → PrimaryDB → RS/WS → StandbyDB, as illustrated below:


Figure 8 – Path for External Replicated Transactions in Warm Standby (Remote Site → HQ Warm Standby; direct replication vs. re-replicated paths)

The next section of the Rep Agent sp_sysmon output is the ‘Log Extension’ section:

                                  per sec      per xact       count  % of total
                             ------------  ------------  ----------  ----------
  Log Extension Wait
    Count                             n/a           n/a           2         n/a
    Amount of time (ms)               n/a           n/a       14750         n/a
    Longest Wait (ms)                 n/a           n/a       14750         n/a
    Average Time (ms)                 n/a           n/a      7375.0         n/a

Here, waiting is not 'bad' – it refers to the time that the Rep Agent was fully caught up and waiting for more log records to be added to the transaction log. Obviously a count of zero is not desired. In the above example, from a 1-minute sysmon taken during heavy update activity, the RepAgent caught up twice and waited ~7 seconds each time for more information to be added to the log. Or so it seems. In reality, the RepAgent was waiting when the sp_sysmon started, then the 100,000 updates occurred in ~2,000 transactions – taking 45 seconds to process – and then the RepAgent was caught up again. So if you are benchmarking, remember that the RepAgent wait count may reflect the state before and after the benchmark run.

The next section of the RepAgent sp_sysmon report is only handy from the unique perspective of DDL replication.

                                  per sec      per xact       count  % of total
                             ------------  ------------  ----------  ----------
  Schema Cache Lookups
    Forward Schema
      Count                           n/a           n/a           0         n/a
      Total Wait (ms)                 n/a           n/a           0         n/a
      Longest Wait (ms)               n/a           n/a           0         n/a
      Average Time (ms)               n/a           n/a         0.0         n/a
    Backward Schema
      Count                           n/a           n/a           0         n/a
      Total Wait (ms)                 n/a           n/a           0         n/a
      Longest Wait (ms)               n/a           n/a           0         n/a
      Average Time (ms)               n/a           n/a         0.0         n/a

When a table is altered via alter table, the RepAgent may have to scan forward/backward in the log to determine the correct column names, datatypes, etc. to send to the Replication Server. One way to think of this is from the perspective of someone doing an alter table and then shutting down the system. On startup, the RepAgent can't just use the schema from sysobjects/syscolumns, because some of the log records may contain rows that had extra or fewer columns. Consequently, it may have to scan backwards to find the alter table record and determine the appropriate columns to send. Incidentally, this is done using an auxiliary scan separate from the main log scan, which is why the Rep Agent will often be seen with two scan descriptors active in the transaction log.

The next section is one of the more useful:

                                  per sec      per xact       count  % of total
                             ------------  ------------  ----------  ----------
  Truncation Point Movement
    Moved                             n/a           n/a         107         n/a
    Gotten from RS                    n/a           n/a         107         n/a

As expected, this is reporting the number of times the RepAgent has asked the RS for a new secondary truncation point and then moved the secondary truncation point in the log. If 'Moved' is more than one less than 'Gotten', the likely cause is that a large or open transaction exists from the Replication Server's perspective (either it is indeed still open in ASE, or the RepAgent just hasn't forwarded the commit record yet). The number above is not necessarily high – you can gauge it by dividing 'Log Records Processed' by the RepAgent's scan_batch_size configuration, which was the default of 1,000 in this case. With 105,000 records processed, you would expect at least 105 truncation point movements, plus one when the end of the log is reached, so 107 is not abnormal. However, that is about 2/sec – so increasing scan_batch_size in this case should not have too detrimental an impact on recovery. Note that in this discussion we are talking about 'Log Records Processed' and not 'Scanned' – while records can be scanned very quickly, the RepAgent isn't foolish enough to ask for a new truncation point every 1,000 records scanned; the request is actually based on the number of records sent to the RS.

The connection section is really only useful when you are having network problems – and should be accompanied by the normal errors in the ASE errorlog:

                                  per sec      per xact       count  % of total
                             ------------  ------------  ----------  ----------
  Connections to Replication Server
    Success                           n/a           n/a           0         n/a
    Failed                            n/a           n/a           0         n/a

The next section is also one to pay attention to – the network activity:

                                  per sec      per xact       count  % of total
                             ------------  ------------  ----------  ----------
  Network Packet Information
    Packets Sent                      n/a           n/a       16860         n/a
    Full Packets Sent                 n/a           n/a       14962         n/a
    Largest Packet                    n/a           n/a        2048         n/a
    Amount of Bytes Sent              n/a           n/a    30955391         n/a
    Average Packet                    n/a           n/a      1836.0         n/a

In the above case, it shows that bumping up the send_buffer_size may help. We were using the default 2K packets between the RepAgent and the Replication Server, and nearly 90% of them were full. The bottom statistic, 'Average Packet', is simply 'Amount of Bytes Sent' divided by 'Packets Sent' and can be misleading. Remember, the RepAgent requested a new truncation point 107 times – requests sent in separate packets from the LTL buffers – requests that can skew the average. The more important statistic to watch is Full vs. Sent. For those who have been following along, this 45-second update sent ~30MB to the RS – a rate of 40MB/min, or >2GB/hr. Of course, having the RS on a box with fast CPU's greatly helped in this case, as will be discussed later. Note too that we are sending 2K buffers which include column names – hence the number of packets in this case is probably much larger than the number of log pages scanned – perhaps 20% more.
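A minimal sketch of that change is shown below; pubs2 is simply the example database, and RepAgent configuration values such as this typically take effect only after the RepAgent is restarted:

-- bump the RepAgent packet size from the default 2K to 8K for this database
exec sp_config_rep_agent pubs2, 'send buffer size', '8192'
go
exec sp_stop_rep_agent pubs2
go
exec sp_start_rep_agent pubs2
go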

The final section of the report offers perhaps the biggest clues into why the RepAgent may be lagging:

                                  per sec      per xact       count  % of total
                             ------------  ------------  ----------  ----------
  I/O Wait from RS
    Count                             n/a           n/a       16966         n/a
    Amount of Time (ms)               n/a           n/a       11002         n/a
    Longest Wait (ms)                 n/a           n/a          63         n/a
    Average Wait (ms)                 n/a           n/a         0.6         n/a

In this sample, the RepAgent waited on the RS to conduct I/O nearly 17,000 times. Now then, compare this with the statistic above on the number of packets and you will see the problem with RepAgent performance – a lot of hurry-up-and-wait. It can scan the log at a fairly tremendous speed, but then has to wait for the RS to parse the LTL, normalize it against the replication definitions, pack it into a binary format and send it to the SQM – an average of just over half a millisecond of wait every time a packet is sent, adding up to ~11 seconds over the 45-second run (and yes, we did ask twice a second for a truncation point as well).

Let’s take a look at an actual snapshot from a customer’s system. The following statistics are from a 10 minute sp_sysmon – only the transaction and RepAgent sections are reported here:

Engine Busy Utilization       CPU Busy     I/O Busy     Idle
------------------------      --------     --------     --------
  Engine 0                       3.8 %        2.1 %       94.2 %

Transaction Summary               per sec      per xact       count  % of total
-------------------------    ------------  ------------  ----------  ----------
  Committed Xacts                     1.2           n/a         726         n/a

Transaction Detail                per sec      per xact       count  % of total
-------------------------    ------------  ------------  ----------  ----------


  Inserts
    APL Heap Table                    0.7           0.6         419      13.1 %
    APL Clustered Table               0.8           0.6         468      14.7 %
    Data Only Lock Table              3.8           3.2        2301      72.2 %
  -------------------------  ------------  ------------  ----------  ----------
  Total Rows Inserted                 5.3           4.4        3188      99.1 %
  Updates
    Total Rows Updated                0.0           0.0           0         n/a
  -------------------------  ------------  ------------  ----------  ----------
  Total Rows Updated                  0.0           0.0           0       0.0 %
  Data Only Locked Updates
    Total Rows Updated                0.0           0.0           0         n/a
  -------------------------  ------------  ------------  ----------  ----------
  Total DOL Rows Updated              0.0           0.0           0       0.0 %
  Deletes
    APL Deferred                      0.0           0.0          20      71.4 %
    APL Direct                        0.0           0.0           4      14.3 %
    DOL                               0.0           0.0           4      14.3 %
  -------------------------  ------------  ------------  ----------  ----------
  Total Rows Deleted                  0.0           0.0          28       0.9 %
  =========================  ============  ============  ==========
  Total Rows Affected                 5.4           4.4        3216

Replication Agent
-----------------                                              count
                                                          ----------
  Log Scan Summary
    Log Records Scanned                                        81061
    Log Records Processed                                      19015
  Log Scan Activity
    Updates                                                        0
    Inserts                                                    15845
    Deletes                                                        0
    Store Procedures                                               0
    DDL Log Records                                                0
    Writetext Log Records                                          0
    Text/Image Log Records                                         0
    CLRs                                                           0
  Transaction Activity
    Opened                                                      1585
    Commited                                                    1585
    Aborted                                                        0
    Prepared                                                       0
    Maintenance User                                               0
  Log Extension Wait
    Count                                                          0
    Amount of time (ms)                                            0
    Longest Wait (ms)                                              0
    Average Time (ms)                                            0.0
  Schema Cache Lookups
    Forward Schema
      Count                                                        0
      Total Wait (ms)                                              0
      Longest Wait (ms)                                            0
      Average Time (ms)                                          0.0
    Backward Schema
      Count                                                        0
      Total Wait (ms)                                              0
      Longest Wait (ms)                                            0
      Average Time (ms)                                          0.0
  Truncation Point Movement
    Moved                                                         19
    Gotten from RS                                                19
  Connections to Replication Server
    Success                                                        0
    Failed                                                         0
  Network Packet Information
    Packets Sent                                                9794
    Full Packets Sent                                           8698
    Largest Packet                                              2048
    Amount of Bytes Sent                                    18436223
    Average Packet                                            1882.4
  I/O Wait from RS
    Count                                                       9813
    Amount of Time (ms)                                       107316
    Longest Wait (ms)                                            400
    Average Wait (ms)                                           10.9

Now, the interesting thing about the above: of course the RepAgent was lagging – in fact, it was way behind. Consider the usual suspects:

• "The RepAgent is not scanning the transaction log fast enough" – a common myth, closely followed by "a multi-threaded RepAgent is needed". As you can see from the above, however, the application (bcp in this case) only inserted ~3,000 rows in the 10 minutes, at a rate of 5 rows/sec. The RepAgent processed ~15,000 inserts during the same period – about 5x that rate – so the RepAgent scan isn't the issue.

• "The RepAgent is contending for cpu – need to raise the priority" – another commonly blamed problem (with sp_sysmon, this can now be refuted easily). Looking at the system, we see that ASE is only using ~4% of the cpu – idle the other 96% of the time.

The problem is in the waits on sending to the Replication Server – from the above, the RepAgent spent nearly 2 minutes (107 seconds) of the 10 waiting to send – the key being the long longest wait and the high average wait. In fact, a 1-minute sp_sysmon showed the following:

  I/O Wait from RS
    Count                             n/a           n/a        4869         n/a
    Amount of Time (ms)               n/a           n/a       54363         n/a
    Longest Wait (ms)                 n/a           n/a         323         n/a
    Average Wait (ms)                 n/a           n/a        11.2         n/a

Ugly. The RepAgent is literally waiting 54 seconds out of the 1 minute – so it is only scanning for 6 seconds. Interestingly enough, the issue was not the Replication Server – trace flags were enabled that turned the RS into a data sink, with no appreciable impact on performance. The problem is believed to have been caused by the ASE scheduler not processing the RepAgent's network traffic fast enough. Upgrading from ASE 12.5.0.3 to 12.5.2 did solve the problem, and a 15-minute stress test dropped to 3 minutes.

The point to be made here is that RepAgent speed is directly proportional to the speed of ASE processing the network send requests, coupled with the speed of Replication Server processing. Any contention at the inbound queue (readers delaying writers), delay in getting repdefs into cache (too small an sts_cache_size), or cpu time spent on other threads (DSI, etc.) directly slows down the RepAgent.

Key Concept #6 – The biggest determining factor in RepAgent performance is the speed of ASE sending network data and the speed of the Replication Server – hence, from an RS perspective fewer, faster CPU’s is much better than slow CPU’s on a monster SMP machine. Additionally, enabling SMP, even for a Warm Standby may boost RepAgent performance 10-15% by eliminating CPU contention.

So…don't put RS on an old Sun v880 or v440 with ancient 1.2GHz CPU's when a quad-cpu Opteron-based Linux box or a small SMP machine with fast CPU's costs less than $20,000. Even worse, don't put RS on the Sun 6900 that is apparently under-utilized because it is the DR host machine. DBA's have often fallen into the trap of buying a bigger SMP machine for the DR site and hosting both RS and the standby ASE server on it. It would not only have been cheaper, but also better for performance, to buy a smaller SMP machine for the standby ASE server and a 2-4 way entry-level screamer for RS to run on.


Utilizing monProcessWaits

One of the key MDA monitoring tables added in 12.5.0.3 is the monProcessWaits table. This can be especially useful for RepAgent performance analysis when determining whether the hold up is the time spent doing the log scan or whether it is due to waiting on the network send aspect.

To understand how to use monProcessWaits, the key is to realize that it requires at least two samples to be effective. The reason is that the output values are counters that are incremented from server boot until shutdown. If a counter hits 2 billion, it rolls over and re-increments from the rollover point. Consequently, the time spent waiting is the difference between samples. The other key is that the monProcessWaits table has two parameters – KPID and SPID. Consequently, when focusing on a specific Replication Agent, it will be much faster to first retrieve the RepAgent's KPID and SPID from master..sysprocesses and supply them as SARG values, such as:

declare @ra_kpid int, @ra_spid int
select @ra_kpid = kpid, @ra_spid = spid
  from master..sysprocesses
 where program_name = 'repagent'
   and dbid = db_id('<tgt_db_name>')

select * from monProcessWaits
 where KPID = @ra_kpid and SPID = @ra_spid

waitfor delay "00:05:00"

select * from monProcessWaits
 where KPID = @ra_kpid and SPID = @ra_spid
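Because the values are cumulative counters, a minimal sketch of computing the deltas between the two samples (the temp table names are illustrative – this is just one way to do it) might look like:

declare @ra_kpid int, @ra_spid int
select @ra_kpid = kpid, @ra_spid = spid
  from master..sysprocesses
 where program_name = 'repagent'
   and dbid = db_id('<tgt_db_name>')

-- first sample
select * into #ra_waits_t1 from monProcessWaits
 where KPID = @ra_kpid and SPID = @ra_spid

waitfor delay "00:05:00"

-- second sample
select * into #ra_waits_t2 from monProcessWaits
 where KPID = @ra_kpid and SPID = @ra_spid

-- report only the wait events that moved between the two samples
select t2.WaitEventID,
       Waits    = t2.Waits - t1.Waits,
       WaitTime = t2.WaitTime - t1.WaitTime
  from #ra_waits_t1 t1, #ra_waits_t2 t2
 where t1.WaitEventID = t2.WaitEventID
   and t2.Waits > t1.Waits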

There are very few RepAgent-specific wait events, as most RepAgent waits are due to generic ASE processing rather than anything RepAgent-specific. The RepAgent wait events are:

Wait Event Event Class Description

221 9 replication agent sleeping in retry sleep

222 9 replication agent sleeping during flush

223 9 replication agent sleeping during rewrite

To illustrate the point about the most frequent causes of RepAgent waits, consider the following event descriptions and counter values. In this case, the before and after samples are shown side by side, with the first column being the first sample and the second column the second sample. In each case, only WaitEventID's that showed a difference between the samples are reported.

WaitEventID WaitClassID Description
----------- ----------- --------------------------------------------------
         29           2 wait for buffer read to complete
         31           3 wait for buffer write to complete
        171           8 waiting for CTLIB event to complete
        214           1 waiting on run queue after yield
        222           9 replication agent sleeping during flush

Wait Time from Mon Tables on ASE 12.5.0.3

WaitEventID    t1.Waits    t2.Waits    totWaits    WaitTime    WaitTime totWaitTime
----------- ----------- ----------- ----------- ----------- ----------- -----------
         31           3         120         117           0         100         100
        171        2178       75597       73419       21800      747900      726100
        222           2           4           2       17000       54300       37300

Wait Time from Mon Tables on ASE 12.5.2

WaitEventID    t1.Waits    t2.Waits   Tot.Waits t1.WaitTime t2.WaitTime Tot.WaitTime
----------- ----------- ----------- ----------- ----------- ----------- ------------
         29           2           2           0           0           0            0
         31         283         403         120        1900        3200         1300
        171      223623      306426       82803      410700      593700       183000
        214           3           3           0           0           0            0
        222       13988       13990           2  1032636100  1032659600        23500

In both cases illustrated above, the RepAgent spid is spending far more time waiting on CT-Lib events than on anything else. However, as is evident, the CT-Lib wait time dropped roughly fourfold moving from 12.5.0.3 to 12.5.2 – finally exposing the RepAgent waiting on buffer reads (logical I/O from the log cache) and cpu access. Not surprisingly, the application performance stress test also improved from 13-14 minutes to 3 minutes – a matching fourfold improvement. Some of the wait events, and how they could be interpreted, are summarized in the following table:

29 – wait for buffer read to complete
    Waiting on the log scan – check to see whether the log page at the scan point is within the log cache; possibly use more cache partitions.

31 – wait for buffer write to complete
    Typically, the only writing the RepAgent does is to update the dbinfo structure with the new secondary truncation point – so any large values here could be an indication of more serious problems.

171 – waiting for CTLIB event to complete
    This corresponds directly to the RepAgent transferring data to the RS. It will be the most common and can be the result of several things:
    • Slow network access from ASE
    • Slow network access at RS
    • Slow inbound queue SQM (exec cache full)

214 – waiting on run queue after yield
    CPU contention with other processes – unless you see a fair number of waits, adjusting the RepAgent priority is not likely to help throughput. In this case, the RepAgent was scanning and got bumped off the cpu at the end of its timeslice (i.e., it didn't reach the scan_batch_size before the timeslice expired) and had to wait to regain access to the cpu.

215 – waiting on run queue after sleep
    Same as above (cpu contention). In this case, the RepAgent was sleeping (due to sending data to the RS – any network or physical disk I/O results in the spid being put on the sleep queue), and when the network operation completed it had to wait on other users before it could reclaim the cpu and continue scanning.

222 – replication agent sleeping during flush
    This is typically an indication of the rep agent reaching the end of the transaction log and sleeping on the log flush.

An important point about WaitEventID=222: if you are benchmarking and you sample the counters just before the stress test starts and again at the end, you will see at least 2 waits – the reason is likely that the RepAgent was at the end of the log when the first sample was taken, and the last sample was taken after the RepAgent had finished reading out the test transactions. As with any monitoring activity, a strict before-and-after snapshot is not that informative. It is much better to take samples at timed intervals – such as every minute or so. This helps eliminate false highs/lows at the test boundaries, such as the 222 waits.

If you see a significant number of 171's (as illustrated above) – and this will be the most common case – the next step is to determine the cause of the slow LTL transfer. One method is to use Replication Server's Monitors & Counters feature, focusing on the EXEC, SQM, and STS thread counters. While the first two may be obvious, the STS counters are useful for determining whether the RS is hitting the RSSD server to read in repdefs – which are used by the EXEC thread during normalization.

Fault Isolation Questions (FIQ’s)

Most of us are familiar with FAQ's – which serve as a loosely defined database of previously asked questions and answers. We'll morph that a bit into FIQ's – questions you should ask when troubleshooting. A common problem is that most programmers and database administrators today have poor fault isolation skills, largely due to the lack of organized fault isolation trees – a problem shared by vendors. Unfortunately, this often leads to phone calls to Technical Support with "it's slow" and no information about what may be going on to help identify why it may be slow. The following questions are useful in helping to isolate the potential causes of RepAgent performance problems:

• How far behind is the Replication Agent (MB)? (current marker vs. end of log)


• What is the rate at which the Replication Agent appears to be processing log pages (MB/min)?
• What is the rate at which pages are being appended to the transaction log (MB/min)? (monDeviceIO)
• How much cpu time is the RepAgent getting? (monSysWaits/monProcessWaits)
• Is there a named cache specifically for the transaction log, with most of the cache defined for a 4K (or other) pool and sp_logiosize set to the pool size?
• What are the configuration values for the RepAgent?
• Do any columns in the schema contain text? If so, what is the replication status for the text columns (always, if changed, etc.)? Is there a named cache pool for the text columns (later discussion)?
• Is the latency the result of executing large transactions (how many commands show up in admin who, sqt)?
• Where is the RSSD in relation to the RS? Is the RepAgent waiting a long time for the secondary truncation point from the RS at the end of each ltl_batch_size?
• What do the contents of the last several blocks of the inbound queue look like?
• Were lengthy DBA tasks such as "reorg reclaim_space <tablename>" issued without turning replication off for the session (via "set replication off")?

The last question is a bit strange, but it turns out that some DBA tasks issue a huge number of empty begin/commit transaction pairs. As described earlier, not knowing whether the transaction will contain replicated commands, the RepAgent prior to ASE 12.5.2 forwards these BT/CT pairs to the RS, where they are eventually filtered out. As a result, it is a good practice to put "set replication off" at the beginning of most DBA scripts, such as dbcc or reorg commands. As of ASE 12.5.2, the RepAgent is smart enough to filter out empty BT/CT pairs caused by system transactions.
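A minimal sketch of that practice (the table name is illustrative):

-- pre-ASE 12.5.2: suppress the empty begin/commit pairs generated by the reorg
set replication off
go
reorg reclaim_space my_table
go
set replication on
go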


Replication Server General Tuning

How many resources will Replication Server require? This is a favorite question – and a valid one – of nearly every system administrator tasked with installing a Replication Server. The answer, of course, is that it all depends – it depends on the transaction volume of the primary sites, how many replicate databases are involved, and how much latency the business is willing to tolerate.

The object of this section is to cover basic Replication Server tuning issues. It should be noted that these are general recommendations that apply to many situations; however, your specific business or technology requirements may prevent you from implementing the suggestions completely. Additionally, due to environment-specific requirements, you may achieve better performance with different configurations than those mentioned here.

The recommendations in this section are based on the assumption of an enterprise production system environment and consequently are significantly higher than the software defaults.

Replication Server/RSSD Hosting

A common mistake is placing the RSSD database in one of the production systems being replicated to/from. While this has other issues of its own, one of the main problems is that it frequently places the RSSD across the network from the Replication Server host. As you saw earlier in the Rep Agent discussion on secondary truncation point management, the volume of interaction between the Replication Server and the RSSD can be substantial – just in processing the LTL. Add queue processing, catalog lookups, and other RSSD accesses, and this load increases considerably. This leads to a critical performance concept for RSSD hosting:

Key Concept #7: Always place the RSSD database in an ASE database engine on the same physical machine host as the Replication Server or use the Embedded RSSD (eRSSD). In addition, make sure that the first network addresses in the interfaces file for that ASE database engine are ‘localhost’ (127.0.0.1) entries.

The latter part of the concept may take a bit of explaining. If you look in the hosts file on any platform (/etc/hosts for Unix; %systemroot%\system32\drivers\etc\hosts for Windows NT), you should see an entry similar to:

127.0.0.1 localhost #loopback on IBM RS6000/AIX

In addition to the Network Interface Card (NIC) IP addresses, the localhost IP address refers to the host machine itself. The difference is in how communication is handled when addressing the machine via the NIC IP address versus the localhost IP address. If using the NIC IP address, packets destined for the machine name may not only have to hit the NIC card, but may also require NIS lookups or other network activity (routing) that involves, at a minimum, the NIC hardware. On the other hand, when using the localhost entry, the TCP/IP protocol stack knows that no network access is really required. As a result, the protocol stack implements a "TCP loopback" in which the packets are essentially routed between the two applications using only the TCP stack. An illustration of this is shown below:



Figure 9 – hostname/NIC IP vs. localhost protocol routing

As you could guess, this has substantial performance and network reliability improvements over using the network interface. Typically, this can be implemented by modifying the Sybase interfaces file to include listeners at the localhost address. However, these must be the first addresses listed in the interfaces file in order for this to work. For example:

NYPROD
    master tcp /dev/tcp localhost 5000
    master tcp /dev/tcp nymachine 5010
    query tcp /dev/tcp localhost 5000
    query tcp /dev/tcp nymachine 5010

NYPROD_RS
    master tcp /dev/tcp localhost 5500
    master tcp /dev/tcp nymachine 5510
    query tcp /dev/tcp localhost 5500
    query tcp /dev/tcp nymachine 5510

Note that many of today's vendors have added the ability for the TCP stack to automatically recognize the machine's own IP address(es) and provide similar functionality without specifically having to use the localhost address. Even so, there may be a benefit to using the localhost address on machines where the RSSD is co-hosted with application databases and the RS would otherwise have to contend with application users for the ASE network listener. By using the localhost address, RS queries to the RSSD may bypass the "traffic jam" on the network listener used by all the other clients. A word of warning: on some systems, implementing multiple network listeners – especially one on localhost – could result in severe performance degradation (especially when attempting large packet sizes). One such combination was AIX 4.3 with ASE 11.9.2 (neither of which is currently supported, both having been end-of-lifed by IBM and Sybase years ago).

Additionally, the machine should have the following minimal specifications (NOTE: The following specifications are not the bare minimums, but probably are the minimum a decent production system should consider to avoid resource contention or swapping):

Resources Recommendation

# of CPU’s 1 for each RS and ASE installed on box plus 1 for OS and monitoring (RSM). (min 2, 3 preferred). If planning on high volume with multiple connections and using RS 12.6/SMP, suggestion is 2-3 for the RS, 1-2 for ASE/RSSD and 1 for the OS & RSM (4-6 cpu’s). Using the eRSSD reduces the cpu load significantly such that it would be rare to need more than 4 cpu’s unless 3 or more active DB connections are in the RS.


Memory 128-256MB for ASE (32-64 for eRSSD instead) plus memory for each RS (64-128MB min) and operating system (32-64MB). Min of 256MB with 1GB recommended

Disk Space ASE requirements plus RAID 0+1 device for stable queues – separate controllers/disks for ASE and RS. Although the default creation for the RSSD is only 20MB (2KB pages), recommend 256-512MB data and 128-256MB log due to monitoring tables. Although the eRSSD uses significantly less system space (i.e. no 20MB master, 120MB sybsystemprocs, 100+MB tempdb), because of the autoexpansion, it can grow rapidly if logging exceptions.

Network Switched Gigabit Ethernet or better (10Gb Ethernet or InfiniBand)

The rationale behind these recommendations will be addressed in the discussions in the following sections.

Author's Note: As of this writing, there should be no licensing concern with restricting the use of an ASE to the RSSD. Each Replication Server license includes the ability to implement a "limited use" ASE solely for the purpose of hosting the RSSD ("limited use" meaning the ASE server can only be used for the RSSD – no application data, etc. permitted). Consequently, each RS implemented at a site could have its own ASE specifically for the RSSD. However, it is assumed you already have the ASE software; consequently it is not shipped as part of the RS product set. For ASE 15.0 and higher, the SySAM 2 license manager will require a restricted-use license key – customers may have to coordinate with Sybase Customer Service to ensure that the correct number of keys are available.

RS Generic Tuning

Generally speaking, the faster the disk I/O subsystem and the more memory available, the faster Replication Server will be. In the following sections, Replication Server resource usage and tuning will be discussed in detail.

Replication Server Memory Utilization

A common question is how much memory is necessary for Replication Server performance to achieve desired levels. The answer really depends on several factors:

1. Transaction volume from primary systems
2. Number of primary and replicate systems
3. Number of parallel DSI threads
4. Number of replicated objects (repdefs, subscriptions, etc.)

Of course, life isn't that simple. Based on the above considerations, you have to adjust several configuration settings within the Replication Server, with certain minimums required based on your configuration. Some of these are documented below:

Replication Server Settings

num_threads (10.x)
Default: 50; Suggest: 100+
The number of internal processing threads, client connection threads, daemons, etc. The old formula for calculating this was:
    (#PDB * 7) + (#RDB * 3) + 4 + (num_client_connections) + (parallel DSI's) + (subscriptions) + …
The new formula is:
    30 + 4*IBQ + 2*RSIQ + DSIQ*(3 + max(DSIE))
where IBQ = inbound queues, RSIQ = route queues, DSIQ = outbound queues, and max(DSIE) = maximum parallel DSI threads (max(dsi_num_threads)). The recommendation is minimally 100, particularly for RS 12.5+.


num_msgqueues (10.x)
Default: 178; Suggest: 250+
Specifies the number of OpenServer message queues that will be available for the internal RS threads to use. The old formula for calculating this was:
    2 + (#PDB * 4) + (#RDB * 2) + (#Direct Routes)
However, given that this number must always be larger than num_threads, a simpler formula would be num_threads*2. The recommendation is 250 (2.5*num_threads).

num_msgs (10.x)
Default: 45,586; Suggest: 128,000
The number of messages that can be enqueued at any given time between RS threads. The default settings suggest a 1:256 ratio (num_msgqueues to num_msgs), although 1:512 may be more advisable. Based on the above settings, num_msgs may need to be set to 128,000.

num_stable_queues (10.x)
Default: 32; Suggest: 32*
Minimum number of stable queues. This should be at least twice the number of database connections + num_concurrent_subs.

num_client_connections (10.x)
Default: 30; Suggest: 20
Number of isql, RSM and other client connections (non-Rep Agent or DSI connections). The default of 30 is probably a little high for most systems – 20 may be a more reasonable starting point.

num_mutexes (10.x)
Default: 128 pre-12.6, 1024 in 12.6; Suggest: see formula
Used to control access to connection and other internal resources. The old formula for calculating this was:
    12 + (#PDB * 2) + (#RDB)
As of RS 12.5 and the native-threaded OpenServer, the formula was changed to:
    200 + 15*RA_USER + 2*RSI_USER + 20*DSI + 5*RSI_SENDER + RS_SUB_ROWS + CM_MAX_CONNECTIONS + ORIGIN_SITES
where RA_USER = RepAgents connecting; RSI_USER = inbound routes; RSI_SENDER = outbound routes; RS_SUB_ROWS and CM_MAX_CONNECTIONS are from rs_config; ORIGIN_SITES = number of inbound queues.

sqt_max_cache_size (11.x)
Default: 131,072; Suggest: 4,194,304 (4MB)
Maximum SQT (Stable Queue Transaction) interface cache memory (in bytes) for each connection (primary and replicate). Serious consideration should be given to setting this to 4-8MB or higher, depending on transaction volume. Settings above 16MB are likely counterproductive.

Connection (DSI) Settings

dsi_sqt_max_cache_size (11.x)
Default: 0; Suggest: 2,097,152
Maximum SQT interface cache memory for a specific database connection, in bytes. The default, 0, means the current setting of the sqt_max_cache_size parameter is used as the maximum cache size for the connection. If sqt_max_cache_size is fairly high, you may want to set this in the 2-4MB range to reserve memory. To calculate a starting point, consider num_dsi_threads * 64KB. (Note that this is a per-connection setting. It is mentioned here to emphasize a connection setting that should be changed as a result of changing sqt_max_cache_size.)

exec_sqm_write_request_limit (12.1)
Default: 16384; Suggest: 983,040
Amount of memory available to a Rep Agent User/Executor thread for messages waiting in the inbound queue before the SQM writes them out. Must be set in even multiples of 16K (the block size). The maximum is 60 blocks, or 983,040.


md_sqm_write_request_limit (11.x)
Default: 16384; Suggest: 983,040
Amount of memory available to a DIST thread's MD module for messages waiting in the outbound queue before the SQM writes them out. Must be set in even multiples of 16K (the block size). The maximum is 60 blocks, or 983,040. Note that the name was changed in RS 12.1 from md_memory_pool to the current name to correspond with the exec_sqm_write_request_limit parameter.

Each of these resources consumes some memory. However, once the number of databases and routes are known for each Replication Server, the memory requirements can be quickly determined. For the sake of discussion, let’s assume we are trying to scope a Replication Server that will manage the following:

• 20 databases (10 primary, 10 replicate) along with 5 routes
• 2 of the 10 replicate databases have Warm Standby configurations as well
• 4 of the replicate databases have had dsi_sqt_max_cache_size set to 3MB
• The RSSD contains about 5MB of raw data due to the large number of tables involved
• md_sqm_write_request_limit and exec_sqm_write_request_limit are maxed at 983,040; sqt_max_cache_size is set to 1MB
• num_threads is set to 250 for good measure (the system requires nearly 200)

The memory requirement would be:

Configuration value/formula                       Memory             Example (KB)
num_msgqueues * 205 bytes each                    36KB default                100
num_msgs * 57 bytes each                          2.5MB default             7,125
num_mutexes * 205 bytes each                      205KB default               205
num_threads * 2800 bytes each                     140KB default               684
# databases * 64K + (16K * # Warm Standby)        1MB min                   1,312
# databases * 2 * sqt_max_cache_size                                       40,960
dsi_sqt_max_cache_size – sqt_max_cache_size                                 8,192
exec_sqm_write_request_limit * # databases        960K@ if maxed           19,200
md_sqm_write_request_limit * # databases          960K@ if maxed           19,200
size of raw data in RSSD (STS cache)                                        5,120
exec_sqm_write_request_limit * # databases        983,040 (max)            19,200

Minimum Memory Requirement (MB)                                            ~128MB

Of course, the easy way is to just use the values below as starting points (assumes a normal number of databases ~10 or less - if more/less adjust memory by same ratio):

                           Normal     Mid Range   OLTP       High OLTP
sqt_max_cache_size         1-2MB      1-2MB       2-4MB      8-16MB
dsi_sqt_max_cache_size     512KB      512KB       1MB        2MB
memory_limit               32MB       64MB        128MB      256MB

The definitions of each of these are as follows:

Normal – thousands to tens of thousands of transactions per day
Mid Range – tens to hundreds of thousands of transactions per day
OLTP – hundreds of thousands to millions of transactions per day
High OLTP – millions to tens of millions of transactions per day

Now then, it is easy to run out of memory if you are not careful. One of the best ways to improve RS speed when there is latency in the inbound queue (other than in a Warm Standby) is to increase sqt_max_cache_size. However, bumping it up when admin who, sqt doesn't show any transactions being removed from cache doesn't help – and can also cause you to run out of memory fairly quickly. For example, a system with 7 connections (3 Warm Standby pairs + RSSD) and a sqt_max_cache_size of 32MB may run great with 500MB of memory as long as only 2 of the connections are active (i.e., one WS pair). As soon as activity starts on another, the following happens:

T. 2004/09/30 00:11:55. (111): Additional allocation of 496 bytes to the currently
   allocated memory of 524287924 bytes would exceed the memory_limit of 524288000
   specified in the configuration.
F. 2004/09/30 00:11:55. FATAL ERROR #7035 REP AGENT(SERVER.DB) – s/prstok.c(493)
   Additional allocation would exceed the memory_limit of '524288000' specified in
   the configuration.
T. 2004/09/30 00:11:55. (111): Exiting due to a fatal error

If you get the above error, it is probably a clue that you have sqt_max_cache_size set too high (perhaps tuned while watching a large transaction) – or that you need to raise memory_limit due to the number of connections.
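A minimal sketch of the latter response (the values are illustrative; memory_limit is specified in MB while sqt_max_cache_size is in bytes):

-- raise the overall RS memory ceiling and scale sqt_max_cache_size back down
configure replication server set 'memory_limit' to '768'
go
configure replication server set 'sqt_max_cache_size' to '8388608'
go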

General RS Tuning

In addition to the memory configuration settings, there are several other server-level Replication Server configuration parameters that should be adjusted. These configuration settings include (note that this list does not include previously mentioned memory configuration settings):

init_sqm_write_delay (11.5?)
Default: 1000; Suggest: 50
Write delay for the Stable Queue Manager if the queue is being read. The impact of this is that the RepAgent (inbound) and DIST (outbound) threads are slowed down to provide time for the SQM to read data from disk for the SQT or DSI threads. Typically, if exec_sqm_write_request_limit is set appropriately, the SQT is likely rescanning a large transaction – consequently increasing this value to favor queue reading will likely result in larger overall latency.

init_sqm_write_max_delay (11.5?)
Default: 10000; Suggest: 100
The maximum write delay for the Stable Queue Manager if the queue is not being read. See above for discussion about why this should be set lower.

sqm_recover_segs (12.5+)
Default: 1; Suggest: 10
Specifies the number of stable queue segments Replication Server scans during initialization. This also impacts how frequently RS updates the rs_oqid table as segments are allocated/deallocated. During periods of high-volume activity, the default setting can result in near-OLTP loads of 2+ updates/sec in the RSSD. Setting it higher allows the RS to spend more time writing to the queue vs. waiting for the RSSD.

sqm_write_flush (12.5)
Default: on; Suggest: off
Similar to the dsync option in ASE, this parameter controls whether RS waits for I/O's to be flushed to disk for stable queues (effectively, RS uses the O_SYNC flag). This should be ignored for raw partitions; however, if using UFS devices, it could impact performance by a factor of 30% or greater (the price of insuring recoverability). While a theoretical data loss can occur with this off (and UFS devices), the built-in recoverability within RS mitigates most of the risk to the point that in testing, no data has ever been lost.

sqt_init_read_delay (12.6)
Default: 2000; Suggest: 1000 for 12.6, 100 for 15.0
The length of time an SQT thread sleeps while waiting for a Stable Queue read to complete before checking to see if it has been given new instructions in its command queue. With each expiration, if the command queue is empty, SQT doubles its sleep time up to the value set for sqt_max_read_delay. In high-volume systems, this may not have much of an impact, but in low-volume systems, the rate of space being released from the queue may be impacted negatively.


sqt_max_read_delay (12.6)
Default: 10000; Suggest: 1000 for 12.6, 100 for 15.0
The maximum length of time an SQT thread sleeps while waiting for a Stable Queue read to complete before checking to see if it has been given new instructions in its command queue. In high-volume systems, this may not have much of an impact, but in low-volume systems, the rate of space being released from the queue may be impacted negatively.

SMP Tuning

As mentioned earlier in the discussion on RepAgent performance, fewer really fast cpu’s in a small entry level server (2-4 cpu’s) is ideal. With RS 12.6 or RS 15.0, enabling SMP capabilities with

-- not supported on Mac OS X or Tru64 (DEC)
configure replication server set 'smp_enable' to 'on'

is probably beneficial even in uniprocessor boxes (although probably only slightly with 1 cpu). If more than one cpu is available, you should definitely configure this option.

Now then, RS/SMP as of RS 12.5+ is built on native threading via OpenServer 12.5 support for native threads. Different from ASE (which favors kernel threading for I/O and context switching for user processes, and consequently employs a multi-engine SMP environment), OpenServer native threading can only take advantage of multiple processors when the O/S schedules a thread on another CPU. For most machines, this is most efficient when employing POSIX threading, as kernel threads typically operate on the same cpu as the parent task. As a result, for example, in engineering tests on Solaris, adding /usr/lib/lwp to $LD_LIBRARY_PATH ahead of /usr/lib resulted in a 30% performance gain.

Note that because RS is POSIX thread based vs. engine based, you will need to constrain how many CPU’s it can run on by creating processor sets (or similar term for your hardware vendor) and then binding RS to that processor set. There is a separate white paper on RS SMP that describes this in detail.

Stable Device Tuning

Stable Queue I/O

As you are well aware, Replication Server uses the stable device(s) for storing the stable queues. The space on each stable device is allocated in 1MB chunks, each divided into 64 16K blocks. Individual replication messages are stored in "rows" within a block; from an I/O perspective, all I/O is done at the block (16K) level, while space allocation is done strictly at the 1MB level. As each block is read and its messages processed, the block is marked for deletion as a unit. Only when all of the blocks within a 1MB allocation unit have been marked for deletion and their respective save intervals have expired will the 1MB be deallocated from the queue. Often, this is a source of frustration for novice Replication Server administrators who vainly try to drop a partition and expect it to go away immediately.

The reason for this discussion is that administrators need to understand that the RS will essentially be performing 16K I/O's using sequential I/O, unless the queue space gets fragmented with several different queues on the same device (see below). Ordinarily, this would lend itself extremely well to UFS (file system) storage, as UFS drivers and buffer cache are tuned for sequential I/O – especially using "read-ahead" logic to reduce time waiting for physical I/O. The problem with using UFS is twofold:

• Replication Server uses asynchronous I/O (dAIO daemon) to ensure I/O concurrency with different SQM threads for the different queues. Still today, some UFS systems such as HP-UX 11 do not allow asynchronous I/O to file systems. The net result is that writing to a UFS effectively single threads the process as the operating system performs a synchronous write and blocks the process.

• While most vendors (HP included) have enabled the ability to specify raw I/O (unbuffered) to UFS devices, the I/O routines with RS have not been updated to take advantage of this fact. As a result, using UFS devices could cause a loss of replicated data should there be a file system error.

With the exception of the SQM tuning parameters discussed later, there is not much manual tuning you can do to improve I/O performance.


Async, Raw, UFS, and Direct I/O

In an earlier version of this document, the last bullet caused a bit of misunderstanding – that UFS devices would be faster than raw partitions for stable queue devices but RS was not engineered to take advantage of it. Actually, this is not quite correct; rather, the purpose was to illustrate how O/S vendors are changing their respective UFS device implementations to mirror the concurrent I/O capabilities of raw devices. The misunderstanding is due to a common misconception (that unfortunately was further spread in early ASE 12.0 training materials) that UFS devices are faster than raw devices. As a result, many were tempted to switch to UFS devices using the dsync flag to ensure recoverability, in the hope of getting greater performance. That belief is simply false. Raw devices historically have been used to provide unbuffered I/O as well as multi-threaded/concurrent I/O against the same device – two distinctly different features. Unbuffered I/O, of course, guaranteed recoverability. Asynchronous I/O provided concurrent I/O and consequently scalability for parallel processing.

UFS devices, as mentioned above, typically do not allow concurrent I/O operations against the same device when using buffered I/O. The buffer cache can reduce I/O wait times for highly serialized access, and consequently has in the past provided performance improvements for single-threaded processes, or in areas where a spinlock or other access restriction single-threads the access. This can easily be illustrated using the transaction log, select/into (which single-threads I/O due to system table locks), bcp, or other environments in which I/O concurrency is not involved. As stated, however, this is largely due to the buffer cache and not to the UFS device implementation itself. When the dsync flag is enabled, the buffer cache is forced to flush each write – and consequently the performance advantage immediately disappears. In fact, even with the buffer cache, the more concurrent the I/O activity, the better the performance of raw devices vs. UFS devices. On earlier versions of HP-UX (9.x & 10.x), with only 5 concurrent users attempting database operations, raw partitions were able to outpace UFS devices. The same was true on SGI's IRIX at 75% write activity for a single user.

The advantage comes from the fact that raw partitions allow concurrent I/O from multiple threads or processes by using the asynchronous I/O libraries. Consequently, server processes such as ASE or RS can submit large I/O requests in parallel for the same internal user task. Overall, even in low-concurrency environments, buffered UFS devices using dsync suffer such performance degradation that a boldface warning was even placed in the ASE 12.0 manuals. For years, one way to get around this and get similar performance on UFS as with raw partitions was to use a tool such as Veritas's Quick I/O product – which enables asynchronous I/O for UFS devices. Quick I/O has been certified with ASE; however, it has not been tested with RS (RS engineering typically does not certify hardware such as solid state disks or third-party O/S drivers, as these features should be transparent to the RS process and managed by the O/S). In fact, an interesting benchmark clearly illustrating the problem with UFS devices was published by Veritas a while ago, titled "Veritas Database Edition 1.4 for Sybase: Performance Brief – OLTP Comparison on Solaris 8".

For Solaris customers, a really good book you can use to justify the use of raw partitions over file systems to paranoid Unix admins is Database Performance and Tuning, by Alan Packer, published by Sun (http://www.sun.com/books/catalog/Packer/index.html). While it does not give a lot of detailed advice about database tuning from a DBA’s perspective, it is a very good book for describing the O/S features that enable top performance as well as for understanding the architectures of the major DBMS vendors and their implementations on the Solaris platform.

What that last bullet on the previous page referred to was that over the past two years, Sun, HP and others have implemented a version of asynchronous I/O for UFS devices, called “Direct I/O”. Unfortunately, the degree of concurrency is limited in some operating systems to only 64 concurrent I/O operations – far below the capacity of ASE or RS. Additionally, it may require changes to the O/S kernel to enable Direct I/O for UFS I/O activity. As RS 12.x was engineered prior to “Direct I/O” availability, UFS devices still use synchronous I/O operations. With an EBF released in late 2001 (EBF number is platform dependent), RS 12.1 did implement the dsync flag similar to ASE 12.0 via the sqm_write_flush configuration (configure replication server set ‘sqm_write_flush’ to ‘on’ - it may already be on as it is the default). This, of course, does provide the capability for RS to use UFS devices in a recoverable fashion. However, one customer who had been using UFS devices immediately noticed a 30% drop in performance.

One thing should be noted after all this discussion. RS typically is not I/O bound (unless a large number of highly active connections are being supported, or when processing large transactions). As a result, unless you have reason to believe that you will be I/O bound, using UFS devices with the sqm_write_flush option ‘off’ may be a usable implementation if raw partitions are not available – particularly in Warm Standby implementations where only a single SQM thread may be performing I/O. The preference, for now (12.6 and 15.0 ESD #1), is still raw partitions.
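If you do choose to experiment with this trade-off, the configuration uses the same command shown above; the following sequence is purely illustrative (remember that 'on' is the recoverable default):

configure replication server set 'sqm_write_flush' to 'off'
go
-- revert to the recoverable default
configure replication server set 'sqm_write_flush' to 'on'
go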

FSync I/O RS Future Enhancement

This will change with a feature being considered for a future RS release. Currently, RS flushes each block as it fills, but only updates the RSSD every sqm_recover_seg – which defaults to 1MB and typically should be set to 10MB. As a result, as part of recovery, RS reads the RSSD to find its location within each queue and then begins checking each block after that point to see if it is still active or already processed, comparing to the OQID last received by the next stage. Note that the OQID that is provided to the previous stage of the pipeline (RepAgent, inbound SQM, etc.) is based on the even segment point defined by sqm_recover_seg. As a result, each previous stage of the pipeline will often start reprocessing from that point and the next stage will simply treat any repeats as duplicates. Effectively, this means that any data processed between the last OQID written to the RSSD and the point at which the RS crashed or was shut down is redundant – if it were not in the queue, it would simply get resent.

The future enhancement to RS is to use the fsync I/O call on UFS devices in synchronization with sqm_recover_seg. This would allow RS to leverage file system buffering to speed I/O processing by caching most of the writes in the file system buffer cache. Since the O/S destages these to the devices as necessary, when the fsync is invoked, hopefully few, if any, actual writes will have to occur. It is anticipated that this technique will achieve the following benefits:

• RepAgent throughput will increase as the write to the inbound queue will be faster.
• DIST throughput will increase for a similar reason as the writes to the outbound queue will be faster.
• SQT and DSI/SQT cache will effectively be extended by the file system buffer cache, eliminating expensive physical reads when the sqt_max_cache_size is exceeded by large transactions.

As a result, some future release of Replication Server will likely recommend file system devices over raw partitions.

Stable Queue Placement

One often-requested feature new to the 12.1 release was the concept of “partition affinity”. Partition affinity refers to the ability to physically control the placement of individual database stable queues on separate stable device partitions. This helps alleviate the I/O contention between:

• Two different high volume sources or destinations.
• The Warm Standby inbound queue and other replication sources and targets.

Some people would quickly point out that separation between the inbound and outbound queues for the same connection is not possible with this scheme. True. However, this is not necessarily a problem. Remember, for any source system, the inbound queue is used for data modifications from the source while the outbound queue is used for data modifications from other systems destined to the connection. Consequently, unless two high volume source systems are replicating to each other, this should not pose a problem. One place where it could occur (and consequently bears some monitoring) is if a corporate roll-up system also supports a Warm-Standby.

Default Partition Behavior

Prior to 12.1, you could get similar behavior through an undocumented mechanism: if all of the stable device partitions are added prior to creating the database connections, the Replication Server will round-robin the placement of the database connections’ queues on the individual partitions. The difference between this and adding the database connections prior to adding the extra partitions is illustrated below.

[Diagram: two layouts across stable partitions part1, part2, and part3 – connections created prior to stable devices part2 and part3 vs. connections created after stable devices part2 and part3]

Figure 10 – Stable Device Partition Assignment & Database Connection Creation

Obviously, the situation on the right is preferable. However, even though it may start this way, due to much higher transaction volume to/from one connection vs. another, or a longer save interval, one queue may end up “migrating” onto another connection’s partitions.


Partition Affinity

In Replication Server 12.1, you can specifically assign stable queues to disk partitions through the following command syntax:

alter connection to dataserver.database
set disk_affinity to 'partition_name' [to 'off']

or

alter route to replication_server
set disk_affinity to 'partition_name' [to 'off']

Any disk partition can have multiple connections’ queues assigned to it; however, currently each connection can only be affinitied to a single partition. This latter restriction can be a bit of a nuisance where multiple high-volume connections need more than 2040MB of queue space (particularly where the save_interval creates such situations).

Assigning disk affinity is actually more of a “hint” than a restriction. If space is available on the partition and the partition exists (i.e. not dropped or pending to be dropped), then the space will be allocated for that stable queue on that partition. If the space is not available, then space will be allocated according to the default behavior.
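As an illustration only (the server, database, and partition names below are hypothetical), separating two high-volume connections onto their own partitions might look like:

alter connection to NYPROD.trading
set disk_affinity to 'part2'
go
alter connection to LNPROD.trading
set disk_affinity to 'part3'
go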

Stable Partition Devices

Another common mistake that system administrators make is placing the Replication Server on a workstation with only a single device (or on a server, but only allowing the Rep Server to address a single large disk). First, this causes a problem in that while a Rep Server can manage large numbers of stable partition devices, each one is limited to 2040MB (less than 2GB). This has nothing to do with 32-bit addressing, or the Rep Server could address a full 2GB (2048MB). The real reason is a limit in the RSSD system table rs_diskpartitions, which tracks space allocations.

create table rs_diskpartitions (
    name            varchar(255),
    logical_name    varchar(30),
    id              int,
    num_segs        int,
    allocated_segs  int,
    status          int,
    allocation_map  binary(255),
    vstart          int
)
go

In the above DDL for rs_diskpartitions, note the allocation_map column. As each 1MB allocation is allocated or deallocated within the device, a bit is set or cleared within this column. Those quick with math realize that 255 bytes * 8 bits/byte = 2040 bits – hence the partition sizing limit. Consequently, try as one might, without volume management software Rep Server will never be able to use all of the space in a 40GB drive – the 7 partition limit in Unix would restrict it to ~14GB of space. Those who are familiar with vstart would be quick to claim this could be overcome simply by specifying a ‘large’ vstart and allowing 2-3 stable devices per disk partition. Well, it doesn’t quite work that way with Replication Server. For example, consider the following sample of code:

add partition part2 on ‘/dev/rdsk/c0t0d1s1’ with size=2040 starting at 2041

The above command will fail. The reason is that the vstart is subtracted from the size parameter to designate how far into the device the partition will start. Consequently, as documented in the Replication Server Reference Manual, the following command creates only a 19MB device starting 1MB inside the specified partition, rather than a 20MB device (and the command above would have attempted a partition of –1MB!!).

add partition part2 on ‘/dev/rdsk/c0t0d1s1’ with size=20 starting at 1

Now that we understand the good, bad, and ugly of Replication Server physical storage, you will understand the reason for the next concept:

Key Concept #8: Replication Server Stable Partitions should be placed on RAID subsystems with significant amounts of NVRAM. While RAID 0+1 is preferable, RAID 5 can be used if there is sufficient cache. Logical volumes should be created in such a way that I/O contention can be controlled through queue placement.


RSSD Generic Tuning

You knew this was coming. Or at least you should have, after all the discussions on the number and frequency of calls between the Replication Server and the RSSD. If you are using the embedded RSSD, you can skip this section (go to STS Tuning below) as it really only applies to ASE-based RSSDs.

Key Concept #9: Normal good database and server tuning should also be performed on the RSSD database and host ASE database server.

What does this mean? Consider the following points:

• Place the RSSD database in a separate server from production systems. This provides the best situation for maintaining flexibility should a reboot of the production database server or the RSSD database server be required. However, the main reason is that it reduces or eliminates CPU contention that the RSSD primary user might have with long-running queries on production systems (don’t let parallel table scans hold your replication system hostage).

• Raise the priority for the RSSD primary user.
• Place the tempdb in a named cache.
• Place the RSSD catalog tables in a named cache separate from the exceptions log (although rs_systext presents a problem – put it in with the system catalog tables) and also use a different cache for the queue/space management tables (see the sketch after this list). This is as much to decrease spinlock contention as it is to ensure that repeated hits on one RS system table don’t flush another from cache.

• Dedicate a CPU to the RSSD database server. If more than one RSSD is contained in the same ASE server, monitor CPU utilization.

• Set the log I/O size (i.e. bind the log to a log cache with 4K pool) for the RSSD. There are a few triggers in the RSSD, including one on the rs_lastcommit table (fortunately not there in primary or replicate systems) that is used to ensure users don’t accidentally delete the wrong rows from the RSSD.
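A minimal sketch of the cache creation and bindings described above, assuming an RSSD named RS_RSSD and illustrative cache names and sizes (adjust to your environment; cache creation may require an ASE restart on older releases, and binding syslogs requires exclusive use of the database):

use master
go
sp_cacheconfig 'rssd_catalog_cache', '20M'
go
sp_cacheconfig 'rssd_log_cache', '10M', 'logonly'
go
-- add a 4K pool to the log cache for larger log I/O
sp_poolconfig 'rssd_log_cache', '4M', '4K'
go
-- bind selected RSSD catalog tables and the transaction log
sp_bindcache 'rssd_catalog_cache', 'RS_RSSD', 'rs_objects'
go
sp_bindcache 'rssd_catalog_cache', 'RS_RSSD', 'rs_columns'
go
sp_bindcache 'rssd_log_cache', 'RS_RSSD', 'syslogs'
go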

Depending on requirements, if multiple Replication Servers are in the environment, it might make sense to consolidate them on a single host (providing enough CPUs exist) and have their RSSDs share a common ASE. In such a case, the common ASE may only need 2 engines vs. the minimum of 1 for individual installations, and the ASE can be tuned specifically for RSSD operations (i.e. turn on TCP no delay).

STS Tuning

In the illustration of RS internals, the System Table Services (STS) module is illustrated as the interface between the Replication Server and the RSSD database. The STS is responsible for submitting all SQL to the RSSD – object definition lookups, segment allocations, oqid tracking, subscription materialization progress, recovery progress, configuration parameters, etc. As you can imagine, this interaction could be considerable. While it is not exactly possible to improve the speed of writes to the RSSD from the STS perspective, obviously, any improvement that caches RSSD data locally will help speed RS processing of replication definitions, subscriptions and function strings.

STS Cache Configuration

Prior to RS 12.1, only a single tuning parameter was available – sts_cache_size, while in version 12.1 another set of parameters was added to enforce a much desired behavior (sts_full_cache_XXX) as described below.

Parameter                      Default / Suggested     Explanation

sts_cache_size                 Default: 1000           (11.x+) Controls the number of rows from each table in the
                               Suggest: 5000           RSSD that can be cached in RS memory. Recommended setting is
                                                       the number of rows in rs_objects plus some padding.

sts_full_cache_{table_name}    Default: see notes      (12.1) Controls whether a specific RSSD table is fully cached.
                               Suggest: see notes      See discussion below. If a table is fully cached, the
                                                       sts_cache_size limit does not apply. Note that the default is
                                                       on for rs_repobjs and rs_users, but off for all other tables.
                                                       Suggest enabling for rs_objects, rs_columns, and rs_functions
                                                       as well as the defaults.


Unfortunately, prior to RS 12.1, only the rs_repobjs (which stores autocorrection status of replication definitions at replicate RS’s for routes) and rs_users tables could be fully cached. That does not imply that other RSSD table rows were not in cache, but rather that the RS only ensured that the rs_repobjs and rs_users tables were fully cached.

RS 12.1 STS Caching

As of RS 12.1, most RSSD tables could be specified to be fully cached in the STS memory pool. A complete list of tables includes:

rs_classes      rs_locater       rs_translations
rs_columns      rs_objects       rs_routes
rs_config       rs_publications  rs_sites
rs_databases    rs_queues        rs_systext
rs_datatype     rs_repdbs        rs_users
rs_functions    rs_repobjs       rs_versions

At a minimum, it is recommended that you cache rs_objects, rs_columns and rs_functions in addition to the rs_users and rs_repobjs that are cached by default. Additionally, if memory permits, you may also want to cache rs_publications (if using publications) and rs_translations (if using the HDS feature for heterogeneous support). If your system has sufficient memory, you may even want to cache rs_systext, particularly in non-Warm Standby implementations where function strings are implemented. However, care must be taken as large function string definitions could consume a lot of RS memory. The syntax to cache an RSSD table is:

configure replication server set sts_full_cache_rs_columns to ’on’

It is notable that rs_subscriptions, rs_funcstrings, rs_rules, and rs_whereclauses are excluded from the above list. The table rs_funcstrings uses its own cache outside of the STS cache pool. Additionally, if creating subscriptions, etc., you may want to disable sts_full_cache as the cache refresh mechanism effectively rescans the RSSD after each object creation – noticeably slowing object creation. For some tables it doesn’t make any sense to use full caching as they are considerably smaller than the sts_cache_size parameter. For example, specifying sts_full_cache_rs_routes is probably not effective as it likely would be fully cached anyhow (most likely with <10 rows). Ditto for rs_databases, rs_dbreps, etc. While rs_locater is small and likely would be fully cached anyhow, it also is updated frequently – which involves updates to the RSSD anyhow (similar to rs_diskpartitions).
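Putting the recommendations above together, an illustrative set of commands to fully cache the suggested tables (in addition to the defaults) would be:

configure replication server set 'sts_full_cache_rs_objects' to 'on'
go
configure replication server set 'sts_full_cache_rs_columns' to 'on'
go
configure replication server set 'sts_full_cache_rs_functions' to 'on'
go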

STS RSSD Table Access

The STS module is literally just that – a module. It is not a thread, nor is it a separate daemon process. Consequently, multiple threads within the Replication Server could be simultaneously using the STS module and creating concurrent connections/queries within the RSSD itself. Unfortunately, the RS is not tied to any specific version of ASE; consequently, no assumptions were made regarding ASE features that could reduce contention between queries or enhance performance. That is not to say that a lot of contention exists within the RSSD – in fact, rather the opposite. Typically, each query will retrieve an atomic row by specifying discrete primary key values. On the other hand, frequently updated tables could have contention when multiple sources or destinations are involved, as the tables modified often have a rowsize far less than 1000 bytes (anything with a rowsize of at least half of 1962 bytes would result in 1 row per page anyhow). You may wish to monitor the following tables for blocking (not deadlocks, but blocking) using Historical Server or sp_object_stats. The syntax for the latter is similar to:

sp_object_stats "00:20:00", 10, MY_RS_RSSD

If the following tables show any contention, you may wish to alter the tables to a datarows locking scheme:

rs_diskpartitions rs_locater rs_oqid

rs_queues rs_segments
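If any of these do show contention, a minimal sketch of the locking-scheme change would be the following (run directly in the RSSD, ideally with the Replication Server shut down; the choice of tables here is illustrative):

-- run in the RSSD database
alter table rs_queues lock datarows
go
alter table rs_segments lock datarows
go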

You also may want to closely monitor the amount of I/O required to fulfill an STS request. Since most queries will use the primary key, only 2-3 I/Os should be required to fulfill the request including index tree traversal, although small tables may be scanned. In addition to adding the monitoring tables, RS 12.1 also modified some indexes within the RSSD for faster access. If you are using a version prior to 12.1 and you notice excessive I/Os, you may want to consider adding the following indexes, or other indexes as applicable.

Table             Added indexes in 12.1                                  Deleted indexes in 12.1

rs_columns        Non-unique index on (objid)                            Unique clustered index on (objid, colnum)
rs_databases      Unique clustered index on (ltype, dsname, dbname);     Non-unique index on (ltype)
                  Unique index on (ltype, dbid)
rs_functions      Clustered index on (funcname)                          Clustered index on (objid)
rs_objects        Non-unique index on (dbid, objtype, phys_tablename, phys_objowner)
rs_systext        Non-unique index on (parentid)
rs_translation    Non-unique index on (classid)

It should be noted that the above indexes were added/deleted due to observed behavior and changes in SQL submitted by the STS. Consequently, simply modifying a 12.0 installation with the above changes may degrade performance. You should always verify index changes through proper tuning techniques before and after the modification. Keep in mind that any RSSD changes you make will be lost during an RS upgrade or re-installation. In addition, it is highly recommended that you contact Sybase Technical Support before making such changes and that you clearly think through all the impacts of the changes to ensure that correct RS operation is not compromised. Adding indexes or changing the locking scheme are fairly benign operations (assuming the RS is shut down during the modification and taking into consideration the extra I/O required to maintain the new indexes), while others – particularly any direct row modifications – could result in loss of replicated data.

On the subject of indexes, it is also advisable to run update statistics after any large RCL changes – such as adding or deleting large batches of replication definitions, subscriptions, etc. – along with including the RSSD in any normal maintenance activities such as running update statistics on a periodic basis or using optdiag to monitor tables with data only locking schemes (you should never allow rows to be forwarded in the RSSD). After making any indexing changes, you may need to issue sp_recompile against the table to ensure that stored procedures will pick up the new index – although few stored procedures are issued by the STS (most are admin procedures issued by users directly in the RSSD such as rs_helpsub).

STS Monitor Counters

In RS 12.1, the following monitor counters were added to track RSSD requests from the RS via the STS. As of 12.6, these counters have been expanded from the initial 8 to the following 9 with the addition of STSCacheExceed:

Counter            Explanation

QueriesTotal       Total physical queries sent to RSSD.
SelectsTotal       Total Select statements sent to RSSD.
SelectDistincts    Total Select Distinct statements sent to RSSD.
InsertsTotal       Total Insert statements sent to RSSD.
UpdatesTotal       Total Update statements sent to RSSD.
DeletesTotal       Total Delete statements sent to RSSD.
BeginsTotal        Total Begin Tran statements sent to RSSD.
CommitsTotal       Total Commit Tran statements sent to RSSD.
STSCacheExceed     Total number of times the STS cache was exceeded.

Obviously the goal is to reduce the number of select statements issued – updates can possibly be reduced via sqm_recover_segs, but the other write activity is necessary for recovery and can’t be reduced much (except inserts due to counters). In addition to the usual error in the errorlog, you can watch STSCacheExceed to determine if you need to bump up the sts_cache_size configuration parameter.
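If the counters are being flushed to the RSSD (see the M&C sections later in this document), a minimal sketch for watching this counter over time might be the following (RS 12.6 column names; the display name is as listed in the table above):

select r.run_date, d.counter_val
from rs_statdetail d, rs_statrun r, rs_statcounters c
where d.run_id = r.run_id
  and d.counter_id = c.counter_id
  and c.module_name = 'STS'
  and c.display_name = 'STSCacheExceed'
order by r.run_date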

Note that some specific types of STS activity can be monitored with counters for other modules. For example, the SQM module includes a counter for tracking the number of updates to the rs_oqid table. Later we will discuss how to set up these counters and how to sample them, but for now, it is somewhat useful to know they exist.


A key point about the STS counters, however, is that they also reflect the STS activity generated by recording M&C data itself. For example, let’s say you activate M&C for all the modules and notice a huge number of inserts via the STS. Rather than conclude that the RSSD is getting hammered by inserts into general RSSD tables, you need to subtract the number of counter values inserted during that time period from InsertsTotal to derive the non-statistics-related insert activity.

RSM/SMS Monitoring

Installing Replication Server Manager (RSM) is an often neglected part of the installation. Any site that is using Replication Server in a production system without using RSM or an equivalent third-party tool (such as BMC Patrol) has made a grave error that they will pay for within the first 3 months of operation. Why is this true? Simply because most sites don’t test their applications today, and as a consequence the transaction testing which is crucial to any distributed database implementation is missed. This virtually guarantees that a transaction, such as a nightly batch job, will fail at the replicate due to a replicate database/ASE issue – for example, the classic “ran out of locks” error from the replicate ASE during batch processing.

RSM Implementation

Having established the need for it, the next question is “How is it best implemented?” The answer, of course, is it depends. However, consider the following guidelines:

1. Configure one RSM Server on each host where a Replication Server or ASE resides. These RSM Servers will function as the monitoring “agents”.

2. Configure one RSM Server on the primary SMS monitoring workstation per replication domain. This RSM will function as the RSM “domain manager”. All interaction with the RSM monitoring “agents” will be done through the SMS RSM “domain manager”.

3. Configure RSM Client (Sybase Central Plug-In) or other monitoring scripts to connect to the SMS RSM “domain managers”.

4. If Backup Server, OpenSwitch or other OpenServer process is critical to operation, consider having one of the RSM “monitoring agents” on that host also monitor the process if no other monitoring capability has been implemented.

5. RSM load ratios: 1 RS = 3 ASE = 20 OpenServers. If more than one RS is on a host, consider adding multiple RSM monitoring agents every 3-5 RS’s (depending on RS load).

6. Do NOT allow changes to the replication system to be implemented through RSM. The main reason for this is that it is a GUI. You will have no record of these changes and it is too easy to make mistakes. Have developers create scripts that can be thoroughly tested and run with high assurance that “fat fingers” won’t crash the system.

The last item is important. As with not keeping database scripts, if you don’t have a record of your replication system, you will appreciate the value of one the first time you have to recreate it. Following the above, a sample environment might look like the following:


[Diagram: primary (PDS) and replicate (RDS) hosts, each with a Monitor Server and an RSM monitoring agent; a Historical Server, the RSSD, and an SMS trends database feed the SMS server RSM “domain manager” used by the DBA]

Figure 11 – Example Replication System Monitoring

RSM vs. Performance

RSM or other SMS software monitoring can impact Replication performance in several ways:

• Unlike ASE’s shared memory access to monitor counters, the Replication Server and RSSD must be “polled” to determine system status information. If the polling cycle is set too small – or too many individual administrators are independently attempting to monitor the system, this polling could degrade RS and RSSD performance.

• Excessive use of the heartbeat feature can interfere with normal replication.

On one production system with a straight Warm Standby implementation, between the RS accesses and the RSM accesses to the RSSD, replication increased tempdb utilization by 10% (100,000 inserts out of 1,000,000) during a single day of monitoring. Because of the way RSM “re-uses” many of the same users, it was impossible to differentiate between RS and RSM activity. However, it is clearly enough of a load to consider a separate RSSD server vs. using an existing ASE in high volume environments.

All of this is leading up to one point:

Key Concept #10 – Monitoring is critical – but make the heart beat, not race!

RS Monitor Counters

One of the major enhancements to Replication Server 12.1 was performance monitoring counters. Similar to the partition affinity feature, the monitors & counters (M&C) were originally slated for the 12.0 release, but did not quite make it in time. As a result, special EBF’s have been created to “backfit” RS 12.0 and 11.5 with the M&C for testing and debugging purposes only.

While in discussion it has often been compared to sp_sysmon, in reality they are closer to Historical Server or the MDA monitoring tables in ASE 12.5.0.3+. The rationale is that sp_sysmon in ASE simply reports the total of any counter during the entire monitoring period. Historical Server and Replication Server have both implemented a “sample interval” type mechanism in that counter values are flushed to disk on a periodic basis during the sample run. This allows peaks to be identified as well as actual cost of individual activities.

The statistics are implemented via a system of counters that can either be viewed through an RS session or be flushed to the RSSD for later viewing/analysis. Currently, in RS 12.6, nearly 300 counters exist, with the possibility of more being added in future releases. Obviously, with 300 counters, it is difficult to document them all in the product documentation. However, you can view descriptive information about the current counters by using the rs_helpcounter stored procedure. Since it is extremely applicable to performance and tuning, this document will discuss the counters in detail as well as provide a list of counters that apply to each of the applicable threads in later sections. This section will provide an overview of the counters as well as those counters specific to RSSD activity, etc.

RS Counters Overview

The monitoring counters implementation and their use can be divided into five basic areas:

1. Monitor counter system tables in the RSSD
2. RCL commands to enable and sample the counters
3. SQL commands to sample counters flushed to the RSSD
4. RCL commands to reset the counters
5. The dStats daemon which performs the statistics sampling

RSSD M&C System Tables.

In addition to the logic and RCL commands added to implement the counters, three additional tables were added to the RSSD to track the counter values and store counter specifics. These tables for RS 12.6 are illustrated below (along with rs_databases due to the relationship with rs_statdetail):

[Entity-relationship diagram: rs_statcounters (counter_id, counter_name, module_name, display_name, counter_type, counter_status, description), rs_statdetail (run_id, instance_id, instance_val, counter_id, counter_val, label), rs_statrun (run_id, run_date, run_interval, run_user, run_status), and rs_databases; rs_statdetail references rs_statrun by run_id, rs_statcounters by counter_id, and rs_databases via instance_id = dbid]

Figure 12 – RSSD Monitor & Counter Tables

In Replication Server 15.0, the rs_statdetail table changed slightly due to a different method of recording average, max, and last counter values. A comparison of the RS 12.x rs_statdetail table and the RS 15.0 rs_statdetail table is illustrated below:


Figure 13 – RS 12.x and 15.0 rs_statdetail table comparison

The main difference is that while RS 12.6 has a single counter_val column, RS 15.0 records the number of observations (counter_obs), the total for the counter (counter_total), the last value for the counter (counter_last) and the maximum for the counter (counter_max). As a result, where RS 12.6 had counters for the last, max and total for some counters such as DSIEResultTimeLast, DSIEResultTimeMax and DSIEResultTimeAve, RS 15.0 has a single counter DSIEResultTime. If using RS 15.0 and you want the last value for DSIEResultTime, you simply select counter_last, and similarly for counter_max. The average is the only change – to get the average DSIEResultTime, you derive it by selecting counter_total/counter_obs. This difference mainly affects counters tracking rates, time, and memory utilization for the various modules.
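For example, a minimal sketch of deriving the average from the RS 15.0 columns described above (the counter display name here is illustrative, and counter_obs > 0 guards against division by zero):

select r.run_date,
       d.counter_total / d.counter_obs as avg_val,
       d.counter_last,
       d.counter_max
from rs_statdetail d, rs_statrun r, rs_statcounters c
where d.run_id = r.run_id
  and d.counter_id = c.counter_id
  and c.display_name = 'DSIEResultTime'
  and d.counter_obs > 0
order by r.run_date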

Information about the individual counters is stored in rs_statcounters, while counter values from each run are stored in the rs_statdetail table, with the run itself recorded in rs_statrun. The key columns of these tables are described below:

Column Name       Example Value                     Explanation

counter_id        4000                              The id for the counter – counter ids are arranged by module
                                                    as detailed below.
counter_name      RSI: Bytes sent                   Descriptive external name for the counter.
module_name       RSI                               Module that the counter applies to.
display_name      BytesSent                         Used to identify the counter through RCL.
counter_type      1                                 The type of counter as detailed below.
counter_status    140                               The relative impact of the counter on RS performance as
                                                    detailed below.
description       Total bytes delivered by an RSI   The counter explanation.
                  sender thread.
instance_id       2                                 The particular instance of the module or thread. For example,
                                                    with a minimum of 2 connections, you will have 2 instances of
                                                    DSI-S threads (or with parallel DSI, multiple instances of
                                                    DSI-E).

As mentioned earlier, the counter ids are arranged by the internal RS module that the counter is used for. The following table lists the counter id ranges and modules used in the rs_statcounters table:

Counter Id Range     Module

4000-4999            RSI
5000-5999            DSI
6000-6999            SQM
11000-11999          STS
13000-13999          CM
24000-24999          SQT
30000-30999          DIST
57000-57999          DSIEXEC
58000-58999          RepAgent (EXEC)
60000-60999          Sync (SMP sync points)
61000-61999          Sync Elements (mutexes)
62000-62999          SQMR (SQM Reader)
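One quick way to confirm these ranges against your own RSSD is a simple grouping query on rs_statcounters (a sketch; the table and columns are as documented above):

select module_name,
       count(*)        as num_counters,
       min(counter_id) as min_id,
       max(counter_id) as max_id
from rs_statcounters
group by module_name
order by 3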

The counter type and status designate whether the counter is a total sampling, average, etc., as well as the impact of the counter on performance and other status information. These are described in the following table:

Value    Variable             Explanation

Counter Types (Enumerated)
1        @CNT_TOTAL           Keeps the total of values sampled
2        @CNT_LAST            Keeps the last value of sampled data
3        @CNT_MAX             Keeps only the largest value sampled
4        @CNT_AVERAGE         Keeps the average of all values sampled

Counter Status (Bitmask)
1        @CNT_INTRUSIVE       Counters that may impact Replication Server performance.
2        @CNT_INTERNAL        Counters used by Replication Server and other counters.
4        @CNT_SYSMON          Counters used by the admin statistics, sysmon command.
8        @CNT_MUST_SAMPLE     Counters that sample even if sampling is not enabled.
16       @CNT_NO_RESET        Counters that are not reset after initialization.
32       @CNT_DURATION        Counters that measure duration.
64       @CNT_RATE            Counters that measure rate.
128      @CNT_KEEP_OLD        Counters that keep both the current and previous value.
256      @CNT_CONFIGURE       Counters that keep the run value of a Replication Server configuration parameter.

From this, you can determine that the sample counter listed above (RSI: Bytes Sent), in addition to being an RSI counter, keeps a running total of bytes sent (counter_type=1), retains both the current and previous value, is sampled even when sampling is not enabled, and is also used by the admin statistics, sysmon command (counter_status=140 = 128 + 8 + 4 = CNT_KEEP_OLD | CNT_MUST_SAMPLE | CNT_SYSMON).
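Since counter_status is a bitmask, the same decoding can be done in SQL with a bitwise AND – for example, a sketch listing the intrusive counters (status bit 1):

select display_name, module_name, counter_status
from rs_statcounters
where counter_status & 1 = 1
order by module_name, display_name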

When looking at rs_statrun and rs_statdetail, many of the values are encoded. For example, run_id itself is composed of two components – the monitored RS’s site id (from rs_sites) in hex form and the run sequence also in hex. For example, consider the following example run_id and the decomposition:

Figure 14 – Example rs_statrun value and composition


The site id is especially needed if you are trying to analyze across a route and have combined statistics from more than one RSSD to perform the analysis. If you need to do this, you can isolate one RS’s statistics from the other’s by focusing on the RS site id with a where clause similar to:

strtobin(inttohex(@prs_id))=substring(run_id,1,4)

in which @prs_id is the site id of the RS in question from rs_sites. One slight gotcha with this formula is that the strtobin() function is as yet undocumented in ASE – but unfortunately it is also the only way of performing this comparison (attempts to use convert(binary(4), @prs_id) failed as it appears to suffer from byte-swapping issues).
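Putting that where clause into context, a minimal sketch of isolating one RS’s flushed counters within a combined repository might look like the following (RS 12.6 column names; the RS name is illustrative and @prs_id is looked up from rs_sites as described above):

declare @prs_id int
select @prs_id = id from rs_sites where name = 'NY_RS'   -- illustrative RS name
select c.display_name, d.instance_id, d.counter_val, r.run_date
from rs_statdetail d, rs_statrun r, rs_statcounters c
where d.run_id = r.run_id
  and d.counter_id = c.counter_id
  and strtobin(inttohex(@prs_id)) = substring(d.run_id, 1, 4)
order by r.run_date
go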

Probably the two most confusing values to decode are the instance_id and instance_val values. The instance_id typically maps to the connection’s dbid or rsid (for routes). With warm standby systems, the queue-related values will be reported for the logical connection, due to the single inbound and outbound queue used in place of individual connection queues. The instance_val column values depend on the thread and, more specifically, the counter module. Consider the following table that illustrates the various thread, instance_id and instance_val values:

Counter Module    instance_id                         instance_val

REPAGENT          ldbid for RS 12.6;                  -1 (not applicable)
                  dbid for RS 15.0
SQM               ldbid for inbound;                  0 = outbound queue; 1 = inbound queue
                  dbid for outbound
SQMR              ldbid                               10 = outbound queue; 11 = inbound queue;
                                                      21 = Warm Standby DSI reader
SQT               ldbid                               0 = outbound queue (DSI SQT); 1 = inbound queue
DIST              dbid                                -1 (not applicable)
DSI               dbid                                0 = normal DSI; 1 = Warm Standby DSI (corresponds
                                                      to the 0/1 outbound/inbound SQM queue identifiers)
DSIEXEC           dbid                                1 - #dsi_num_threads (the specific DSIEXEC thread number)
RSI               rsid                                -1 (not applicable)
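Because instance_id is just a numeric dbid/ldbid/rsid, analysis is usually easier when it is joined back to rs_databases (or rs_sites for RSI) to recover the connection names – for example, a sketch for DSI counters (RS 12.6 column names; module name as documented above):

select db.dsname, db.dbname, c.display_name, d.counter_val, r.run_date
from rs_statdetail d, rs_statrun r, rs_statcounters c, rs_databases db
where d.run_id = r.run_id
  and d.counter_id = c.counter_id
  and d.instance_id = db.dbid
  and c.module_name = 'DSI'
order by db.dsname, db.dbname, r.run_date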

You can view descriptive information about the counters stored in the rs_statcounters table using the sp_helpcounter system procedure. To view a list of modules that have counters and the syntax of the sp_helpcounter procedure, enter:

sp_helpcounter

To view descriptive information about all counters for a specified module, enter:

sp_helpcounter module_name [, {type | short | long} ]

If you enter type, sp_helpcounter prints the display name, module name, counter type, and counter status for each of the module’s counters. If you enter short, sp_helpcounter prints the display name, module name, and counter descriptions for each counter. If you enter long, sp_helpcounter prints every column in rs_statcounters for each counter. If you do not enter a second parameter, sp_helpcounter prints the display name, the module name, and the external name of each counter. To list all counters that match a keyword, enter:

rs_helpcounter keyword [, {type | short | long} ]

To list counters with a specified status, the syntax is:

rs_helpcounter { ’intrusive’ | ’sysmon’ | ’rate’ | ’duration’ | ’internal’ | ’must_sample’ | ’no_reset’ | ’keep_old’ | ’configure’ }

Note the difference between the two procedures – sp_helpcounter is used to list the counters for a module (or all modules), while rs_helpcounter is used to find a counter by keyword in the name or by a particular status.


Enabling M&C Sampling (RS 12.6)

The very first thing that must be done prior to enabling M&C is to increase the size of the RSSD – hopefully you did this when you installed the RS or you will need to now. As far as the rest of this section, much of the information was pulled straight from the RS 12.1 release bulletin and is simply repeated here for continuity (one of the benefits of working for the company is that plagiarism is allowed). Generically, enabling the monitors and counters for sampling is accomplished through a series of steps outlined below:

1. Enable sampling of non-intrusive counters
2. Enable sampling of intrusive counters
3. Enable flushing of counters to the RSSD (if desired)
4. Enable resetting of counters after each flush to the RSSD
5. Set the period between flushes to the RSSD (in seconds)
6. Configure flushing for specific modules, connections, or routes

Each of these steps will be discussed in more detail in the following paragraphs.

Enabling sampling of non-intrusive counters

You enable or disable all sampling at the Replication Server level using the configure replication server command with the stats_sampling option. The default is “on.” The syntax is:

configure replication server set ’stats_sampling’ to { ’on’ | ’off’ }

If sampling is disabled, the counters do not record data and no metrics can be flushed to the RSSD.

Enabling sampling of intrusive counters

Most counters sample data with minimal effect on Replication Server performance. Counters that may affect performance – intrusive counters – are enabled separately so that you can enable or disable them without affecting the settings for non-intrusive counters. You can enable or disable intrusive counters using the admin stats_intrusive_counter command. The default is “off.” The syntax is:

admin stats_intrusive_counter, { ’on’ | ’off’ }

It is highly recommended that you enable intrusive counters. Initially, it was assumed that these counters would impact performance as they primarily track execution times of various processing steps. It turned out in reality that these counters had much less impact than anticipated – and in RS 15.0, the notion of intrusive counters was eliminated. Additionally, these are some of the more useful counters – especially in determining the holdup in DSI/DSIEXEC processing.

Enabling flushing

Use the configure replication server command with the stats_flush_rssd option to enable or disable flushing. The default is “off.” The syntax is:

configure replication server set ’stats_flush_rssd’ to { ’on’ | ’off’ }

You must enable flushing before you can configure individual modules, connections, and routes to flush. This step is optional in the sense that you can view the statistics without flushing them; however, the most beneficial use of the monitors will only be achieved by flushing them to the RSSD for later analysis and for baselining configuration settings.

Enabling reset after flushing

Use the configure replication server command with the stats_reset_afterflush option to specify that counters are to be reset after flushing. The default is “on.” The syntax is:

configure replication server set ’stats_reset_afterflush’ to { ’on’ | ’off’ }

Certain counters, such as rate counters with CNT_NO_RESET status, are never reset.

Setting seconds between flushes

You set the number of seconds between flushes at the Replication Server level using the configure replication server command with the stats_daemon_sleep_time option. The default is 600 seconds. The syntax is:

configure replication server set ’stats_daemon_sleep_time’ to sleeptime

The minimum value for sleeptime is 10 seconds; the maximum value is 3153600 seconds (365 days). For general monitoring, the default may be fine. However, for narrower performance tuning related issues, this may have to be decreased to 60-120 seconds (1-2 minutes) to ensure accurate latency and volume related statistics.


Configuring modules, connections, and routes

A hierarchy of configuration options limits the flushing of counters to the RSSD. The command admin stats_config_module lets you configure flushing for a particular module or for all modules. For multithreaded modules, you can choose to flush metrics from a matrix of available counters. For example, you can configure flushing for a module, for a particular connection, or for all connections. Configuration parameters that configure counters for flushing are not persistent; they do not retain their values when Replication Server shuts down. Consequently, it is a good idea to place frequently used counter-flushing configurations in a script file. Before you can configure a counter for flushing, make sure that you first enable the sampling and flushing of counters. Note: Replication Server 12.x does not flush counters that have a value of zero.

You can set flushing on for all counters of individual modules or all modules using the command admin stats_config_module. The default is “off.” The syntax is:

admin stats_config_module, { module_name | ’all_modules’ }, {’on’ | ’off’ }

where module_name is dist, dsi, rsi, sqm, sqt, sts, repagent, or cm.

This command is most useful for single-instance or non-threaded modules, which have only one thread instance. For multithreaded modules, you have greater control over which threads are set on if you use the admin stats_config_connection and admin stats_config_route commands. Note: If a module’s flushing status is set on, counters for all new threads for that module will be set on also.

The number of threads for a multithreaded module depends on the number of connections and routes Replication Server manages. You can configure flushing for individual threads or groups of threads.

Connections – Use the admin stats_config_connection command to enable flushing for threads related to connections. The syntax is:

admin stats_config_connection, { data_server, database | all_connections },
    { module_name | all_modules }, [ 'inbound' | 'outbound' ], { 'on' | 'off' }

where:

• data_server is the name of a data server.
• database is the name of a database.
• all_connections specifies all database connections. Hint: this will produce a lot of output.
• module_name is dist, dsi, repagent, sqm, or sqt.
• all_modules specifies the DIST, DSI, REPAGENT, SQM, and SQT modules. Hint: this too will produce a lot of output.
• inbound | outbound identifies SQM or SQT for an inbound or outbound queue.

Routes – You can use the admin stats_config_route command to save statistics gathered on routes for the SQM or RSI modules. The syntax is:

admin stats_config_route, { rep_server | all_routes },
    { module_name | all_modules }, { 'on' | 'off' }

where rep_server is the name of the remote Replication Server, all_routes specifies all routes from the current Replication Server, and module_name is sqm or rsi.

Note: If you configure flushing for a thread, Replication Server also turns on flushing for the module. This does not turn on flushing for existing threads of that module, but all new threads will have flushing turned on.
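For example (the data server and database names below are purely illustrative), enabling flushing for only the DSI and outbound SQT counters of a single replicate connection might look like:

admin stats_config_connection, RDS1, pubs2, dsi, 'on'
go
admin stats_config_connection, RDS1, pubs2, sqt, 'outbound', 'on'
go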

Example RS 12.x Script

The typical performance analysis session might use the following series of commands to fully enable the counters:

admin statistics, reset
go
configure replication server set 'stats_sampling' to 'on'
go
admin stats_intrusive_counter, 'on'
go
configure replication server set 'stats_flush_rssd' to 'on'
go
configure replication server set 'stats_reset_afterflush' to 'on'
go
configure replication server set 'stats_daemon_sleep_time' to '60'
go
admin stats_config_module, 'all_modules', 'on'
go

The first line ensures that the first sample is reset vs. holding over the cumulative counts which can distort the very first sample in the run. Along with this, you should reset the counters after each flush – this helps prevent counter rollover during sampling – particularly for the byte-related counters.

Enabling M&C Sampling (RS 15.0)

In Replication Server 15.0, there was a conscious effort to simplify the commands needed to implement counter sampling. As a result, a sample script to enable monitoring for RS 15.0 would look like the following:

admin statistics, reset
go
-- collect stats for "all" modules
-- save them to the RSSD
-- collect for 3 hours at a 15 sec interval (720 observations)
admin statistics, "all", save, 720, 10800
go
admin statistics, status
go

One word of warning - the admin statistics, reset command will truncate rs_statrun and rs_statdetail - so be sure to preserve the rows if you wish to keep them. The other difference is that currently there is no explicit start/stop command. Instead the admin statistics command uses the syntax (note this is an abbreviated syntax - see manual for all options/parameters):

admin statistics, <module>, <save>, <num observations>, <sample period>

The first two parameters are fairly self-explanatory. However, the last two take a bit of getting used to. <num observations> is the number of observations to make. What makes this tricky is the last parameter – the sample period. The first issue is that it is measured in seconds; as noted in the sample script above, 3 hours translates into 10,800 seconds. Using this duration and a given number of observations, you can derive the sample interval. For example, 10,800 seconds with 720 observations yields a 15 second sample interval. It should be noted that in RS 15.0 the smallest sample interval supported is 15 seconds.

The biggest problem with this syntax is that typically you know the interval you want (15 seconds or 1 minute) but either don’t know how long you wish to collect data for - or you know it terms of hours and minutes. Consequently, you often find out that before you execute the command, you are deriving the parameter values with formulas such as:

sample_period = time in hours * 60 * 60
num_observations = sample_period / sample_interval (in seconds)

Because of these usability issues, a subsequent release may enhance the syntax to accept the sample interval directly and to accept the sample period using a notation such as 3h or 120m for entering it as a number of hours or minutes.

One other difference between RS 12.x and RS 15.0 is that in RS 12.6, counters with a value of zero were not flushed to the RSSD. In RS 15.0, counters with a value of zero will be flushed if the number of observations is greater than zero for that counter.

Viewing current counter values

RS monitor counter values can be either viewed interactively via RCL submitted to the replication server via isql or other program or by directly querying the RSSD rs_statrun and rs_statdetail tables if the statistics were flushed to the RSSD. Each of these methods will be discussed.

Viewing counter values via RCL

Replication Server provides a set of admin statistics commands that you can use to display current metrics from enabled counters directly to a client application (instead of saving to the RSSD). Replication Server can display information about these modules: DSI, DSIEXEC, SQT, dCM, DIST, RSI, SQM, REPAGENT, MEM, MD, and MEM_IN_USE. To display information, use the admin statistics command as specified below:

• To view the current values for one or all of the counters for a specified module, enter:

admin statistics, module_name [, display_name]

where module_name is the name of the module and display_name is the display name of the counter. To determine the display name of a counter, use sp_helpcounter.

• To view current values for all enabled counters, enter:


admin statistics, ’all_modules’

• To view a summary of values for the DSI, DSIEXEC, REPAGENT, and RSI modules, enter:

admin statistics, sysmon [, sample_period]

where sample_period is the number of seconds for the run. This command zeros the counters, samples for the specified sample period, and prints the results. If sample_period is 0 (zero) or not present, it prints the current values of the counters.

• To display counter flush status, enter:

admin statistics, flush_status

Viewing values flushed to the RSSD

You can view information flushed to the rs_statdetail and rs_statrun tables using select and other Transact-SQL commands. If, for example, you want to display flushed information from the dCM module counters, you might enter:

select counter_name, module_name, instance_id, counter_val, run_date
from rs_statcounters c, rs_statdetail d, rs_statrun r
where c.counter_id = d.counter_id
  and d.run_id = r.run_id
  and c.module_name = 'CM'
order by counter_name, instance_id, run_date

In this instance, the counters have been configured to save to the RSSD either by the configure replication server set 'stats_flush_rssd' to 'on' command for RS 12.6 or the admin statistics, "all", save, <num observations>, <sample period> command for RS 15.0.

While you can view the counter data directly in the RSSD, it may not be the best option. The biggest reason is that the counters in the RSSD only represent that Replication Server’s values – if a route is involved, you cannot do a full analysis. Another good reason is that extensive querying of the RSSD will put a load on the server that may impact its ability to respond quickly to the normal RSSD processing of RS requests. Additionally, historical trend data could take up considerable space within the RSSD. Consequently, the best option is to use an external repository to collect the RSSD statistics and the information necessary to perform analysis.

Resetting counters

Counters are reset when a thread starts. In addition, some counters are reset automatically at the beginning of a sampling period. You can reset counters by:

• Configuring Replication Server to ensure that counters are reset after sampling data is flushed to the RSSD. Use the configure replication server set ’stats_reset_afterflush’ to ’on’ command.

• Issuing the admin statistics, reset command to reset all counters.

You can reset all counters except counters with CNT_NO_RESET status, such as rate counters, which are never reset. Counters that can be reset are reset to zero.

dSTATS daemon thread

The dSTATS daemon thread supports Replication Server’s new counters feature by:

• Managing the interface for flushing counters to the RSSD.
• Calculating derived values when the daemon thread wakes up.

dSTATS manages the interface when Replication Server has been configured to flush statistics to the RSSD using the configure replication server command and the stats_flush_rssd parameter. You can configure a sleep time for dSTATS using the configure replication server command and the stats_daemon_sleep_time parameter. When the daemon wakes up, it attempts to calculate derived statistics such as the number of DSI-thread transactions per second or the number of RepAgent bytes delivered per second.

Impact on Replication

Obviously, intrusive counters impact RS performance within the RS binary codeline itself. However, the impact is not as great as the name would imply – probably less than 15%. The difference is that the normal counter execution code is executed regardless, while enabling these counters requires executing special routines, including system clock function calls. One word of caution: the counters can impact RS performance indirectly. For example, by setting the flush interval to 1 second and collecting a wide range of counters, you will notice that the RSSD has a sharp increase of 100+ inserts per second as measured by STS InsertsTotal. On a healthy ASE, this may not be that much of a problem, but on most RSSDs this could slow down queue/oqid updates, etc. Additionally, that many inserts/second could fill the transaction log much quicker, which could result in a log suspend (definitely impeding RS performance) – but do not turn on 'truncate log on checkpoint' – or the next words you will hear from Tech Support will be "I hope you had a backup – and you know how to rebuild queues".

RS M&C Analysis Repository

The second thing you will want to do (after increasing the size of the RSSD above) is to create a repository database to upload the counter data to after collection. As mentioned earlier, the reasons for this are:

• Enable you to perform analysis of replication performance when using routes
• Avoid consuming excessive space in the RSSD retaining historical data
• Prevent the cpu load of analysis queries from impacting RSSD performance for RS activities
• Prevent loss of statistics in RS 15.0 when the reset command truncates the tables

Creating the Repository

It is recommended that you create the repository in ASE due to functionality that is not currently available in ASA or Sybase IQ: the undocumented strtobin() and bintostr() functions, which are used primarily in systems involving routes. The repository should contain the following tables:

rs_statrun - contains a list of the statistics sample periods
rs_statdetail - contains the counter values
rs_statcounters - contains the counter descriptions/names
rs_sites - contains the list of all the Replication Servers
rs_databases - contains the list of all the connections
rs_config - contains all the configuration values

In addition to these tables, other tables such as rs_routes, rs_objects, rs_subscriptions, etc. could be included in the extraction if doing the analysis “blind” (without knowing the topology if routing is involved or if the proper subscriptions have been created).

The structure and indexes for these tables can be obtained from the rs_install_systables_ase script in the RS directory. In addition, you may wish to create a copy of your rs_ticket table in the repository as well if you use the rs_ticket feature. About the repository in general, there are a couple of notes:

• If using a mixed RS 12.x and RS 15.0 environment, use separate databases in the same server due to differences in rs_statcounters (counter values and names) and rs_statdetail (counter columns).

• The reason for rs_sites and rs_databases is that the counter instance_id’s use the RS id instead of the connection name. By having these tables, analysis is much easier as the connection names can be used instead of continually looking up the corresponding dbid or prsid.

• While rs_config can change due to configuration changes, having a current copy from the RS allows you to quickly look at configuration values during analysis without having to log back in to the RS in question.

Additionally, you can add indexes to facilitate query performance. A sample repository is available in Sybase’s CodeXchange online along with stored procedures that can help with the analysis.
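For example, since most analysis queries filter rs_statdetail by counter and sample run, an index along the following lines can help. The column names shown (counter_id, instance_id, run_id) are assumptions based on the RS 15.0 schema and should be checked against the rs_install_systables_ase script:

create index rs_statdetail_analysis_idx on rs_statdetail (counter_id, instance_id, run_id)
go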

Populating the Repository

The easiest way to populate the repository is to use bcp to extract all of the above tables from the RSSD. This can be done even with an embedded RSSD that uses ASA, as a bcp out is nothing more than a select of all columns/rows from the desired table. You will need to bcp out all the data from the RS's involved in the replication if routes are involved. A sample bcp script to do this might resemble:

set RSSD=CHINOOK_RS_RSSD
set DSQUERY=CHINOOK
mkdir .\%RSSD%
bcp %RSSD%..rs_statrun out .\%RSSD%\rs_statrun.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192
bcp %RSSD%..rs_config out .\%RSSD%\rs_config.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192
bcp %RSSD%..rs_statcounters out .\%RSSD%\rs_statcounters.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192
bcp %RSSD%..rs_databases out .\%RSSD%\rs_databases.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192
bcp %RSSD%..rs_sites out .\%RSSD%\rs_sites.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192
bcp %RSSD%..rs_statdetail out .\%RSSD%\rs_statdetail.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192

Once you have extracted all the counter data, loading it can be a bit tricky. The counter data tables (rs_statdetail and rs_statrun) should have no issues - even when multiple RS's are involved. However, even if starting from truncated tables, rs_sites and rs_databases may have duplicates when loading data from multiple RS's simply due to the fact that when routes are created, these tables are replicated between the RS's. If you use bi-directional routes, then each server will have a full complement and you only will need to load one copy. If you use uni-directional routes, you may need to bcp in each one, but use the -m and -b switches to effectively ignore the errors. For instance, consider the following bcp command for rs_databases:

bcp rep_analysis..rs_databases in .\%RSSD%\rs_databases.bcp -Usa -P -S %DSQUERY% -b1 -m200 -c -t"|"

The difference is that now bcp will commit every row and ignore up to 200 errors before bcp aborts. This is important - without the -b1 setting, an error in a batch will cause the batch to rollback. By constraining the batch size to 1, only each individual duplicate row is rolled back. Additionally, by setting -m to an arbitrarily high value, bcp will not abort all processing due to the number of duplicates.

Similarly, rs_config is a bit strange. RS specific tuning parameters and default values have an objid value of 0x0000000000000000 while connection specific parameters will have the connection id in hex in the first four bytes - such as 0x0000007200000000 (0x72h is 114d - so this configuration value corresponds to dbid 114). The important thing to remember is that dbid’s are unique within a replication domain - across all the RS’s within that domain. So the sequence for bcp-ing in rs_config values is to do the following:

1. bcp in the rs_config table from the RS of interest - likely the RRS for a connection, although the PRS may be of interest if analyzing a route

2. bcp in the other rs_config tables using the trick above, but this time use -b1 and -m1000. The reason for the -m1000 is due to the number of default configuration values and server settings.
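As an illustration of the objid encoding described above, connection-specific rows can be resolved back to connection names in the repository with something like the following sketch. It assumes objid lands in the repository as a binary column and that rs_config carries optionname/charvalue and rs_databases carries dsname/dbname/dbid columns - verify against your extracted schema:

-- connection-specific configuration values, resolved to connection names
select d.dsname, d.dbname, c.optionname, c.charvalue
from rep_analysis..rs_config c, rep_analysis..rs_databases d
where convert(int, substring(c.objid, 1, 4)) = d.dbid
go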

After populating the tables, remember to run update statistics for all the tables - this can be done using a script such as:

update index statistics rs_statdetail using 1000 values
update index statistics rs_statrun using 1000 values
update index statistics rs_config using 1000 values
update index statistics rs_statcounters using 1000 values
update index statistics rs_databases using 1000 values
update index statistics rs_sites using 1000 values

Using a higher step count and using update index statistics vs. update statistics is important considering that a few hours of statistics gathered every minute could mean nearly 1 million rows in the rs_statdetail table.

Analyzing the Counters

It should be noted that both Bradmark and Quest have GUI products that provide an off-the-shelf solution for monitoring RS performance. However, if you don’t have these utilities, you will need to get very familiar with the counter schema, the individual counters and their relationships. These are described in more detail in the appropriate section later in this document - for example, the RepAgent User counters are described in the section on the RepAgent User processing. If not using one of the vendor tools to facilitate your analysis, a collection of stored procedures along with the repository schema has been uploaded to CodeXchange.
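If you are writing your own analysis queries against the repository, most of them boil down to joining the run, detail and counter tables. A minimal sketch is shown below; column names such as run_id, run_date, instance_id, counter_id, display_name and counter_total are assumptions based on the RS 15.0 schema and should be verified against the extracted tables:

select r.run_date, d.dbname, c.display_name, s.counter_total
from rs_statrun r, rs_statdetail s, rs_statcounters c, rs_databases d
where s.run_id      = r.run_id
  and s.counter_id  = c.counter_id
  and s.instance_id = d.dbid
  and c.display_name in ('CmdsTotal', 'PacketsReceived', 'UpdsRslocater')
order by d.dbname, r.run_date, c.display_name
go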

RS_Ticket

In Replication Server 12.6, a new feature called rs_ticket was added – unfortunately too late for the documentation (it is documented in the RS 15.0 documentation). Perhaps it will become one of the most useful timing mechanisms – replacing heartbeats, etc. – as it records a timestamp for every thread that touches the rs_ticket record – from the execution time in the primary database, through the RepAgent processing time and the various threads within Replication Server – and finally the destination database.

Pre- RS_Ticket

Prior to RS_Ticket, the only timing mechanisms were the RSM heartbeat mechanism or the use of a manually created “ping”/”latency” table. An example of the latter is illustrated below:

-- latency check table definition
create table rep_ping (
    source_server    varchar(32)   default "SLEDDOG"      not null,
    source_db        varchar(32)   default db_name()      not null,
    test_string      varchar(255)  default "hello world"  not null,
    source_datetime  datetime      default getdate()      not null,
    dest_datetime    datetime      default getdate()      not null
)
go
create unique clustered index rep_ping_idx on rep_ping (source_server, source_db, source_datetime)
go
-- latency check table repdef
create replication definition rep_ping_rd
    with primary at SLEDDOG_WS.smp_db
    with all tables named dbo.rep_ping (
        source_server    varchar(32),
        source_db        varchar(32),
        test_string      varchar(255),
        source_datetime  datetime
    )
    primary key (source_server, source_db, source_datetime)
    searchable columns (source_server, source_db, source_datetime)
    send standby replication definition columns
    replicate minimal columns
go

By creating this table and replication definition and subscribing to it for normal replication, any insert into the primary server would be propagated to the replicate(s). Since the destination datetime column was excluded from the definition, the replicate column would be populated using its default value – which would be the current time at execution. This could be significantly more accurate than using the rs_lastcommit table's values, which may reflect a long-running transaction.
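Using the rep_ping table above, a latency check is simply an insert at the primary followed by a query at the replicate; since dest_datetime is filled in by its default at apply time, the difference between the two datetime columns approximates the end-to-end latency:

-- at the primary: the column defaults fill in everything else
insert into rep_ping (test_string) values ("latency check")
go
-- at the replicate: approximate end-to-end latency per ping row
select source_server, source_db, source_datetime,
       latency_ms = datediff(ms, source_datetime, dest_datetime)
from rep_ping
order by source_datetime desc
go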

While useful, such implementations were sorely lacking as they really didn't help identify where the latency was occurring. Hence RS engineering decided to add a more explicit timing mechanism that would help identify exactly where the latency is.

Rs_ticket setup

Rs_ticket is implemented as a stored procedure at the primary as well as a corresponding procedure (and usually a table) at the replicate. The full setup procedure is as follows:

1. Verify that rs_ticket is in the primary database – if not, extract from RS 12.6 rsinspri (rs_install_primary) script in $SYBASE/$SYBASE_REP/scripts. It should not be marked for replication as it uses the rs_marker routine that is marked for replication.

2. Customize the rs_ticket_report procedure at the replicate database(s). A sample is below. You will also need to develop a parsing stored procedure (also below).

3. Enable rs_ticket at the replicates by 'alter connection to srv.rdb1 set “dsi_rs_ticket_report” to “on”'

A sample rs_ticket_report procedure is as follows:

if exists (select 1 from sysobjects
           where id = object_id('rs_ticket_history') and type = 'U')
    drop table rs_ticket_history
go
/*==============================================================*/
/* Table: rs_ticket_history                                     */
/*==============================================================*/
create table rs_ticket_history (
    ticket_num      numeric(10,0)  identity,
    ticket_date     datetime       not null,
    ticket_payload  varchar(1024)  null,
    constraint PK_RS_TICKET_HISTORY primary key (ticket_num)
) lock datarows
go
/*==============================================================*/
/* Index: ticket_date_idx                                       */
/*==============================================================*/
create index ticket_date_idx on rs_ticket_history (ticket_date ASC)
go
if exists (select 1 from sysobjects
           where id = object_id('rs_ticket_report') and type = 'P')
    drop procedure rs_ticket_report
go
create procedure rs_ticket_report
    @rs_ticket_param varchar(255)
as
begin
/*
** Name: rs_ticket_report
**    Append PDB timestamp to rs_ticket_param.
**    DSI calls rs_ticket_report if DSI_RS_TICKET_REPORT in on.
**
** Parameter
**    rs_ticket_param: rs_ticket parameter in canonical form.
**
**    rs_ticket_param Canonical Form
**    rs_ticket_param ::= <section> | <rs_ticket_param>;<section>
**    section         ::= <tagxxx>=<value>
**    tag             ::= V | H | PDB | EXEC | B | DIST | DSI | RDB | ...
**    Version value   ::= integer
**    Header value    ::= string of varchar(10)
**    DB value        ::= database name
**    Byte value      ::= integer
**    Time value      ::= hh:mm:ss.ddd
**
** Note:
**    1. Don't mark rs_ticket_report for replication.
**    2. DSI calls rs_ticket_report if DSI_RS_TICKET_REPORT in on.
**    3. This is an example stored procedure that demonstrates how to
**       add RDB timestamp to rs_ticket_param.
**    4. One should customize this function for parsing and inserting
**       timestamp to tables.
*/
set nocount on
declare @n_param varchar(2000),
        @c_time  datetime
-- @n_param = "@rs_ticket_param;RDB(name)=hh:mm:ss.ddd"
select @c_time = getdate()
select @n_param = @rs_ticket_param + ";RDB(" + db_name() + ")="
                + convert(varchar(8), @c_time, 8) + "."
                + right("00" + convert(varchar(3),datepart(ms,@c_time)), 3)
-- for rollovers, add date and see if greater than getdate()
-- print @n_param
insert into rs_ticket_history (ticket_date, ticket_payload)
    values (@c_time, @n_param)
end
go
if exists (select 1 from sysobjects
           where id = object_id('parse_rs_tickets') and type = 'P')
    drop procedure parse_rs_tickets
go
if exists (select 1 from sysobjects
           where id = object_id('sp_time_diff') and type = 'P')
    drop procedure sp_time_diff
go
create proc sp_time_diff
    @time_begin time,
    @time_end   time,
    @time_diff  time output
as
begin
declare @time_char varchar(20),
        @begin_dt  datetime,
        @end_dt    datetime
-- first get the hours...we need to check first for a rollover situation
-- to do this, we are going to cheat and add a date since time datatype
-- is a physical clock time vs. a duration (in otherwords 35 hours can not
-- be stored in a time datatype)
if (datepart(hh,@time_begin)>datepart(hh,@time_end))
begin


select @begin_dt=convert(datetime,"Jan 1 1900 " + convert(varchar(20),@time_begin,108)) select @end_dt=convert(datetime,"Jan 2 1900 " + convert(varchar(20),@time_end,108)) end else begin select @begin_dt=convert(datetime,"Jan 1 1900 " + convert(varchar(20),@time_begin,108)) select @end_dt=convert(datetime,"Jan 1 1900 " + convert(varchar(20),@time_end,108)) end select @time_char=right("00"+convert(varchar(2),abs(datediff(hh,@begin_dt,@end_dt))),2)+":" select @time_char=@time_char + right("00"+convert(varchar(2), abs(datediff(mi,@begin_dt,@end_dt))%60),2)+":" select @time_char=@time_char + right("00"+convert(varchar(2), abs(datediff(ss,@begin_dt,@end_dt))%60),2) select @time_diff=convert(time,@time_char) return 0 end go create proc parse_rs_tickets @last_two_only bit=1 as begin declare @pos int, @ticket_num numeric(10,0), @ticket_date datetime, @rs_ticket varchar(4096), @head_1 varchar(10), @head_2 varchar(10), @head_3 varchar(10), @head_4 varchar(50), @pdb varchar(30), @pdb_ts time, @exec_spid int, @exec_ts time, @exec_bytes int, @dist_spid int, @dist_ts time, @dsi_spid int, @dsi_ts time, @rdb varchar(30), @rdb_ts time, @last_row numeric(10,0), @next_last numeric(10,0), @ra_latency time, @rs_latency time, @tot_latency time create table #tickets ( ticket_num numeric(10,0) not null, head_1 varchar(10) not null, head_2 varchar(10) null, head_3 varchar(10) null, head_4 varchar(50) null, pdb varchar(30) null, pdb_ts time null, exec_spid int null, exec_ts time null, exec_bytes int null, exec_delay time null, dist_spid int null, dist_ts time null, dsi_spid int null, dsi_ts time null, rs_delay time null, rdb varchar(30) null, rdb_ts time null, tot_delay time null ) select @last_row=isnull(max(ticket_num),0) from rs_ticket_history select @next_last=isnull(max(ticket_num),-1) from rs_ticket_history where ticket_num < @last_row


declare rs_tkt_cursor cursor for select ticket_num, ticket_date, ticket_payload from rs_ticket_history where ((@last_two_only = 0) or ((@last_two_only=1) and ((ticket_num=@last_row) or (ticket_num=@next_last))) ) for read only open rs_tkt_cursor fetch rs_tkt_cursor into @ticket_num, @ticket_date, @rs_ticket while (@@sqlstatus=0) begin -- parse the first heading and then strip preceeding characters select @rs_ticket=substring(@rs_ticket,charindex("H1",@rs_ticket)+3,4096) select @pos=charindex(";",@rs_ticket) select @head_1=substring(@rs_ticket,1,@pos-1), @rs_ticket=substring(@rs_ticket,@pos+1,4096) -- parse out Heading 2 if it exists, else use null select @head_2=null, @pos=charindex("H2",@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket,@pos+3,4096) select @pos=charindex(";",@rs_ticket) select @head_2=substring(@rs_ticket,1,@pos-1), @rs_ticket=substring(@rs_ticket,@pos+1,4096) end -- parse out Heading 3 if it exists, else use null select @head_3=null, @pos=charindex("H3",@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket,@pos+3,4096) select @pos=charindex(";",@rs_ticket) select @head_3=substring(@rs_ticket,1,@pos-1), @rs_ticket=substring(@rs_ticket,@pos+1,4096) end -- parse out Heading 4 if it exists, else use null select @head_4=null, @pos=charindex("H4",@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket,@pos+3,4096) select @pos=charindex(";",@rs_ticket) select @head_4=substring(@rs_ticket,1,@pos-1), @rs_ticket=substring(@rs_ticket,@pos+1,4096) end -- parse the PDB select @rs_ticket=substring(@rs_ticket,charindex("PDB",@rs_ticket)+4,4096) select @pdb=convert(varchar(30),substring(@rs_ticket,1,charindex(')',@rs_ticket)-1)), @pdb_ts=convert(time,substring(@rs_ticket,charindex('=',@rs_ticket)+1,12)), @rs_ticket=substring(@rs_ticket,charindex(';',@rs_ticket)+1,4096) -- parse the EXEC select @rs_ticket=substring(@rs_ticket,charindex("EXEC",@rs_ticket)+5,4096) select @exec_spid=convert(int,substring(@rs_ticket,1,charindex(')',@rs_ticket)-1)), @exec_ts=convert(time,substring(@rs_ticket,charindex('=',@rs_ticket)+1,10)), @rs_ticket=substring(@rs_ticket,charindex(';',@rs_ticket)+1,4096) -- parse the EXEC bytes select @rs_ticket=substring(@rs_ticket,charindex("B(",@rs_ticket)+7,4096) select @exec_bytes=convert(int,substring(@rs_ticket,1,charindex(';',@rs_ticket)-1)), @rs_ticket=substring(@rs_ticket,charindex(';',@rs_ticket)+1,4096) -- parse out DIST if it exists, else use null select @dist_spid=null, @dist_ts=null, @pos=charindex("DIST",@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket,@pos+5,4096) select @dist_spid=convert(int,substring(@rs_ticket,1,charindex(')', @rs_ticket)-1)), @dist_ts=convert(time,substring(@rs_ticket, charindex('=',@rs_ticket)+1,10)), @rs_ticket=substring(@rs_ticket,charindex(';',@rs_ticket)+1,4096) end


-- parse the DSI select @rs_ticket=substring(@rs_ticket,charindex("DSI",@rs_ticket)+4,4096) select @dsi_spid=convert(int,substring(@rs_ticket,1,charindex(')',@rs_ticket)-1)), @dsi_ts=convert(time,substring(@rs_ticket,charindex('=',@rs_ticket)+1,10)), @rs_ticket=substring(@rs_ticket,charindex(';',@rs_ticket)+1,4096) -- parse the RDB select @rs_ticket=substring(@rs_ticket,charindex("RDB",@rs_ticket)+4,4096) select @rdb=convert(varchar(30),substring(@rs_ticket,1,charindex(')',@rs_ticket)-1)), @rdb_ts=convert(time,substring(@rs_ticket,charindex('=',@rs_ticket)+1,12)) -- calculate horizontal latency exec sp_time_diff @pdb_ts, @exec_ts, @ra_latency output exec sp_time_diff @exec_ts, @dsi_ts, @rs_latency output exec sp_time_diff @pdb_ts, @rdb_ts, @tot_latency output insert into #tickets (ticket_num,head_1,head_2,head_3,head_4,pdb, pdb_ts,exec_spid,exec_ts,exec_bytes,exec_delay, dist_spid,dist_ts,dsi_spid,dsi_ts,rs_delay, rdb,rdb_ts,tot_delay) values (@ticket_num,@head_1,@head_2,@head_3,@head_4,@pdb, @pdb_ts,@exec_spid,@exec_ts,@exec_bytes,@ra_latency, @dist_spid,@dist_ts,@dsi_spid,@dsi_ts,@rs_latency, @rdb,@rdb_ts,@tot_latency) -- parse the DIST if present fetch rs_tkt_cursor into @ticket_num, @ticket_date, @rs_ticket end close rs_tkt_cursor deallocate cursor rs_tkt_cursor select ticket_num, head_1, head_2, head_3, head_4, pdb_time=convert(varchar(15),pdb_ts,9), exec_time=convert(varchar(15),exec_ts,9), exec_delay=convert(varchar(15),exec_delay,8), exec_bytes, dist_time=convert(varchar(15),dist_ts,9), dsi_time=convert(varchar(15),dsi_ts,9), rs_delay=convert(varchar(15),rs_delay,8), rdb,rdb_time=convert(varchar(15),rdb_ts,9), tot_delay=convert(varchar(15),tot_delay,8) from #tickets order by ticket_num drop table #tickets return 0 end go

Executing rs_ticket

Executing the rs_ticket proc is easy – it takes four optional parameters that become the headers for the ticket records:

create procedure rs_ticket
    @head1 varchar(10) = "ticket",
    @head2 varchar(10) = null,
    @head3 varchar(10) = null,
    @head4 varchar(50) = null
as
begin
…

The full "ticket" when built and inserted into the replicate database may look like the following:

** rs_ticket parameter Canonical Form
** rs_ticket_param ::= <section> | <rs_ticket_param>;<section>
** section         ::= <tagxxx>=<value>
** tag             ::= V | H | PDB | EXEC | B | DIST | DSI | RDB | ...
** Version value   ::= integer
** Header value    ::= string of varchar(10)
** DB value        ::= database name
** Byte value      ::= integer
** Time value      ::= hh:mm:ss.ddd

V=1;H1=start;PDB(pdb1)=21:25:28.310;EXEC(41)=21:25:28.327;B(41)=324;DIST(24)=21:25:29.211;DSI(39)=21:25:29.486;RDB(rdb1)=21:25:30.846

The description is as follows:

Tag Description (parenthesis) Value

V Rs_ticket version n/a 1 (current version of format)

H1 Header #1 n/a First header value

H2 Header #2 n/a Second header value

H3 Header #3 n/a Third header value

H4 Header #4 n/a Fourth header value

PDB Primary Database DB name Timestamp of PDB rs_ticket execution

EXEC RepAgent User Thread EXEC RS spid Timestamp processed by EXEC

B Bytes EXEC RS spid Bytes process by EXEC

DIST Distributor Thread DIST RS spid Timestamp processed by DIST

DSI DSI Thread DSI RS spid Timestamp processed by DSI-S

RDB Replicate Database RDB name Timestamp of insert at RDB

The “Header” values are optional values supplied by the user to help distinguish which rows bracket the timing interval. A sample execution might look like:

exec rs_ticket "start"
-- (run replication benchmarks, DML, whatever)
exec rs_ticket "stop"

rs_ticket tips

There are a couple of pointers about rs_ticket that should be discussed:

• Synchronize the clocks on the ASE & RS hosts!!!! The PDS, RS & RDS hosts should be within 1 sec of each other. This may have to be repeated often - while some systems automatically sync the clocks during boot, due to uptime or high clock drift they can be off by seconds by the end of the day.

• DIST will not send rs_ticket to the DSI unless there is at least one subscription from the replicate site.

• Do not use apostrophe/single or double quotation marks within the headers. For example, trying to use a header such as “Bob’s Test” will fail whereas “Bobs Test” is fine.

• Considering that the parsing routines look for semi-colons, you should avoid using semi-colons within the headers to avoid parsing problems.

• The DSI timestamp is the time that the DSI read the rs_ticket – which could be a few seconds before execution if there is a large DSI SQT cache.

• If using parallel DSI’s, the RDB timestamp is the time of the parallel DSI execution – which may be in advance of other statements that will need to be committed ahead of it. This means that the RDB time may be a few seconds off.

• If using routes, DSI time includes RSI & RRS DIST time. Currently, only the PRS DIST timestamps the ticket. The reason for this is that within the RRS DIST thread, only the MD module is executed, and rs_ticket processing occurs earlier in the DIST processing sequence.

rs_ticket Trace Flags

The rs_ticket can be printed into the Replication Server error log when tracing is enabled. Tracing can be enabled in the three modules that update the rs_ticket: EXEC (Rep Agent User), DIST (Distributor), and DSI (Data Server Interface). The syntax for the trace command is:

trace [ "on" | "off" ], [ "EXEC" | "DIST" | "DSI" ], print_rs_ticket

-- examples:
trace "on", "EXEC", print_rs_ticket
trace "on", "DIST", print_rs_ticket
trace "on", "DSI", print_rs_ticket

Note that what is printed into the errorlog is the contents of the ticket at that point - for example, the EXEC trace will only include the PDB and EXEC timestamp information. This technique can be extremely useful when running benchmarks or trying to see when a table is quiesced - simply invoke rs_ticket and wait for the DSI trace record to appear in the errorlog.

Analyzing RS_Ticket

When comparing RS tickets, there are three types of calculations that can be performed: horizontal, vertical and diagonal. Each of these is described in the following sections.

Horizontal

Horizontal calculations refer to the difference in time between two threads in the same rs_ticket row. This is termed “pipeline delay” as it shows the latency between threads within the pipeline. For example, consider the following rs_ticket output (from two executions):

-- beginning
V=1;H1=start;PDB(pdb1)=21:25:28.310;EXEC(41)=21:25:28.327;B(41)=324;DIST(24)=21:25:29.211;DSI(39)=21:25:29.486;RDB(rdb1)=21:25:30.846
-- end
V=1;H1=stop;PDB(pdb1)=21:25:39.406;EXEC(41)=21:32:03.200;B(41)=20534;DIST(24)=21:33:43.323;DSI(39)=21:34:08.466;RDB(rdb1)=21:34:20.103

Note the PDB and EXEC timestamps in each row. If we subtract the two in the "beginning" row, we notice that the time between when the command was executed and when the RS received it from the RepAgent was nearly immediate. In the "end" row, however, there is a difference of ~6.5 minutes – thus showing that by the end of the sample period, the RepAgent was running approximately 6.5 minutes behind transaction execution. This could be due to a bulk operation (i.e. a single update that impacted 100,000 rows) that resulted in the RepAgent being behind temporarily, a slow inbound queue write speed, or simply poor configuration. Reviewing RS monitor counter data will help to determine the actual cause.

Overall end-to-end latency can be observed by comparing the PDB & RDB values in the "end" row – which shows roughly 9 minutes of latency overall. With 6.5 minutes of that latency within the RepAgent processing, attempting to tune the RS components will not achieve a significant improvement.
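To make the arithmetic concrete, the "end" row works out roughly as follows (values taken directly from the sample output above):

RepAgent latency   = EXEC - PDB = 21:32:03.200 - 21:25:39.406  ≈  6 min 24 sec
End-to-end latency = RDB  - PDB = 21:34:20.103 - 21:25:39.406  ≈  8 min 41 sec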

Vertical

Vertical calculations show the time it takes for a single thread to process all of the activity between the two timestamps. This is termed "module time" as it shows how long a particular module was active. Note, this is a latency figure and does not imply that the module was completely consuming all cpu during that time – the delay may have been caused by a pipeline delay. Using the same output as above, consider the various threads.

-- beginning
V=1;H1=start;PDB(pdb1)=21:25:28.310;EXEC(41)=21:25:28.327;B(41)=324;DIST(24)=21:25:29.211;DSI(39)=21:25:29.486;RDB(rdb1)=21:25:30.846
-- end
V=1;H1=stop;PDB(pdb1)=21:25:39.406;EXEC(41)=21:32:03.200;B(41)=20534;DIST(24)=21:33:43.323;DSI(39)=21:34:08.466;RDB(rdb1)=21:34:20.103

By comparing the PDB timestamps between the two, we notice that the total test time was approximately 11 seconds of execution time at the primary. Now then, it gets a bit tricky. If we further look at the EXEC vertical calculation, we will see a delay of ~6.5 minutes as we noted earlier from the horizontal calculation. Taking one step further, we can notice that the DIST vertical calculation is ~8 minutes. If we subtract the two, we notice that the DIST thread adds about 1.5 minutes of processing to the overall problem. This may be an indication of one of three possibilities (in order of likelihood):

1. The commands between the two RS tickets included a large transaction – which likely could delay the DIST receiving the commands as the SQT has to wait to see the commit record before even starting to pass the commands to the DIST (likelihood: 60%)

2. The outbound queue SQM is overburdened for the associated device speed, thus slowing the delivery rate of the DIST to the outbound queue (likelihood: 35%)


3. Due to insufficient STS cache, the DIST had to resort to fetching repdef & subscription metadata from the RSSD (likelihood 5%)

By analyzing RS monitor counters, we can determine which of these are applicable.

Diagonal

In the last example, we came close to performing a diagonal calculation. A diagonal calculation is termed "cross module time" and refers to latency that can be the result of waiting for access to the thread (messages cached in thread queues).

-- beginning
V=1;H1=start;PDB(pdb1)=21:25:28.310;EXEC(41)=21:25:28.327;B(41)=324;DIST(24)=21:25:29.211;DSI(39)=21:25:29.486;RDB(rdb1)=21:25:30.846
-- end
V=1;H1=stop;PDB(pdb1)=21:25:39.406;EXEC(41)=21:32:03.200;B(41)=20534;DIST(24)=21:33:43.323;DSI(39)=21:34:08.466;RDB(rdb1)=21:34:20.103

For example, in the above, the DIST starts sending data to the DSI ~8.5 minutes prior to the DSI receiving the last of the rows from the DIST. In this case, this is important. As we noted earlier, the RepAgent latency was about 6.5 minutes while the DIST processing added 1.5 minutes for a total of 8 minutes. This means that the DIST saving the data to the outbound queue and the DSI reading the commands from the outbound queue only added about 30 seconds to the overall processing. As you can see, the most useful aspect of diagonal calculations will be in determining the impact of the modules which we don’t have timestamps for – namely the SQM module(s).


Inbound Processing

What comes in… Earlier we took a look at the internal Replication Server threads in a drawing similar to the following:

Figure 15 – Replication Server Internals: Inbound and Outbound Processing

In the above copy of the diagram, note that the threads have been divided into inbound and outbound processing along the dashed line from the upper-left to lower-right. An important distinction – and one that many do not understand – is that the inbound threads belong to the connection from the source (primary) database, which is a different connection than the outbound group of threads serving each destination. Consequently, as multiple destinations are added, the same set of inbound threads is used to deliver the data to all of the various sets of outbound threads for each connection.

In the sections below, we will be addressing the three main threads in the inbound processing within Replication Server. In previous versions of this document, the RepAgent User thread was not discussed; however, with RS 12.1 some additional tuning parameters were added specifically for it, so it is now included.

RepAgent User (Executor)

The RepAgent User thread has been named various things during the Replication Server’s lifetime. It originally started as the Executor thread, followed by the LTM User thread, and lastly, the RepAgent User thread. The reason for this is that there actually are two different types of Executor threads – LTM-User for Replication Agents and RSI-User for Replication Server connections. Replication Server will determine which type of thread each Executor is simply by the “connect source” command that is sent. However, many of the trace flags and configuration commands are specified at the “Executor” thread generically and affect both threads. Such commands will often refer to this RS thread module as EXEC. For this module, we will simply be discussing the LTM-User or RepAgent User type of Executor thread.

RepAgent User Thread Processing

The Executor thread's processing is extremely simple. It receives LTL from the Replication Agent, parses and normalizes the LTL, packs it into binary format, and then passes it to the SQM to be written to disk. The full sequence of steps is as follows:

1. Parse LTL received from the Rep Agent.

2. Normalize the LTL – this involves comparing columns and datatypes in the LTL to those in the replication definition. An extremely important part and fairly cpu intensive, normalization includes:

   a. Columns in the LTL stream need to be matched with those in the repdef, and those excluded from the repdef need to be excluded from the queue.

   b. Column mapping needs to be performed for any renamed columns.

   c. Multiple repdefs – if more than one repdef exists for the object, the EXEC thread needs to put multiple rows in the inbound queue.

   d. Primary key columns need to be located as they are stored separately in the row to speed SQL generation at the DSI.

   e. Minimal column comparisons need to be performed and unchanged, non-key columns eliminated from the stream.

   f. If autocorrection is enabled for the particular repdef, updates need to be translated into a separate delete followed by an insert.

   g. Duplicate detection (OQID comparison) needs to be done to ensure that duplicate records are not written to the queue.

3. Pack the commands in binary form and place them on the SQM's queue. If more than one replication definition is available, one command for each will be written to the queue. If the SQM's pending writes are greater than exec_sqm_write_request_limit (RS 12.1+), the Rep Agent User thread is put to sleep.

4. Periodically, update the rs_oqids & rs_locater tables in the RSSD with the latest OQID to ensure recovery.

This is illustrated in the following diagram:

Figure 16 – Rep Agent User Thread Processing

A key feature added in RS 12.1 was that writers to the SQM could cache pending writes in the respective writer's cache - either the RepAgent User thread or the Distributor's MD module. By default, this was set to a single block (16K), with a maximum of 60 blocks or 983,040 bytes (raised to 2GB in RS 12.6 ESD 7 and RS 15.0). For Rep Agent User threads, this cache limit is controlled by the exec_sqm_write_request_limit parameter. Once this limit has been reached, further attempts to insert write requests on the SQM Write Request queue will be rejected and the Rep Agent User thread is put to sleep.

The parsing and normalization process can be fairly cpu intensive and is essentially synchronous in processing transactions from the Replication Agent all the way to the SQM. Accordingly, you can control this by adjusting the parameter exec_cmds_per_timeslice (RS 12.1+) which controls how often the Rep Agent User thread will yield the cpu. While lowering it may have some impact, raising it frequently has little impact. The reason for this behavior is that the RepAgent User Thread often has very little work to do – as will be illustrated in the section on the monitor counters later.

While it is true that Open Server messages are used to prevent it from being completely synchronous, the simple fact is that each transfer from the Replication Agent must be written to disk: the small default buffer size in the Rep Agent User thread (exec_sqm_write_request_limit defaults to 16K) essentially requires a flush to disk. Consequently, at the end of each transfer, the Replication Agent waits for an acknowledgement not only that the LTL was received, but also (in effect) that it was written to disk, as the RepAgent User thread does not acknowledge to the Replication Agent that the transfer is complete until then. This may seem duplicative given the scan_batch_size and secondary truncation point movement, but in a sense it is not quite. The secondary truncation point and OQID synchronization take more work, as the RSSD update is involved and a specific log page correlation is made. Given that LTL could exceed the log page, or due to text/image replication, ensuring that the LTL is written to disk for each transfer means a faster recovery.

RepAgent User Tuning

Unlike the SQM and SQT threads, in RS 12.0 and prior there were no specific commands to analyze the performance of the Executor thread, nor any tuning configurations. With RS 12.1, several tuning configuration parameters were added:

Parameter: exec_cmds_per_timeslice
RS: 12.1
Default: 5; Min: 1; Max: 2147483648; Recommendation: 20
Explanation: Specifies the number of LTL commands an LTI or RepAgent Executor thread can process before it must yield the CPU to other threads. You can set exec_cmds_per_timeslice for all Replication Server Executor threads using configure replication server or for a particular connection using configure connection.

Parameter: exec_sqm_write_request_limit
RS: 12.1
Default/Min: 16384 (1 SQM block); Max: 983040 (60 SQM blocks); Recommendation: 983040. Note: in 12.6 ESD #7 and 15.0 ESD #1, the max has been increased to 2GB; the recommendation for these versions is 2-4MB.
Explanation: Controls the amount of memory available to an LTI or RepAgent Executor thread for messages waiting in the inbound queue before the SQM writes them out. If the amount of memory allocated by the LTI or RepAgent Executor thread exceeds the configured pool value, the thread sleeps until the SQM writes some of its messages and frees memory in the pool. You can set exec_sqm_write_request_limit for the Replication Server using configure replication server. The larger the value you assign to exec_sqm_write_request_limit, the more work the Executor thread can perform before it must sleep until memory is released.

Setting exec_sqm_write_request_limit is easy – set it to the maximum that memory will allow, ensuring that the setting is an even number of SQM blocks (i.e. a multiple of 16384) to ensure that memory is effectively utilized. The only downside to increasing the exec_sqm_write_request_limit is that if the RepAgent connection fails and the RepAgent tries to reconnect, it will not be able to until the full cache of write requests have been saved to the inbound queue. Given that the average production system table is likely 1KB per row or more as formatted by the RS, in all likelihood, a full 983,040 bytes of exec_sqm_write_request_limit is likely less than 1,000 replicated commands - which should take less than a second to save to the inbound queue.

On the other hand, the exec_cmds_per_timeslice is a bit more difficult. As mentioned earlier, the parsing and normalization process can be CPU intensive. As a result, since it may always have work to do in a high volume situation, it may be robbing CPU time from the DIST or DSI threads. Consequently, if it should appear that data is backing up in the inbound queue and all applicable SQT tuning (below) has been performed, or if the DSI connections show a lot of “awaiting command” at the replicate (taking into account the dsi_serialization_method as discussed in the section on Parallel DSI), you may want to lower this number. On the other hand, if the Replication Agent is getting behind (a much more normal problem), you may want to raise exec_cmds_per_timeslice.
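As a sketch of the tuning commands themselves (the server and connection names below are hypothetical), both parameters are set from isql connected to the Replication Server:

-- RS-wide: size the write request cache as a multiple of 16384 bytes
configure replication server set exec_sqm_write_request_limit to '983040'
go
-- per-connection: raise the timeslice for a busy primary connection
configure connection to CHINOOK.pdb1 set exec_cmds_per_timeslice to '20'
go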

However, there are a few implementation considerations that can also improve performance. Consider the following:

• Create repdefs in the same column order as the table definition (speeds normalization).
• Don't use multiple repdefs for high volume tables unless absolutely necessary (doubles I/O).
• Do not leave autocorrection on any longer than necessary (doubles I/O for insert and update statements).

RepAgent User Thread Counters

In RS 12.1, several counters specifically for the RepAgent User thread were added. In RS 12.6, 8 additional counters were added and some of the original counters were renamed for clarity.

RepAgent User Thread Monitor Counters

The full list of RS 12.6 counters is:

Display Name Explanation

CmdsTotal Total commands received by a Rep Agent thread.


CmdsApplied Total applied commands written into an inbound queue by a Rep Agent thread. Applied Commands are applied as the maintenance user.

CmdsRequest Total request commands written into an inbound queue by a Rep Agent thread. Request Commands are applied as the executing request user.

CmdsSystem Total Repserver system commands written into an inbound queue by a Rep Agent thread.

CmdsMiniAbort Total 'mini-abort' commands (in ASE, SAVEXACT records) processed by a Rep Agent thread. Mini-abort instructs Repserver to rollback commands to a specific OQID value.

CmdsDumpLoadDB Total 'dump database log' (in ASE, SYNCDPDB records) and 'load database log' (in ASE, SYNCLDDB records) commands processed by a Rep Agent thread.

CmdsPurgeOpen Total CHECKPOINT records processed by a Rep Agent thread. CHECKPOINT instructs Repserver to purge to a specific OQID value.

CmdsRouteRCL Total create, drop, and alter route requests written into an inbound queue by a Rep Agent thread. Route requests are issued by RS user.

CmdsEnRepMarker Total enable replication markers written into an inbound queue by a Rep Agent thread. Enable marker is sent by executing the rs_marker stored procedure at the active DB.

UpdsRslocater Total updates to RSSD..rs_locater where type = 'e' executed by a Rep Agent thread.

PacketsReceived Total number of protocol packets rcvd by a Rep Agent thread when in passthru mode. When not in passthru mode, RepServer receives chunks of lang data at a time. For packet size, see counter 'PacketSize'. Lang 'chunk' size is fixed at 255 bytes.

BytesReceived Total bytes received by a Rep Agent thread. This size includes the TDS header size when in 'passthru' mode.

PacketSize In-coming connection packet size. RepAgent/ASE 12.0 or earlier versions used a hard coded 2K packet size. Later releases will allow you to change the packet size.

BuffersReceived Total number of command buffers received by a RepAgent thread. Buffers are broken into packets when in 'passthru' mode, or language 'chunks' when not in 'passthru' mode. See counter 'PacketsReceived' for these numbers.

EmptyPackets Total number of empty packets received in 'passthru' mode by a Rep Agent thread. These are 'forced' EOM's. See counter 'PacketsReceived' for these numbers.

RAYields Total number of times a RepAgent Executor thread yielded its time on the processor while handling LTL commands.

RAYieldTimeAve (intrusive)

The average amount of time the RepAgent spent yielding the processor while handling LTL commands each time the processor was yielded.

RAWriteWaits Total number of times a RepAgent Executor thread had to wait for the SQM Writer to drain the outstanding write requests below the threshold.

RAWriteWaitsTimeAve (intrusive)

The average amount of time the RepAgent spent waiting for the SQM Writer thread to drain the number of outstanding write requests to get the number of outstanding bytes to be written under the threshold.

CmdsSQLDDL Total Repserver SQLDDL commands written into an inbound queue by a Rep Agent thread.


RSTicket Total rs_ticket markers processed by a Rep Agent's executor thread.

For a typical source database, the counters used in the derived metrics discussed below (such as CmdsTotal, PacketsReceived, BytesReceived, UpdsRslocater, RAYields and RAWriteWaits) are the ones to watch.

Replication Server 15.0 had a few differences and added a few counters:

Display Name Explanation

CmdsRecv Commands received by a Rep Agent thread.

CmdsApplied Applied commands written into an inbound queue by a Rep Agent thread. Applied Commands are applied as the maintenance user.

CmdsRequest Request commands written into an inbound queue by a Rep Agent thread. Request Commands are applied as the executing request user.

CmdsSystem Repserver system commands written into an inbound queue by a Rep Agent thread.

CmdsMiniAbort 'mini-abort' commands (in ASE, SAVEXACT records) processed by a Rep Agent thread. Mini-abort instructs Repserver to rollback commands to a specific OQID value.

CmdsDumpLoadDB 'dump database log' (in ASE, SYNCDPDB records) and 'load database log' (in ASE, SYNCLDDB records) processed by a Rep Agent thread.

CmdsPurgeOpen CHECKPOINT records processed by a Rep Agent thread. CHECKPOINT instructs Repserver to purge to a specific OQID value.

CmdsRouteRCL Create, drop, and alter route requests written into an inbound queue by a Rep Agent thread. Route requests are issued by RS user.

CmdsEnRepMarker Enable replication markers written into an inbound queue by a Rep Agent thread. Enable marker is sent by executing the rs_marker stored procedure at the active DB.

UpdsRslocater Updates to RSSD..rs_locater where type = 'e' executed by a Rep Agent thread.

PacketsReceived Number of protocol packets rcvd by a Rep Agent thread when in passthru mode. When not in passthru mode, RepServer receives chunks of lang data at a time. For packet size, see counter 'PacketSize'. Lang 'chunk' size is fixed at 255 bytes.

BytesReceived Bytes received by a Rep Agent thread. This size includes the TDS header size when in 'passthru' mode.

PacketSize In-coming connection packet size. RepAgent/ASE 12.0 or earlier versions used a hard coded 2K packet size. Later releases will allow you to change the packet size.

BuffersReceived Number of command buffers received by a RepAgent thread. Buffers are broken into packets when in 'passthru' mode, or language 'chunks' when not in 'passthru' mode. See counter 'PacketsReceived' for these numbers.

EmptyPackets Number of empty packets received in 'passthru' mode by a Rep Agent thread. These are 'forced' EOM's. See counter 'PacketsReceived' for these numbers.

RAYieldTime The amount of time the RepAgent spent yielding the processor while handling LTL commands each time the processor was yielded.

RAWriteWaitsTime The amount of time the RepAgent spent waiting for the SQM Writer thread to drain the number of outstanding write requests to get the number of outstanding bytes to be written under the threshold.

CmdsSQLDDL RepServer SQLDDL commands written into an inbound queue by a Rep Agent thread.


RSTicket rs_ticket markers processed by a Rep Agent's executor thread.

RepAgentRecvPcktTime The amount of time, in 100ths of a second, spent receiving network packets.

Note that the “Total”, “Avg” and other aggregate suffixes (and counters) have been removed as these are available from the counter_total, counter_max, counter_last and counter_avg=counter_total/counter_obs columns in the rs_statdetail table for RS 15.0. There is one new counter added - the last one in the list: RepAgentRecvPcktTime. This can be interesting to use to determine how busy the RepAgent is on network processing time vs. waiting on writes, etc. Note also that counters RAYields and RAWriteWait appear to have been removed - which may be surprising considering the relative importance of them. However, both counters can be obtained as the number of observations for RAYieldTime and RAWriteWaitTime (counter_obs).

Obviously, the goal would be to increase the number of commands processed during a given period – assuming the commands are equal and transaction rate the same. The RA thread has a number of counters that are of special interest to us and can help us try to improve this rate. Consider the following list (note that most are derived by combining more than one counter):

CmdsPerSec = CmdsTotal/seconds
CmdsPerPacket = CmdsTotal/PacketsReceived
CmdsPerBuffer = CmdsTotal/BuffersReceived (Mirror Rep Agent & Heterogeneous Rep Agents)
PacketsPerBuffer = PacketsReceived/BuffersReceived (Mirror Rep Agent & Heterogeneous Rep Agents)
UpdsRslocaterPerMin = UpdsRslocater/minutes
ScanBatchSize = CmdsTotal/UpdsRslocater
RAYieldsPerSec = RAYields/seconds
RA_ECTS = CmdsTotal/RAYields
RAWriteWaits

The first one (CmdsPerSec) should be fairly obvious – we are getting a normalized rate that we can use to track the throughput into RS. CmdsPerPacket is an interesting statistic. One would suspect this to be fairly high, but most often with the default 2K packet size and fairly large table sizes (when column names are included), most production sites find themselves only processing 2-3 commands per packet – and since this includes begin/commit commands, really identifies the first bottleneck. Increasing the Rep Agent packet size by changing the ASE rep agent ‘send buffer size’ configuration parameter helps this out tremendously. Note that heterogeneous replication agents and the Mirror Replication Agent (MRA) all use the concept of an LTL buffer that is different in size than the packet size. For example, the MRA has a default ltl_batch_size of 40,000 bytes and a default rs_packet_size of 2048. For ASE Rep Agent Thread, since the packet and buffer size are the same, you would expect the PacketsPerBuffer to be the same (and they are) - for a ratio of 1. For the MRA and heterogeneous replication agents, you may look at these two counters and determine if tuning them is appropriate. Minimally, raising the MRA rs_packet_size to 8192 or 16384 is suggested. Note that as of MRA 12.6, the MRA appears to be a bit “chatty” - using tens of packets per buffer - which artificially lowers the CmdsPerPacket ratio to considerably less than 1.

UpdsRslocaterPerMin and ScanBatchSize work together to identify when the Rep Agent scan batch size configuration should be adjusted. Yes, this does relate to recovery speed of ASE – but think about it. Is the difference of 1 minute really a big problem?? If not, then increasing the scan batch size to drive UpdsRslocaterPerMin towards 1 (likely impossible to get there) is the goal. However, on really busy systems, you will find out that even if you set scan batch size to 20,000, you will still see 10 or more updates per minute – which means recovery is only affected by a few seconds. However, setting scan_batch_size to really high values can be detrimental on low volume systems. If during peak processing, you don’t see any updates to the rs_locater within 2-3 minutes, you likely have scan_batch_size set too high.
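On the ASE side, both of the Replication Agent settings discussed here are changed with sp_config_rep_agent; the database name and values below are illustrative only, and the valid value list (and whether a Rep Agent restart is required) should be checked for your ASE version:

use pdb1
go
exec sp_config_rep_agent pdb1, 'send buffer size', '8K'
go
exec sp_config_rep_agent pdb1, 'scan batch size', '20000'
go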

RAYields is the number of times the RS RA User thread yielded the cpu to another module – and is very interesting. First, the number of yields per second gives a good indication of how much or how little cpu time the RA User thread is getting. Secondly, when compared with the number of commands received (via RA_ECTS), we can see how the configuration parameter exec_cmds_per_timeslice (aka ECTS) is helping or hurting us. A good goal is to get 8-10 commands per packet – but what good is that goal if the default exec_cmds_per_timeslice is still at 5, which means that partway through processing the packet the RA thread yields the cpu?

However, the one that is most interesting is RAWriteWaits – it signals how often the RA thread had to wait when writing to the inbound queue. This is a factor of how much cache is available (exec_sqm_write_request_limit) as well as the values for init_sqm_write_delay/init_sqm_write_max_delay.
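A minimal repository sketch for the first of these derived metrics is shown below; it assumes counters were flushed on a fixed interval (60 seconds here, purely an assumption) and uses the RS 15.0 column names assumed earlier:

-- commands per second per flush interval (interval length is an assumption)
declare @flush_interval int
select @flush_interval = 60
select r.run_date, cmds_per_sec = s.counter_total / @flush_interval
from rs_statrun r, rs_statdetail s, rs_statcounters c
where s.run_id     = r.run_id
  and s.counter_id = c.counter_id
  and c.display_name = 'CmdsTotal'
order by r.run_date
go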


RepAgent User Thread Counter Usage

Perhaps the best way to use the counters is to look at them in terms of progression of the data from the source DB to the next thread (SQM). Consider the following sequence for RS 12.6:

1. The RepAgent User Thread receives a batch of LTL from the Replication Agent. Each LTL batch is a single LTL buffer that is sent using one or more packets to the RS. This causes the "network" counters BuffersReceived, PacketsReceived, BytesReceived and EmptyPackets to be incremented.

2. The RepAgent User Thread then parses the commands out of the buffer and each command is evaluated for type (i.e. is it a DML command that has to be passed to the SQM, or is it a locater request). This causes the various "Cmd" counters such as CmdsTotal, CmdsApplied, CmdsRequest, CmdsSystem, CmdsMiniAbort, CmdsDumpLoadDB, CmdsPurgeOpen, CmdsRouteRCL, CmdsEnRepMarker and CmdsSQLDDL to be incremented accordingly.

3. What happens next depends on the command:

a. In normal operations, it is likely that the command was a DML, DDL or system statement (miniAbort, Dump/Load, PurgeOpen, Route RCL, Enable Replication marker (rs_marker)). If so, a write request is issued to the SQM (assuming num_messages or exec_sqm_write_request_limit hasn't been reached) and processing continues.

b. If the command was a request for a new locater, the RepAgent User Thread determines which record was the last written to disk and updates the RSSD locater appropriately. This also increments the UpdsRslocater counter.

c. The command could be one of several different commands that the RepAgent User Thread needs to pass to other threads. For example, if a checkpoint record was received, in addition to incrementing CmdsPurgeOpen, the RA User Thread coordinates with the inbound SQM to purge all the open transactions to that point (this happens during ASE database recovery). Similar behavior applies to MiniAborts, Dump/Loads, etc.

d. If the command was an Enable Replication Marker (rs_marker), then the RepAgent User Thread coordinates setting the replication definition to the marker state (i.e. valid).

e. If the command was an rs_ticket (a form of rs_marker), the RepAgent User Thread appends its timestamp info along with byte counts and process id onto the rs_ticket record and sends it through to the SQM. This also updates the RSTicket counter.

4. Periodically, of course, the RepAgent User Thread will need to yield the CPU. This can happen for several reasons, but in each case, if intrusive counters are enabled, the counters RAYields and RAYieldTimeAve are incremented. The types of yields include:

a. The number of cmds processed has exceeded exec_cmds_per_timeslice.

b. As mentioned in 3(a), the exec_sqm_write_request_limit has been reached – at which point the SQM won't accept any more write requests, and the counters RAWriteWaits and RAWriteWaitsTimeAve are incremented.

c. RS scheduler driven yield – which is why setting exec_cmds_per_timeslice high may be of no effect as the RS may still slice out the RA User Thread to provide time for the other threads to run.

From this point processing is handed off to the SQM. Let’s take a look at some sample data. Note: in each section, the first set of data will be from real customer data and the second set will be from a wide row (30+ columns) insert speed test. For the first consideration, let’s look at the efficiency of the network processing between the RepAgent and the RepAgent User Thread for the customer data set:


Sample Time  PacketsReceived  CmdsTotal  Cmds/Pckt (derived)  Cmds/Sec (derived)  UpdsRslocater  Scan_batch_size (derived)  UpdsRslocater/Min (derived)
0:29:33      79,356           267,882    3.3                  889                 268            999.5                      53
0:34:34      93,852           364,632    3.8                  1,207               365            998.9                      72
0:39:37      71,669           253,283    3.5                  841                 254            997.1                      50
0:44:38      63,173           266,288    4.2                  881                 266            1,001.0                    52
0:49:40      63,086           253,531    4.0                  839                 253            1,002.0                    50
0:54:43      56,570           164,249    2.9                  545                 164            1,001.5                    32
0:59:45      108,667          375,512    3.4                  1,243               375            1,001.3                    74
1:04:47      101,507          450,749    4.4                  1,492               451            999.4                      89
1:09:50      92,022           326,619    3.5                  1,085               327            998.8                      65
1:14:52      81,852           325,148    3.9                  1,076               326            997.3                      64
1:19:54      78,507           317,559    4.0                  1,055               317            1,001.7                    63

As you can see from the derived columns above, sometimes the most useful information from the monitor counters comes from comparing two of them. Let’s explore some of these:

Cmds/Pckt – derived by dividing CmdsTotal by PacketsReceived. In this case we are seeing only about 3 commands per packet – not much work per packet and not very efficient. This system would likely benefit from raising the RepAgent configuration ltl_buffer_size, which controls the packet size sent to Replication Server (a configuration sketch follows after these metrics).

Cmds/Sec – derived from dividing CmdsTotal by the number of seconds between samples (rs_statrun). Note that this is an average – in other words, during the ~5 minute intervals, there may have been higher spikes and lulls in activity. However, it does show that the Replication Agent is feeding roughly 1,000 commands per second to the Replication Server. To sustain this without latency, we will need to ensure that each part of Replication Server can also sustain this rate.

Scan_batch_size – derived by dividing CmdsTotal by UpdsRslocater to get a representative number of commands sent to RS before the Replication Agent asks for a new truncation point. While this is an average, it does provide insight into the probable setting for the Replication Agent scan_batch_size – which in this case is likely set to 1,000. To see the effect of this, consider the next metric.

UpdsRslocater/Min – derived by dividing UpdsRslocater by the number of minutes between samples. This metric represents the SQL activity RS inflicts on the RSSD just to keep up with the truncation point. As you can see, it is updating the RSSD practically once per second. Again, this corresponds to the Replication Agent scan_batch_size configuration parameter. Some DBAs are reluctant to raise this for fear of the extra log space that may impact recovery times, etc. But if you think about it, in its current state we are moving the secondary truncation point every second – a bit of overkill. Increasing this to 10,000 would reduce the RSSD overhead considerably while only moving the secondary truncation point every 10 seconds or so – certainly not a huge impact on the transaction log.
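As a sketch of how both of these changes might be made (parameter names vary by Replication Agent type and ASE version – for the ASE RepAgent thread the packet size is typically controlled by 'send buffer size' – and 'pdb' is a hypothetical primary database name; verify the names and restart requirements against your documentation):

    -- run in the primary ASE
    sp_config_rep_agent pdb, 'send buffer size', '8192'   -- larger LTL packets to RS
    go
    sp_config_rep_agent pdb, 'scan batch size', '10000'   -- fewer truncation point requests
    go
    -- many RepAgent parameters take effect only after the RepAgent is restarted
    sp_stop_rep_agent pdb
    go
    sp_start_rep_agent pdb
    go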

Now, let’s look at a test system in which a small desktop system was stressed by doing a high rate of inserts on wide rows (32 columns). Ideally, we would like to compare the customer system before and after the Replication Agent configuration values were changed; however, this was not possible to obtain from the customer. So while not a true apples-to-apples comparison, it will be useful to compare the counter behavior. The Replication Agent configuration differences are: ltl_buffer_size=8192; scan_batch_size=20,000. Using the same metrics from above, we see:


Sample Time    Packets Received    CmdsTotal    Cmds/Pckt (derived)    Cmds/Sec (derived)    Upds Rslocater    Scan_batch_size (derived)    UpdsRslocater/Min (derived)

11:37:57 149 1,027 6.8 93 0 0 0

11:38:08 1,096 7,781 7 778 0 0 0

11:38:19 637 4,512 7 410 0 0 0

11:38:30 2,865 20,322 7 2,032 1 20,322 6

11:38:41 78 553 7 50 1 553 5

To see how these differences impact the system, let’s take a look at the CPU and write wait metrics from the RepAgent User Thread perspective – again looking at the customer system first:

Sample Time    Packets Received    CmdsTotal    RAYields    RA ECTS (derived)    WriteRequests (SQM)    RAWriteWaits    WriteWait% (derived)

0:29:33 79,356 267,882 42,984 6 268,187 32,040 11.9

0:34:34 93,852 364,632 58,811 6 364,705 35,479 9.7

0:39:37 71,669 253,283 36,820 6 253,283 20,243 8.0

0:44:38 63,173 266,288 39,084 6 266,334 14,859 5.6

0:49:40 63,086 253,531 39,804 6 253,684 20,673 8.1

0:54:43 56,570 164,249 25,347 6 164,566 22,528 13.7

0:59:45 108,667 375,512 59,447 6 376,184 38,279 10.2

1:04:47 101,507 450,749 72,149 6 450,809 32,790 7.3

1:09:50 92,022 326,619 45,778 7 326,750 28,127 8.6

1:14:52 81,852 325,148 47,273 6 325,340 22,201 6.8

1:19:54 78,507 317,559 39,971 7 317,674 14,817 4.7

Note that some of the columns are repeated for clarity - again we have some derived statistics.

RA ECTS – derived by dividing CmdsTotal by RAYields. This compares to the exec_cmds_per_timeslice configuration parameter, which has a default of 5. Note that in this case, using the default exec_cmds_per_timeslice, we are getting about 6 commands processed before the RA User thread yields. It may be that exec_cmds_per_timeslice is constraining the system, since we are so close to the default, or it may just be the thread scheduling.

WriteWait% – derived by dividing RAWriteWaits by the SQM counter WriteRequests. This is partially due to the fact that we have a default exec_sqm_write_request_limit of 16384 (1 block). Some of these waits are undoubtedly influencing the RA User Thread time slices.

Now, let’s look at the insert stress test. For this system, exec_cmds_per_timeslice is set to 20 and exec_sqm_write_request_limit is set to 983040 (the maximum) – other than the Rep Agent configurations mentioned earlier, no other tuning was done to the RepAgent User thread configurations.
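As a sketch of how those RS-side settings might be applied (the connection name SYDNEY_DS.pubs2 is hypothetical, and depending on RS version the connection or Rep Agent may need to be suspended/restarted for the changes to take effect):

    -- run from isql connected to the Replication Server
    alter connection to SYDNEY_DS.pubs2
        set exec_cmds_per_timeslice to '20'
    go
    alter connection to SYDNEY_DS.pubs2
        set exec_sqm_write_request_limit to '983040'
    go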


Sample Time    Packets Received    CmdsTotal    RAYields    RA ECTS (derived)    WriteRequests (SQM)    RAWriteWaits    WriteWait% (derived)

11:37:57 149 1,027 34 30 1,027 0 0.00

11:38:08 1,096 7,781 264 29 7,788 0 0.00

11:38:19 637 4,512 156 28 4,512 0 0.00

11:38:30 2,865 20,322 748 27 20,336 0 0.00

11:38:41 78 553 22 25 553 0 0.00

As you can see, the SQM WriteRequests are much lower, which may be why there are no RAWriteWaits – however, maxing out exec_sqm_write_request_limit may have helped as well. The interesting thing is that the average RA ECTS (again derived by dividing CmdsTotal by RAYields) is considerably higher than the configured value, suggesting that exec_cmds_per_timeslice acts as a limit when set below what the thread could otherwise achieve, but when CPU time is available the Rep Agent User thread can exceed the configured cap. This suggests that, for the customer system above, raising exec_cmds_per_timeslice – while a reasonable suggestion – may not help. However, some customers have reported benefits with exec_cmds_per_timeslice set as high as 100; it is unknown whether these were non-SMP systems, which could influence the behavior. Either the write waits or other CPU demands are causing the RA User thread to yield its timeslice.

RepAgent User/EXEC Traces

There are a number of trace flags that can be used to diagnose RepAgent and/or inbound SQM-related performance issues.

Module    Trace Flag    Description

EXEC EXEC_CONNECTIONS Traces LTM/Rep Agent connections

EXEC EXEC_TRACE_COMMANDS Traces LTL commands received by EXEC

EXEC EXEC_IGNORE_PAK_LTL RS behaves as data sink

EXEC EXEC_IGNORE_NRM_LTL Ignores Normalization in the LTL

EXEC EXEC_IGNORE_PRS_LTL Ignores Parsing of LTL commands

Note that each of the above requires use of the diag binary for Replication Server. As a result, these traces should only be used in a debugging environment, as the extra diagnostic code will have an impact on performance and log output (which can slow down the system). Some of the more useful traces are described below. For best understanding, refer back to the earlier illustration (pg 76) of the modules the EXEC thread executes.
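As a sketch (the exact mechanism varies by RS version; a common approach is to add the trace to the Replication Server .cfg file before booting the diag binary – treat the line below as illustrative and verify the syntax for your release):

    # in the Replication Server configuration (.cfg) file, then boot the diag binary
    trace=EXEC,EXEC_TRACE_COMMANDS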

EXEC_CONNECTIONS

If the RepAgent is having problems connecting to the RS, this trace can be useful to determine whether the correct password is being used, etc. The output in the errorlog is the RepAgent user login followed by the password – which can be compared to the RSSD values. Care should be taken, as the password will be output into the errorlog in clear text. You will probably want to change the errorlog location for any diagnostic binary boot anyway, due to the volume of output; if you use this trace, delete that errorlog afterward to avoid leaving passwords exposed.

EXEC_IGNORE_PAK_LTL

(WARNING: Results in data loss). At first glance, this seems misnamed; however, the step immediately prior to the RepAgent User thread passing the LTL to the SQM is packing it into binary format. Consequently, by enabling this trace flag, the LTL output will not be written to the inbound queue – however, the RepAgent User thread will still parse and normalize the LTL stream. This can be useful for eliminating SQM performance issues when debugging RepAgent performance problems (especially when the waits on CT-Lib are high).


EXEC_IGNORE_NRM_LTL

(WARNING: Results in data loss). This trace flag disables the normalization step within the RepAgent User thread. If you are positive that the replication definitions precisely match the table’s ordinal column definitions, this can be disabled without exec_ignore_pak_ltl. However, it is most useful in continuing to “step backward” to isolate RepAgent performance problems. By first disabling writes to the queue via exec_ignore_pak_ltl and then disabling normalization, you have eliminated the SQM and any normalization overhead (such as checking replication definitions from the RSSD) from the RepAgent LTL transmit sequence.

EXEC_IGNORE_PRS_LTL

(WARNING: Results in data loss). This trace flag disables parsing of the LTL commands received by the RepAgent User thread. When used with exec_ignore_pak_ltl and exec_ignore_nrm_ltl, the RepAgent User thread effectively throws the data away without even looking at it. Any network-oriented RepAgent performance issues that remain at this point are likely caused by network contention within ASE, the host machine(s), or the OCS protocol stack within the RS binary.

SQM Processing

The Stable Queue Manager (SQM) is the only module that interacts with the stable queue. As a result, it performs all logical I/O to the stable queue and, as one would suspect, is one of the focus points for performance discussions. However, SQM code is present in both the SQM and SQT on the inbound side of the connection, and in the SQM and DSI for the outbound (and Warm Standby) side of a connection. It is best to get a better understanding of the SQM module to see that the SQM thread, in itself, may not be contributing to slowdowns in inbound queue processing.

The SQM is responsible for the following:

Queue I/O - All reads, writes, deletes and queue dumps from the stable queue. Reads are typically done by a SQM Reader (SQT or DSI) using SQM module code - while the SQM is responsible for all write activity.

Duplicate Detection - Compares OQIDs from LTL to determine if an LTL log row is a duplicate of one already received.

Features of the SQM thread include support for:

Multiple Writers - While not as apparent in inbound processing, if the SQM is handling outbound processing, multiple sources could be replicating to the same destination (i.e. a corporate rollup).

Multiple Readers - More a function of inbound processing, a SQM can support multiple threads reading from the inbound queue. This includes user connections, Warm Standby DSI threads along with normal data distribution.

For the purpose of this discussion, we will be focusing strictly on the SQM thread which does the writing to the queue. The SQM write processing logic is similar to the following:

1. Waits for a message to be placed on the write queue

2. Flushes the current block to disk if:

a. Message on queue is a flush request

b. Message on queue is a timer pop AND there is a queue reader present

c. Message on queue is a timer pop AND the current wait time exceeds “init_sqm_write_max_delay”

d. The current block is full

3. Adds message to current block

The flushing logic (where the physical I/O actually occurs) is performed in the following steps:

1. Attempts platform-specific async write

2. If retry indicated, yields then tries again

3. Once the write request is successfully posted, places write result control block on the AIO Result daemon message queue and sleeps

4. Expects to be awakened by the AIO Result daemon when that thread processes this one’s async write result

5. Awakens any SQM Read client threads waiting for a block to be written

It is important to note the distinction – the SQM actually writes the block to disk and then simply tells the dAIO thread to monitor for that I/O completion. The dAIO detects the completion by using standard asynchronous I/O polling techniques and, when the I/O has completed, wakes up the SQM, which can then update the RSSD with the last OQID in the block that was written. This ensures system recoverability, as it is this OQID that is returned to the RepAgent when a new truncation point is requested (as described earlier). This is illustrated as follows:

Figure 17 – SQM Thread Processing

SQM Performance Analysis

One of the best and most frequent commands for SQM analysis is the admin who, SQM command (sample output below extracted from Replication Server Reference Guide).

admin who, sqm

Spid State             Info
---- ----------------- ----
14   Awaiting Message  101:0 TOKYO_DS.TOKO_RSSD
15   Awaiting Message  101:1 TOKYO_DS.TOKYO_RSSD
52   Awaiting Message  16777318:0 SYDNEY_RS
68   Awaiting Message  103:0 LDS.pubs2

Duplicates Writes Reads Bytes
---------- ------ ----- -----
0 0 0 0 8867 9058 0 0.1 2037 2037 0 0.1.0 0 0

B Writes B Filled B Reads B Cache Save_Int:Seg
-------- -------- ------- ------- ------------
0 0 0 0:0 0 34 44 2132 0:33 0 3 54 268 0:4 0 0 23 0 strict:O

First Seg.Block Last Seg.Block Next Read
--------------- -------------- ---------
0.1             0.0            0.1.0
33.10           33.10          33.11.0
4.12            4.12           4.13.0
0.1             0.0            0.1.0

Readers Truncs
------- ------
1       1
1       1
1       1
1       1


Now that we understand how Replication Server allocates space (1MB allocations) and performs I/O (16K blocks – 64 blocks per 1MB), the above starts to make a bit more sense. Although a more detailed discussion is in the Reference Guide, a quick summary of the output is listed here for easy reference.

Column Meaning

Spid RS internal thread process id – equivalent to ASE’s spid

State Current state of the SQM thread. If it shows “Awaiting Message”, it is caught up and not necessarily part of the problem. However, if the state shows “Active” or “Awaiting I/O”, the SQM is busy writing data to/from disk.

Info Queue id and database connection for queue

Duplicates Number of LTL records judged as already received – this can increase at Rep Agent startup, but if it continues to increase, it is a sign of someone recovering the primary database without adjusting the generation id.

Writes Number of messages (LTL rows) written to the queue. If consistently higher than Reads, you will most likely see a backlog develop. If this is the inbound queue and not a warm standby, tuning exec_cmds_per_timeslice may help.

Reads Number of messages read from the queue. This may surge at startup due to finding the next row. However, after startup, if this number starts outpacing Writes by any significant amount, messages are being reread from the queue due to large transactions or an SQT cache that is too small.

Bytes Number of actual bytes written to queue. The efficiency of the block usage can be calculated by dividing “Bytes” by “B Writes”. Obviously if the blocks were always full, the result would be close to 16K. However, in normal processing, this is often not the case as transactions tend to be more sporadic in nature. The most useful uses of this column are to track bytes/min throughput and to explain why the queue usage may be different than estimated (i.e. low block density).

B Writes Number of 16K blocks written to queue

B Filled Number of 16K blocks written to queue that were full

B Reads Number of 16K blocks read from queue

B Cache Number of 16K blocks read from queue that are cached

Save Int:Seg Save interval in minutes (left of colon) and oldest segment (1MB allocation) for which save interval has not yet expired.

First Seg.Block First undeleted segment and block in the queue.

Last Seg.Block Last segment and block written to the queue. As a result, the size of the queue can be quickly calculated via Last Seg – First Seg (answer in MB)

Next Read The next segment, block and row to be read. If it points to the next block after Last Seg.Block, then the queue is quiesced (caught up). If continually behind, then reading is not keeping up with writes. If Replication Server is behind, a rough idea of the latency can be determined from the amount of queue to be applied ~ Last Seg – Next Read (answer in MB)

Readers Number of readers

Trunc Number of truncation points

In the above table, performance indicators were highlighted. As such, these are only indications – further commands will be necessary to determine exactly what the problem is. A frequent command for inbound queue investigation is admin who, sqt, while for outbound queues it most likely will be a look at the replicate database. Note the word “rough” in the sentence above regarding calculating latency by subtracting Next Read from Last Seg. The reason for the caution is that this method is not exactly accurate. This metric is from the viewpoint of the SQM thread and not the endpoint (DIST or DSI) that we think it is. Prior to the true endpoint, there is a substantial amount of cache, likely in the SQT or DSI (dsi_sqt_max_cache_size), that can be masking the latency. However, if after successive queries the Next Read/Last Seg shows no latency, then it is likely true that no latency exists (the exception is Warm Standby). As we discuss the SQT thread and the DSI SQT module, we will explain in more detail the times and conditions when this could be inaccurate.
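As a rough worked example using hypothetical values: if admin who, sqm showed First Seg.Block = 100.1, Last Seg.Block = 250.32 and Next Read = 180.5.0, the queue would hold roughly 250 - 100 = 150 MB, and roughly 250 - 180 = 70 MB would remain to be read – a rough, SQM-viewpoint backlog. By contrast, in the sample output above, a row whose Next Read (33.11.0) points to the block immediately after Last Seg.Block (33.10) is an example of a caught-up queue.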

SQM Tuning

To control the behavior of the SQM, several configuration parameters are available:

Parameter    RS Version    Meaning

init_sqm_write_delay (Default: 1000; Recommendation: 50)

11.x Write delay for the Stable Queue Manager if queue is being read. Init_sqm_write_delay should be less than init_sqm_write_max_delay. Given that IO operations today are in the low ms range, this default value probably should be lowered – see next configuration for rationale.

init_sqm_write_max_delay (Default: 10000; Recommendation: 100)

11.x The maximum write delay for the Stable Queue Manager if the queue is not being read. Given that IO operations today are in the low ms range, this should be lowered. The likely cause of waiting for the queue to be read would be rescanning for large transactions. If we allow up to a 10 sec delay due to rescanning a large transaction, we will excessively delay Replication Agent processing and have a bigger impact on the system overall.

sqm_recover_segs (Default: 1; Recommendation: 10)

12.1 Controls how often the SQM updates rs_oqid’s. By increasing, the SQM will write less frequently, improving throughput, but lengthening the recovery time due to more segments needing to be analyzed during recovery.

sqm_warning_thr1 (Default: 75;Min: 1; Max: 100)

11.x Percent of partition segments (stable queue space) to generate a first warning. The range is 1 to 100.

sqm_warning_thr2 (Default: 90;Min: 1; Max: 100)

11.x Percent of partition segments used to generate a second warning. The range is 1 to 100.

sqm_warning_thr_ind (Default: 70;Min: 51; Max: 100)

11.x Percent of total partition space that a single stable queue uses to generate a warning. The range is 51 to 100.

sqm_write_flush (Default: “on”; Recommendation: “off”)

12.1 Specifies whether or not writes to memory buffers are flushed to the disk before the write operation completes. Values are "on" and "off." Essentially allows file system devices to be used safely (ala ASE’s dsync option).

The first two take a bit of explaining. The stable queue manager waits for at least init_sqm_write_delay milliseconds for a block to fill before it writes the block to the correct queue on the stable device - or if the queue is being read, it will delay writing by this initial delay. Of course, this is the initial wait time. When the delay time has expired, the SQM writer will check if there are actually readers waiting for this block. If there are no readers waiting for the block, and the block is not full, then SQM will adjust this time and make it longer for the next wait time. The other option is that the queue is still being read - which again causes the SQM to double the time and wait before it again tries to write. To realize what this means, you have to remember that the reader for the block typically will be the SQT, DSI or RSI threads. If the reader is caught up, then it is in fact waiting for the disk block, and the SQM needs to close the block so that the reader can access it immediately. However, if the reader is behind and is still processing previous blocks, then they will not be waiting for this block and consequently, the SQM can wait a bit longer to see if the block can be filled before flushing it to disk. The downside is that if the SQT is completely caught up, then it will be frequently attempting to read from the write block, delaying rows from being appended to it.

You may want to change this parameter if you have special latency requirements and the updates to the primary database are done in bursts. To get the smallest possible latency you’ll have to set init_sqm_write_delay to 100 or 200 milliseconds and batch_ltl to false (sp_config_rep_agent). Decreasing init_sqm_write_delay will cause more I/O to occur, as a small init_sqm_write_delay will write blocks that are not filled completely. This will fill up the stable queue faster with less dense blocks. However, for increased throughput, you may wish to increase this parameter in bursty environments with low transaction rates to ensure more full blocks are written and consequently less I/O is required to read/write the queue. A better solution than increasing this parameter is to simply ensure that batch_ltl is on at the Rep Agent (if on, the Rep Agent sends an ltl_buffer_size block of LTL; due to normalization, this may occupy less space in the queue, but under normal circumstances it will be sufficient). Increasing this value in situations in which the transactions do not quite fill up a full block, but are rather bursty, may degrade performance as the Rep Agent effectively has a synch point with the SQM – basically another block cannot be forwarded until the first one is on disk. The key here is that this is how long the SQM will wait before writing to the queue if the DSI, RSI or SQT threads are active, to ensure full blocks. This is important – it means that the SQM will delay writing partially full blocks when the SQT is busy reading – consequently:

• A large transaction that is removed from the SQT cache and is being re-read (and keeping the SQT busy reading) may reduce throughput as it is likely that once the block is full, it will have to be flushed, forcing the SQT to read it from disk vs. from cache.

• If the SQT is completely caught up, the rapid polling read cycle against the SQM write block will cause the SQM to delay appending new rows to the block - delaying RepAgent User throughput.

The other important aspect is that the configuration value is the initial wait time. Each time RS hits init_sqm_write_delay, it will double the time up to init_sqm_write_max_delay. As a result, after RS has been in operation for any length of time, it is likely that the real delay in writing to the queue when the queue is being read is init_sqm_write_max_delay and not init_sqm_write_delay. As a consequence, in many systems it is a good idea to reduce init_sqm_write_max_delay.
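A sketch of lowering both values (these appear to be server-level parameters – verify the scope and syntax for your RS version; the values shown are the recommendations from the table above):

    configure replication server
        set init_sqm_write_delay to '50'
    go
    configure replication server
        set init_sqm_write_max_delay to '100'
    go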

The question some may ask is what happens if other replicated rows arrive from the Replication Agent. Note that this delay does not mean the SQM is “sleeping” - if the block is not full, the SQM at the end of the “wait” cycle will check to see if there are more write requests. If so, it will append them to the block. Once the block is full and the wait has expired, the SQM will flush it to disk.

On the other hand, init_sqm_write_max_delay is how long a block will be held due to the fact that the DSI, RSI or SQT threads are suspended and not reading from the queue or the reader was not waiting for the block so the SQM delayed past init_sqm_write_delay. A flush to the queue is guaranteed to happen after waiting for init_sqm_write_max_delay. This is the final condition if a block wasn’t written yet because of a full condition or the init_sqm_write_delay. This parameter has to do more with when the block will be flushed from memory. If the RS is fully caught up, the SQM readers (when up) may be requesting to read the same disk block as was just written. The SQM cheats and simply reads the block from cache. However, if the SQM reader is not up or is lagging, this parameter controls how long the SQM will keep the block in cache waiting for the reader to resume or catch up.

These seem confusing, but consider the following scenario:

1. SQM begins receiving LTL rows and begins to build a 16K block. Assuming the DSI, RSI or SQT are up and the SQT is actively reading the queue, it waits init_sqm_write_delay before writing the current block to disk.

2. Init_sqm_write_delay expires, so block is written to disk. However, the block is still cached in memory of the SQM. If the block was not full and the readers were not waiting for it, the next block will wait longer (to a maximum of init_sqm_write_max_delay).

3. DSI, RSI, or SQT reads the next block. If RS is fully caught up, the block it is requesting is the one just written. To avoid unnecessary disk I/O, the block is simply read from cache vs. the copy flushed to disk.

Now, a little bit different. Let’s kill the SQM reader (i.e. suspend the DSI or suspend distribution (the DIST thread starts/stops the SQT thread)).

1. SQM begins receiving LTL rows and begins to build a 16K block.

2. init_sqm_write_delay expires; however, the readers are not up, so the block is not flushed to disk unless it is full.

3. If the reader comes back up within init_sqm_write_max_delay, it is able to retrieve the block from the SQM cache as discussed above, provided the next block to read is the current block.

4. If the reader does not come back up within init_sqm_write_max_delay, the block is flushed to disk regardless of full status. The reader will have to do a physical I/O to retrieve the disk block.

Finally, let’s consider what likely happens in real life. Let’s assume we have a system that is being updated 10 times per second during normal working hours, but is quiescent on weekends and evenings. Assume the default settings and that the rows are 1KB each – so it will take 16 rows to fill a block.


1. RS is booted/re-booted on a weekend. Since there is no activity, after a short time init_sqm_write_delay is doubled from its initial 1-second delay up to init_sqm_write_max_delay (10 seconds).

2. As activity starts, the first rows arrive – since the block is not full, the SQM delays writing the block (the timer will expire in 10 seconds).

3. At slightly more than 1.5 seconds, enough rows have arrived that the block is full. Even though the timer has not expired, the block will be flushed to disk.

4. A new block is allocated and the timer reset to 0.

5. The process repeats, with an SQM block being written at a rate of 1 every ~1.5 seconds.

What happens if the transaction rate slows to 1 per second? At 1KB rows and 16KB blocks, if we waited for a full block we’d wait ~16 seconds before the block flushed. But since we have a timer, the block will be flushed at init_sqm_write_max_delay regardless of whether or not it is full. So, every 10 seconds, we would be flushing a block containing 10 rows of data. Someone looking at the replicate database might notice the 10 second delay, make some wrong assumptions about the cause, and try tuning different areas of RS – especially if they have a desire to see RS latency in the 1-2 second range. And that is why it probably is useful to reduce init_sqm_write_max_delay for low throughput systems – while the blocks will be flushed nearly empty, the latency will be reduced. For example, if we used a value of 1 second, each block would only contain 1 row of data at 1 transaction per second activity rates.

Increasing init_sqm_write_max_delay beyond 10 seconds is probably not useful. If the SQM reader (DSI, RSI or SQT) is down for any length of time, the Rep Agent or DIST will still be supplying data to the SQM. As a result, the block will in all likelihood fill and get flushed to disk. Consequently, it is more probable that the queue will begin to back up if the SQM reader is down, necessitating a physical I/O. The only time increasing this may make sense is when increasing init_sqm_write_delay to greater than 10,000ms – a very rare situation in which queue space may be at a premium and write activity is very low in the source system.

Generally speaking, reducing both init_sqm_write_delay and init_sqm_write_max_delay can help. However, keep this in mind: if the SQM ‘waits’ too long, the cache of write requests (exec_sqm_write_request_limit) will fill and the RepAgent User will be forced to wait. This will show up as a RAWriteWait event (in RS 12.6 – in RS 15.0, the counter_obs for RAWriteWaitTime will be incremented). Consequently, reducing these values when there are no RAWriteWaits is likely not going to help. However, if there are RAWriteWaits and you have already maximized exec_sqm_write_request_limit, you could try decreasing these values as well as looking at the cumulative writes (in MB) for all the queues on the same disk partition, or look at sqm_recover_segs to see if you can speed up the SQM processing.

Normal SQM processing is fairly fast – however, at some point the end of the current 1MB segment will be reached. At that point, the SQM will need to allocate a new segment. While this sounds easy, the SQM actually has to do a bit of checking. Whenever a segment is full and a new one is allocated, the SQM does the following:

1. Update rs_oqid with the last oqid processed for the segment

2. Check if there is space on the current partition being used

3. Check to see if the current partition has been marked to be dropped

4. Check if a new disk_affinity setting specifies a different location

5. Update the disk partition map and allocate the new segment

If a large number of connections exist, or in a high volume system, you may wish to adjust sqm_recover_segs. By increasing this value, the SQM will update rs_oqid less frequently. Note that the SQM does not currently update the RSSD with every block anyhow, so adjusting it from 1 to 2 may not show any appreciable impact. Also, be aware that increasing this parameter may also increase recovery time after a shutdown, hibernation, or any other event that suspends SQM processing. However, setting this value to 10 can help, as SQM flushes to the RSSD are reduced yet the most that will have to be scanned for recovery is 10 segments (~10MB). Much like changing the Replication Agent scan_batch_size to reduce the updates to rs_locater, the intent here is to reduce the impact of updating the RSSD – not that the RSSD can’t handle the load, but since this is done inline with RS processing, updates to the RSSD have the added effect of degrading RS throughput at that point in time. Additionally, remember that this reduces the updates to rs_oqid only – during a segment allocation, the other steps will still have to be performed (but the time to do so will likely be nearly cut in half).
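A sketch of making that change (a server-level parameter; syntax should be verified for your RS version):

    configure replication server
        set sqm_recover_segs to '10'
    go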

From a performance perspective, the most common way the SQM contributes to performance issues is simply if it can’t write to disk fast enough. Other than the “lucky” instances where you might see the state column in the admin who, sqm command stating “Awaiting I/O”, this may be difficult to detect, as the bytes written to the queue may be more than what was written to the transaction log. However, if you see that the transaction log’s rate exceeds the SQM rate, it may be an indication that the Rep Agent is not able to keep up. From an input standpoint, the SQM write is likely the largest cause of Replication Agent latency – however, the biggest probable cause of overall latency is likely at the DSI, so concentrating on this is likely not going to help reduce overall latency much.

From a write speed aspect, remember that a stable device may be used by more than one connection. Consequently, if experiencing a high write rate on one or more connections, it is likely advisable to use disk_affinity to spread the writes across different devices for different connections. This includes separating inbound and outbound queues as well.
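A sketch of doing so (the connection and partition names are hypothetical, and the partitions must already exist in the Replication Server):

    alter connection to NY_DS.trading
        set disk_affinity to 'partition_A'
    go
    alter connection to NY_DS.accounting
        set disk_affinity to 'partition_B'
    go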

SQM Monitor Counters

SQM Thread Monitor Counters

In RS 12.1 and 12.5 there was only a single group of counters that applied to the SQM thread. In 12.6, this was supplemented by adding counters from the SQM Reader and some of the SQM module counters were shifted to the SQM Reader module counters (listed as deprecated/obsolete in the counter description as you will see below). While the former still use the module name of SQM, the latter use the SQMR module.

The SQM module thread counters for RS 12.6 are:

Counter Name Explanation

AffinityHintUsed Total segments allocated by an SQM thread using user-supplied partition allocation hints.

BlocksFullWrite Total number of full blocks written by an SQM thread. Individual blocks can be written due either to block full state or to sysadmin command 'show_queue' (only one message per block).

BlocksRead Obsolete. See CNT_SQMR_BLOCKS_READ.

BlocksReadCached Obsolete. See CNT_SQMR_BLOCKS_READ_CACHED.

BlocksWritten Total number of 16K blocks written to a stable queue by an SQM thread

BPSaverage Average byte deliver rate to a stable queue.

BPScurrent Current byte deliver rate to a stable queue.

BPSmax Maximum byte deliver rate to a stable queue.

BytesWritten Total bytes written to a stable queue by an SQM thread.

CmdSizeAverage Average command size written to a stable queue.

CmdsRead Obsolete. See CNT_SQMR_COMMANDS_READ.

CmdsWritten Total commands written into a stable queue by an SQM thread.

Duplicates Total messages that have been rejected and ignored as duplicates by an SQM thread.

SegsActive Total active segments of an SQM queue: the number of rows in rs_segments for the given queue where used_flag = 1.

SegsAllocated Total segments allocated to a queue during the current statistical period.

SegsDeallocated Total segments deallocated from a queue during the current statistical period.

SleepsStartQW Total srv_sleep() calls by an SQM Writer client due to waiting for SQM thread to start.

SleepsWaitSeg Total srv_sleep() calls by an SQM Writer client due to waiting for the SQM thread to get a free segment.

SleepsWriteDRmarker Total srv_sleep() calls by an SQM Writer client while waiting to write a drop repdef rs_marker into inbound queue.

SleepsWriteEnMarker Total srv_sleep() calls by an SQM Writer client while waiting to write an enable rs_marker into the inbound queue.

SleepsWriteQ Obsolete. See CNT_SQMR_SLEEP_Q_WRITE.


SleepsWriteRScmd Total srv_sleep() calls by an SQM Writer client while waiting to write a special message, such as synthetic rs_marker.

TimeAveNewSeg (intrusive)

Average elapsed time, in 100ths of a second, to allocate a new segment. Timer starts when a segment is allocated. Timer stops when the next segment is allocated.

TimeAveSeg (intrusive)

Average elapsed time, in 100ths of a second, to process a segment. Timer starts when a segment is allocated or RepServer starts. Timer stops when the segment is deleted.

TimeLastNewSeg (intrusive)

The elapsed time, in 100ths of a second, to allocate a new segment. Timer starts when a segment is allocated. Timer stops when the next segment is allocated.

TimeLastSeg (intrusive)

Elapsed time, in 100ths of a second, to process a segment. Timer starts when a segment is allocated or RepServer starts. Timer stops when the segment is deleted. Includes time spent due to save interval, so care should be taken when attempting to time RS speed using this counter.

TimeMaxNewSeg (intrusive)

The maximum elapsed time, in 100ths of a second, to allocate a new segment. Timer starts when a segment is allocated. Timer stops when the next segment is allocated.

TimeMaxSeg (intrusive)

The maximum elapsed time, in 100ths of a second, to process a segment. Timer starts when a segment is allocated or RepServer starts. Timer stops when the segment is deleted. Includes time spent due to save interval, so care should be taken when attempting to time RS speed using this counter.

UpdsRsoqid Total updates to the RSSD..rs_oqid table by an SQM thread. Each new segment allocation may result in an update of oqid value stored in rs_oqid for recovery purposes.

WriteRequests Total message writes requested by an SQM client.

WritesFailedLoss Total writes failed by an SQM thread due to loss detection, SQM_WRITE_LOSS_I, which is typically associated with a rebuild queues operation.

WritesForceFlush SQM writer thread has forced the current block to disk when no real write request was present. However, there is data to write and we were asked to do a flush, typically by quiesce force RSI or explicit shutdown request.

WritesTimerPop SQM writer thread initiated a write request due to timer expiration.

XNLAverage Average size of large messages written to a stable queue.

XNLInterrupted Obsolete. See CNT_SQMR_XNL_INTR.

XNLMaxSize The maximum size of large messages written so far.

XNLPartials Obsolete. See CNT_SQMR_XNL_PARTIAL.

XNLReads Obsolete. See CNT_SQMR_XNL_READ.

XNLSkips Total large messages skipped so far. This only happens when site version is lower than 12.5.

XNLWrites Total large messages written successfully so far. This does not count skipped large message in mixed version situation.

Replication Server 15.0 has slightly different SQM counters:

Counter Name Explanation

CmdsWritten Commands written into a stable queue by an SQM thread.


BlocksWritten Number of 16K blocks written to a stable queue by an SQM thread

BytesWritten Bytes written to a stable queue by an SQM thread.

Duplicates Messages that have been rejected and ignored as duplicates by an SQM thread.

SleepsStartQW srv_sleep() calls by an SQM Writer client due to waiting for SQM thread to start.

SleepsWaitSeg srv_sleep() calls by an SQM Writer client due to waiting for the SQM thread to get a free segment.

SleepsWriteRScmd srv_sleep() calls by an SQM Writer client while waiting to write a special message, such as synthetic rs_marker.

SleepsWriteDRmarker srv_sleep() calls by an SQM Writer client while waiting to write a drop repdef rs_marker into inbound queue.

SleepsWriteEnMarker srv_sleep() calls by an SQM Writer client while waiting to write an enable rs_marker into the inbound queue.

SegsActive Active segments of an SQM queue: the number of rows in rs_segments for the given queue where used_flag = 1.

SegsAllocated Segments allocated to a queue during the current statistical period.

SegsDeallocated Segments deallocated from a queue during the current statistical period.

TimeNewSeg The elapsed time, in 100ths of a second, to allocate a new segment. Timer starts when a segment is allocated. Timer stops when the next segment is allocated.

TimeSeg Elapsed time, in 100ths of a second, to process a segment. Timer starts when a segment is allocated or RepServer starts. Timer stops when the segment is deleted.

AffinityHintUsed Segments allocated by an SQM thread using user-supplied partition allocation hints.

UpdsRsoqid Updates to the RSSD..rs_oqid table by an SQM thread. Each new segment allocation may result in an update of oqid value stored in rs_oqid for recovery purposes.

WritesFailedLoss Writes failed by an SQM thread due to loss detection, SQM_WRITE_LOSS_I, which is typically associated with a rebuild queues operation.

WritesTimerPop SQM writer thread initiated a write request due to timer expiration.

WritesForceFlush SQM writer thread has forced the current block to disk when no real write request was present. However, there is data to write and we were asked to do a flush, typically by quiesce force RSI or explicit shutdown request.

WriteRequests Message writes requested by an SQM client.

BlocksFullWrite Number of full blocks written by an SQM thread. Individual blocks can be written due either to block full state or to sysadmin command 'show_queue' (only one message per block).

CmdSize Command size written to a stable queue.

XNLWrites Large messages written successfully so far. This does not count skipped large message in mixed version situation.

XNLSkips Large messages skipped so far. This only happens when site version is lower than 12.5.

XNLSize The size of large messages written so far.

SQMWriteTime The amount of time taken for SQM to write a block.


Note again that many of the averages, etc., have been removed. However, one new counter of interest is SQMWriteTime. While a byte rate is possibly useful, this counter may help as it shows how long each 16K I/O takes for a full block.

Regardless, the SQM counter values can be viewed in at least two different ways. First, the normal approach is to compare the current sample’s values with the previous interval’s. This establishes an idea of the rate of a single activity. For example, CmdsWritten compared across intervals could demonstrate a rate (when normalized) of 100 commands/second. If the primary activity was a bcp running at 200 rows/second, the obvious implication is that the RepAgent can only read that particular table’s rows out at half the speed of the bcp; consequently, the replication to other destinations will take at least twice as long as the original bcp.

The second way of comparing the counters is to compare multiple counters within the same sample interval. In the above list, there are a number of counters that, when compared with their counterparts, can provide insight into the possible causes of performance issues. For instance, consider the following:

RAWriteWaitPct = RAWriteWaits / WriteRequests
CmdsWritten, CmdSizeAverage
BlocksFullPct = BlocksFullWrite / BlocksWritten
SegsActive, SegsAllocated, SegsDeallocated
UpdsRsoqidSec = UpdsRsoqid / Sec
RecoverSeg = SegsAllocated / UpdsRsoqid

The first counter (RAWriteWaitPct) is a derived value, taking the RAWriteWaits from earlier and dividing it by the number of SQM WriteRequests. This tells us a rough percentage of the time that the RA had to wait in order to write. Even a low value such as 5-10% could be indicative of a problem once you realize that the default init_sqm_write_delay is 1 second – which causes the ASE RepAgent to have to wait. The key to all this is realizing that the SQM writes/reads 16K blocks (not configurable). So, by default, the RepAgent User thread will be forced to go to sleep once its outstanding write requests have exceeded what the SQM Writer can pack into one block. Given that the inbound queue often has a 2-4x space explosion, this can literally mean that for every 4-8KB of log data, the RepAgent User is forced to wait – which in turn forces the RepAgent to stop scanning. Fortunately for most people, since they have not adjusted exec_sqm_write_request_limit from the default of 16384, increasing it to the maximum of 983,040 (60 16K blocks) provides a lot more cushioning for the RepAgent User to keep processing write requests before it is forced to sleep by the SQM.

The next sequence of counters (CmdsWritten, CmdSizeAverage) tells us how many commands actually were written into the queue and should compare with CmdsTotal from the RA – although it may not be exactly equal as purge commands during a recovery, etc. are not written to the queue. CmdSizeAverage is the first place that we get a look at how big each command is from the source when packed into SQM format. However, for an outbound queue, this could be different as the same outbound queue may be receiving transactions from more than one source (corporate rollup implementation), consequently you may not be able to directly compare the CmdsWritten to DIST counter values. Where a single connection is involved, however, it can be useful.

The next two sets (BlocksFullPct and SegsActive, SegsAllocated, SegsDeallocated) are ones to watch, but ones you really can’t do much about. In most busy systems, BlocksFullPct will likely be 100%, as every block is written when full vs. on the timer pop. Numbers less than 100% indicate that not a lot of commands are coming into RS on a throughput basis. The others – all the SegsActive, etc. counters – are more for just tracking the space utilization – although ideally, the goal is to see SegsAllocated and SegsDeallocated matching. However, while this is a way of tracking disk space utilization, it shouldn’t be used as an indication of latency (it could be – but it also could be just due to something else).

The next two (UpdsRsoqidSec and RecoverSeg) are related and likely a big factor in the performance of the SQM. As you will notice, once again we are updating the OQID in the RSSD as we track our progress. However, in this case we are concerned about the speed of recovery for RS. When RS is restarted or a connection resumed, RS uses the OQID from the RSSD to locate the current segment and block. The more frequently this is updated, the less RS has to scan from the point the RSSD was last updated to the current working location. Again, just like with the Rep Agent scan batch size, you need to look at this realistically. A sub-second recovery interval is likely overkill – and yet most DBAs are surprised to find out that during busy periods, they are updating the OQID in the RSSD 2-3 times per second…and this is just the inbound queue. When you add in the outbound queue and multiply across the number of connections, you can see where the updates to the RSSD are a lot higher than we would like. Adjusting sqm_recover_segs from its default of 1 to 10 or another value and watching both UpdsRsoqidSec and RecoverSeg to fine tune it is likely a good course of action.
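A hedged sketch of pulling the raw numbers for these ratios out of the RSSD (the rs_statdetail/rs_statcounters table and column names vary between RS 12.6 and 15.0 – treat the column names below as assumptions and verify them against your RSSD before use):

    -- sum the observations per sample run for the counters used in the ratios above;
    -- divide the pairs (e.g. SegsAllocated / UpdsRsoqid) externally to get RecoverSeg, etc.
    select  d.run_id, c.counter_name, sum(d.counter_obs) as total_obs
    from    rs_statdetail d, rs_statcounters c
    where   d.counter_id = c.counter_id
      and   c.counter_name in ('WriteRequests', 'BlocksFullWrite', 'BlocksWritten',
                               'SegsAllocated', 'SegsDeallocated', 'UpdsRsoqid')
    group by d.run_id, c.counter_name
    order by d.run_id, c.counter_name
    go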


SQMR Counters

After describing where these counters are located, you might think they are in the wrong location. The SQMR counters actually refer to the SQM code executed by the reader. For the inbound queue, the readers are the SQT and/or the WS DSI threads. For the outbound queue, it will either be a DSI or an RSI thread. These can be distinguished via the counter structures. For instance, a Warm Standby that doesn’t have distribution disabled or is replicating to a third site will have both a DSI set of SQMRs (for the Warm Standby DSI, which reads from the inbound queue) and an SQT set of SQMRs. From the earlier table, we saw that in rs_statdetail, these would have the instance_val column value of 11 for the SQT SQMR and 21 for the WS-DSI SQMR. As a result, the counters below are actually from the respective reader thread in RS 12.6 and 15.0 and not actually part of the SQM thread. However, in queue processing we are often comparing the read rate to the write rate, and given the name, we will discuss them here.

First let’s look at the counters from RS 12.6:

Counter Explanation

CmdsRead Total commands read from a stable queue by an SQM Reader thread.

BlocksRead Total number of 16K blocks read from a stable queue by an SQM Reader thread.

BlocksReadCached Total number of 16K blocks from cache read by an SQM Reader thread.

SleepsWriteQ Total srv_sleep() calls by an SQM read client due to waiting for the SQM thread to write.

XNLReads Total large messages read successfully so far. This does not count partial message, or timeout interruptions.

XNLPartials Total partial large messages read so far.

XNLInterrupted Number of interruptions so far when reading large messages with partial read. Such interruptions happen due to time out, unexpected wakeup, or nonblock read request which is marked as READ_POSTED.

SleepsStartQR Total srv_sleep() calls by an SQM Reader client due to waiting for SQM thread to start.

Similar to the SQM counters, RS 15.0 has a few modifications for SQM Readers as well.

Counter Explanation

CmdsRead Commands read from a stable queue by an SQM Reader thread.

BlocksRead Number of 16K blocks read from a stable queue by an SQM Reader thread.

BlocksReadCached Number of 16K blocks from cache read by an SQM Reader thread.

SleepsWriteQ srv_sleep() calls by an SQM read client due to waiting for the SQM thread to write.

XNLReads Large messages read successfully so far. This does not count partial message, or timeout interruptions.

XNLPartials Partial large messages read so far.

XNLInterrupted Number of interruptions so far when reading large messages with partial read. Such interruptions happen due to time out, unexpected wakeup, or nonblock read request which is marked as READ_POSTED.

SleepsStartQR srv_sleep() calls by an SQM Reader client due to waiting for SQM thread to start.

SQMRReadTime The amount of time taken for SQMR to read a block.

SQMRBacklogSeg The number of segments yet to be read.

SQMRBacklogBlock The number of blocks within a partially read segment that are yet to be read.


The last three (which are new in RS 15.0) are interesting. The problem with the SQMR counters in 12.6 is that they could not be used to derive a relative latency. While the SQM counters SegsAllocated, SegsDeallocated, and SegsActive would appear to give that information, the issue is that a segment is active until it is deallocated. Since deallocation has a lower priority, a segment could have been read a long time before it is deallocated. These new counters – particularly the Backlog counters – can be used much like the admin who, sqm Next Read and Last Seg.Block columns to determine a latency. Even better, once the number of segments in the backlog is obtained, SQMRReadTime can be used as a means of determining the length of time it will take to read the backlog at the current rate (although this is likely an idealistic number).
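As a rough worked example with hypothetical values: if SQMRBacklogSeg = 12 and SQMRBacklogBlock = 20, the backlog is roughly 12 x 64 + 20 = 788 blocks (about 12.3 MB at 16K per block). If SQMRReadTime averages 1 (i.e. 1/100th of a second) per block, reading that backlog would take on the order of 788 x 0.01 ≈ 8 seconds at the current rate – idealistic, since new writes keep arriving while the backlog is drained.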

One aspect to remember is that if a transaction is removed from SQT cache due to size, the SQMR may have to re-read significant numbers of blocks to re-create it later. Keeping this in mind, the best counters to consider for the SQMR include:

CmdsRead
BlocksReadCachedPct = BlocksReadCached / BlocksRead
SleepPct = SleepsWriteQ / BlocksRead

Ideally, of course, we would like to see CmdsRead equal to the SQM counter CmdsWritten. However, because of rescanning, you may frequently see a much higher value – especially when rescanning large transactions that were removed from the SQT cache.

The next counter (BlocksReadCachedPct) is the most important for inbound queue reading. Ideally we would like to see this higher than 75%, although anything higher than 30% is fine. The cache referred to for queue reads is an unconfigurable 16K of memory that the writer uses to build the next block to be written. If, between the time that the writer requests the block to be written and the time it starts to re-use the memory to build the next block, a reader requests a message from that block, then it is able to “read from cache” rather than from disk. While you would like to see high BlocksReadCachedPct numbers and no RepAgent latency, at the same time if RepAgent latency exists (in ASE), you should be concerned that the writer is not flushing blocks fast enough, so that the reader is constantly having to wait for the next write – see counter SleepsWriteQ. Alternatively, a possible cause is that the writer is constantly waiting on read activity – and when it does, it sleeps init_sqm_write_delay to init_sqm_write_max_delay. So, while reading from cache is ‘good’ for the reader, it could delay the writer. So if BlocksReadCachedPct is high (i.e. 100%) and there is RepAgent latency, you may want to reduce init_sqm_write_delay (and the max) to reduce the sleep time. For the outbound queue, it is most likely that BlocksReadCachedPct will start high and rapidly drop to zero as the backlog in the DSIEXEC causes the DSI to lag far behind in reading the queue vs. the SQM writing.

The final SQMR counter takes a bit of explanation. SleepsWriteQ refers to the number of times the reader was put to sleep while waiting for the SQM to write. This wait is likely caused by the SQMR (SQT or DSI) being caught up and therefore waiting on more data to be written. Consequently, this is best looked at in conjunction with the (SQM) BlocksWritten counter (earlier) – but expressed as a ratio of how often it had to sleep for each block read. For the inbound queue, this number (SleepPct) should be in the 300%-700% range – as long as BlocksRead is nearly identical to BlocksWritten (or there is a decent BlocksReadCachedPct). This indicates that the SQMR is caught up. If the SQT starts to lag and reading then gets behind, this ratio might drop. Again, though, one aspect to watch is when the writing seems to be going fine but reading does not look fast enough (usually indicated by the SQT cache not being full and BlocksReadCachedPct < 30%); a cause may be the configuration values sqt_init_read_delay and sqt_max_read_delay. In RS 12.6, these defaulted to 2000ms and 10000ms respectively, which meant that if the reader went to read and it was caught up, it would most likely sleep for 2 or more seconds – now causing it to be behind. This caused so many problems with upgrades to RS 12.6 that in RS 15.0 the defaults for these values were set to 1ms each – which is likely overkill in the other direction and could be causing DIST servicing problems from the SQT. On the other hand, if SleepPct is too high (i.e. constantly >700%), then it is likely that init_sqm_write_delay is too high. What could be happening is that the SQM writes a block, the SQT reads it…forcing the SQM to sleep init_sqm_write_delay before it can write the next one, but the SQT tries to read the next one during that time and is put to sleep for sqt_init_read_delay. You can quickly see how large settings (i.e. the defaults) could cause both the writer and reader to spend a lot of time sleeping vs. doing work – resulting in RepAgent latency (and high RAWriteWaits as exec_sqm_write_request_limit eventually fills).
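A sketch of adjusting the read delays to something between the 12.6 and 15.0 defaults (the values are illustrative only, and the parameter scope/syntax should be verified for your RS version):

    configure replication server
        set sqt_init_read_delay to '100'
    go
    configure replication server
        set sqt_max_read_delay to '1000'
    go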

SQM Thread Counter Usage

Again, it helps to look at the counters in terms of the progression of data through the Replication Server. To see how this works, once again we will take a look at the customer data used earlier in the RepAgent User Thread discussion.

1. The first thing that happens is that the SQM Writer client puts a write request message on the internal message queue (as discussed in the earlier section detailing the OpenServer structures). This increments the WriteRequests counter. The counters BPSaverage, BPScurrent, and BPSmax effectively measure the bytes per second rate of delivery of the write requests to the SQM while CmdSizeAverage records the average size of the commands in the write requests to the SQM.

2. The SQM checks each incoming message to see if it is a duplicate or if a loss was detected.

a. If it is a duplicate, it is discarded, the Duplicates counter is incremented and the SQM starts processing the next write request.

b. If loss was detected, typically the processing suspends. This can be overridden through a ‘rebuild queues’ command. Writes issued by such maintenance activities will cause the WritesFailedLoss counter to be incremented.

3. The SQM is continuously performing space management activities. As new requests come in, it may have to allocate additional segments, incrementing the SegsAllocated counter.

a. If the new segment is allocated according to the disk affinity setting, the counter AffinityHintUsed is incremented.

b. If intrusive counters are enabled, the time is measured from the last new segment allocated and the counters TimeAveNewSeg, TimeLastNewSeg, and TimeMaxNewSeg are updated accordingly. These counters are interesting in that they show the time it takes for each 1MB segment to be allocated, populated, and written to disk – in other words, in a steady-state high volume system, they effectively demonstrate the disk throughput (milliseconds per 1MB segment). In low volume systems, these counters are likely not as useful because the write request rate may not drive new segments to be allocated fast enough.

c. Depending on the configuration value of sqm_recover_segs, the new segment allocation may have to update the OQID in the RSSD. If this happens, the counter UpdsRsoqid is incremented. If this value is fairly high and SQM write speed is throttling the EXEC or DIST rate, you may want to adjust the sqm_recover_segs configuration to reduce it (a configuration sketch follows this list).

d. If the SQM has to wait for the segment allocation, the counter SleepsWaitSeg is incremented. While there is no counter that tracks how long it waits, the time is built in to the above counters (TimeAveNewSeg, etc)

e. Since a segment is allocated only when needed, the counter SegsActive is incremented, indicating the number of segments that contain undelivered commands.

4. Now that the SQM has space it can use to write to, it receives the command records and begins filling out a 16K block in memory. This causes several counters to be affected, including CmdsWritten and in some situations others as discussed below.

a. If the command was a replication definition or subscription marker (rs_marker), or a synthetic rs_marker, the SQM has to process these records, so it sleeps while the enablement or disablement occurs. This increments the SleepsWriteDRmarker, SleepsWriteEnMarker, and SleepsWriteRScmd accordingly. High values here may indicate that the maintenance activity is affecting throughput.

b. If the message is considered to be large (i.e. corresponds to XNL Datatypes), the XNL related counters are affected.

i. If the RS site version configuration value is less than 12.5, the message is skipped and the XNLSkips counter is incremented. This is useful to detect a bad configuration when the replicate is getting out of sync on tables using XNL Datatypes.

ii. If the RS site version is 12.5 or greater, the XNLWrites, XNLMaxSize, and XNLAverage counters are incremented.

5. Eventually, the block will get flushed to disk (reasons and counters below). Regardless of the reasons, this will cause the counters BlocksWritten, BytesWritten to be incremented.

a. If the block was written to disk because it was full (essentially the next message would not fit in the space that was left), the counter BlocksFullWrite is incremented.

b. If the block was written to disk because the init_sqm_write_delay or init_sqm_write_max_delay write timer expired, the counter WritesTimerPop is incremented. This is an indication that either the SQM is not getting data from the RepAgent User Thread fast enough (i.e. the RepAgent User thread is starved for CPU time), or the inbound stream of data is simply not high volume.

c. If the block was written to disk due to a RS shutdown, hibernation or other maintenance activity that suspends or shuts down the SQM thread, the counter WritesForceFlush is incremented.

6. When an SQM Reader finishes processing its previous command(s), it will attempt to read the next block from the queue or SQM cache. While the block is being filled, it cannot be read by an SQM Reader client (SQT or WS-DSI). If this happens, the SleepsWriteQ counter is incremented. This is an indication that the SQM Reader is reading the blocks at the same rate that they are being written – i.e. it is not lagging behind. However, remember that you may have multiple readers for an inbound queue. One of them (typically the SQT) may be reading fast enough to read the blocks from cache and may be tripping this counter, while the other may be lagging (see the next point below).

7. When the block is read, the counters BlocksRead and BlocksReadCached are incremented accordingly. Obviously, the ratio of BlocksReadCached:BlocksRead is similar to the cache hit ratio in ASE and can indicate when exec_sqm_write_request_limit/md_sqm_write_request_limit are too small – or that an SQM reader is lagging behind (a configuration sketch for these limits follows this list). In cases where there are multiple readers, one may be caught up (and incrementing BlocksReadCached) while the other is lagging. In a strict Warm Standby with no other replicate involved, the SleepsWriteQ and BlocksReadCached values may be the effect of the SQT processing the messages if distribution has not been disabled for the connection. In such cases, disabling the DIST will provide more accurate values for these counters. Otherwise, the admin sqm_readers command or the SegsActive counter can indicate how far the WS-DSI is lagging behind.

8. Once the block is read successfully, the reader parses out the commands. This causes the counter CmdsRead to be incremented. If the message contains XNL data, additional command records may need to be read as follows:

a. For each partial XNL data record read, the XNLPartials counter is incremented.

b. If the XNL data record spans more than one 16K block, the next block will be fetched and processed. However, since the SQM is a single thread, the write timers may have popped, necessitating a write operation. When this happens, the reading of large messages is interrupted and the XNLInterrupted counter is incremented. If you see large values for XNLInterrupted, it may be an indication that large message reading is blocking the SQM writes – which in turn may be slowing down RepAgent processing. If this occurs frequently, you may need to check the replicate_if_changed state of text/image columns or whether their replication is necessary at all. The same could be true for large comment columns – while these may be necessary for WS systems, replicating 16,000 character comment fields to a reporting system in non-WS environments may not be.

c. Once the last row is read for the large message, the counter XNLReads is incremented.

9. Once all the commands have been read from a block and successfully processed, the SQM reader tells the SQM that it is finished with that block. This continues for all 64 blocks in the segment. When all SQM readers signal that they are finished with all the blocks on a particular segment, the segment is marked inactive and the SegsActive counter is decremented.

a. If intrusive counters have been enabled, the timers started when the segment was allocated (3(b) above) are sampled and the TimeAveSeg, TimeLastSeg, and TimeMaxSeg counters are adjusted.

10. Once the segment has been marked inactive and any save interval has expired, the segment is deallocated from the particular queue. This increments the SegsDeallocated counter.
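Two of the knobs referenced in the walkthrough above (steps 3c and 7) are server-wide settings changed via configure replication server. The isql sketch below is illustrative only – the values are examples rather than recommendations, and allowable ranges (and whether a restart is required for them to take effect) vary by RS version:

    -- step 3c: update rs_oqid only once per N segments instead of every segment
    configure replication server set sqm_recover_segs to '10'
    go
    -- step 7: enlarge the write-request caches used by the SQM writer clients
    configure replication server set exec_sqm_write_request_limit to '983040'
    go
    configure replication server set md_sqm_write_request_limit to '983040'
    go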

Let's take a look at some sample data. Again, we will use the customer data as well as the insert stress test – starting with the customer data below. First, we will look at the writing side via the SQM counters (vs. the reading side, which is the SQMR counters). Once again, derived statistics are marked as (derived).

SampleTime CmdsWritten CmdSizeAverage BlocksWritten WritesTimerPop BlocksFullWrite BlocksFull%(derived) SegsAllocated SleepsWaitSeg UpdsRsoqid UpdsRsoqid/sec(derived) Sqm_recover_seg(derived)
0:29:33 268,187 1,655 32,693 2 32,691 99.99 511 0 511 1.6 1

0:34:34 364,705 1,380 36,395 3 36,392 99.99 569 0 569 1.8 1

0:39:37 253,283 1,190 23,664 1 23,663 99.99 370 0 370 1.2 1

0:44:38 266,334 893 18,322 2 18,320 99.98 287 0 287 0.9 1

0:49:40 253,684 1,097 22,907 2 22,903 99.98 358 0 358 1.1 1

0:54:43 164,566 1,723 24,759 0 24,759 100 387 0 387 1.2 1

0:59:45 376,184 1,355 39,865 1 39,862 99.99 623 0 622 2 1

1:04:47 450,809 1,032 34,248 1 34,246 99.99 536 0 535 1.7 1

1:09:50 326,750 1,200 31,783 0 31,783 100 497 0 497 1.6 1

1:14:52 325,340 1,011 25,153 0 25,153 100 393 0 393 1.3 1

1:19:54 317,674 825 19,975 1 19,974 99.99 312 0 312 1 1

Let’s take a look at some of these:

CmdsWritten – This corresponds to the number of commands actually written to the queue. This metric should be fairly close to the RepAgent counter CmdsTotal – although it may not be exact as some RepAgent User thread commands are system commands not written to the queue (such as truncation point fetches). While this may not appear to be as useful given that CmdsTotal is broken down by CmdsApplied, CmdsRequest, CmdsSystem, etc., this value is actually fairly important when looking at read activity and SQMR counters.

CmdSizeAverage – This metric records the number of bytes necessary to store each command. For inserts, this is the after row image, while for updates, both the after row image and the before row image – less identical values when minimal columns is enabled. This metric is useful when trying to determine how wide the rows are being replicated (for space projections) and especially compared to the RepAgent counter PacketsReceived. If the CmdSizeAverage is large – i.e. 2,000 bytes – this could result in a single command per packet being sent using the default packet size. Earlier, we noted that we were getting about 3 RepAgent commands per packet (which includes begin/commit transaction commands) and this metric demonstrates why. At ~1,000 bytes per command, that is all that will fit in the default packet size.
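One common response when wide commands like this limit the number of commands per packet is to raise the RepAgent send buffer size on the primary ASE (touched on in the Replication Agent tuning discussion). A sketch only – the database name is a hypothetical placeholder, the valid size values depend on the ASE version, and the Rep Agent generally must be restarted to pick up the change:

    -- run on the primary ASE; 'pdb' is a hypothetical database name
    sp_config_rep_agent pdb, 'send buffer size', '16K'
    go
    -- restart the Rep Agent so the new buffer size is used
    sp_stop_rep_agent pdb
    go
    sp_start_rep_agent pdb
    go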

WritesTimerPop & BlocksFull% – the second metric is derived by dividing BlocksFullWrite by BlocksWritten. Taken together, these two are a good indication of how busy the input stream to Replication Server is. Any write caused by a timer pop indicates that the SQM block wasn't full, which in turn indicates a lull in activity from the Replication Agent User thread. This system is consistently busy, with very few timer-driven flushes. A non-busy system would likely have a lot more of them and a correspondingly lower full %.

SegsAllocated & SleepsWaitSeg – taken together, these two can illustrate when the segment allocation process is hindering replication performance. The actual cause of the delay could be I/O related, however, it is just as likely to be caused by RSSD performance issues.

UpdsRsoqid/sec – this metric is derived by dividing UpdsRsoqid by the number of seconds between sample intervals. Specifically, again it shows the impact on the RSSD. If we couple this metric with the RepAgent counter UpdsRslocater from above, we are averaging about 2 updates/second. While not a high volume, again, this shows the interruption in RS processing to record recovery information.

Sqm_recover_seg – this metric is derived by dividing SegsAllocated by UpdsRsoqid. Much like the RA ECTS value, this is a good indication of the actual setting of the RS configuration parameter sqm_recover_segs. Adjusting it slightly could improve RS throughput.

Before we look at the SQM read (SQMR) counters, let’s compare this to the insert stress test:

SampleTime CmdsWritten CmdSizeAverage BlocksWritten WritesTimerPop BlocksFullWrite BlocksFull%(derived) SegsAllocated SleepsWaitSeg UpdsRsoqid UpdsRsoqid/sec(derived) Sqm_recover_seg(derived)
11:37:57 1,027 1,465 105 1 104 99.04 2 0 0 0 0

11:38:08 7,788 1,491 817 0 817 100.00 12 0 1 0.1 12

11:38:19 4,512 1,491 471 0 471 100.00 7 0 1 0 7

11:38:30 20,336 1,491 2,120 0 2,120 100.00 33 0 3 0.3 11

11:38:41 553 1,458 57 1 56 98.24 1 0 0 0 0

Again, we see mostly full blocks with the exception of the beginning and end of the test run – which illustrates how WritesTimerPop can be used to indicate a lull in Replication Agent user thread activity. Also note that sqm_recover_segs is set to 10, and the derived value shows the fluctuation induced by averaging across sample periods – for example, the 11:38:08 sample likely had its update to rs_oqid after 8 of its 12 segments (2 from the previous sample period + 8 = 10), with the remaining 4 combined with 6 of the 7 in sample 11:38:19, and so forth.

Now let’s take a look at some read statistics by looking at the SQMR counters. First, let’s view the customer data metrics (note that segment allocation metrics are SQM and not SQMR counters):

SampleTime CmdsWritten(SQM) CmdsRead(SQMR) BlocksRead BlocksReadCached CachedRead% SleepsWriteQ WriteWait% SegsActive SegsAllocated SegsDeallocated
0:29:33 268,187 587,860 73,887 17,996 24.35 40,153 54.34 303 511 621

0:34:34 364,705 947,808 99,781 19,657 19.70 36,035 36.11 38 569 835

0:39:37 253,283 318,611 33,309 11,165 33.51 84,369 253.29 2 370 403

0:44:38 266,334 282,958 19,998 7,615 38.07 79,786 398.96 2 287 287

0:49:40 253,684 277,054 28,017 5,199 18.55 40,364 144.06 25 358 335

0:54:43 164,566 194,386 39,231 8,344 21.26 19,273 49.12 2 387 412

0:59:45 376,184 365,435 43,396 2,462 5.67 19,398 44.69 41 623 583

1:04:47 450,809 522,844 42,165 8,728 20.69 40,419 95.85 57 536 522

1:09:50 326,750 400,065 44,025 7,210 16.37 73,404 166.73 29 497 523

1:14:52 325,340 352,656 32,134 6,438 20.03 73,932 230.07 3 393 422

1:19:54 317,674 317,683 19,975 10,909 54.61 144,828 725.04 2 312 312

Let’s take a look at some of these:

CmdsWritten (SQM) vs. CmdsRead (SQMR) – it looks like the queue is being read a lot more than it is being written, and this is partially true. What has happened is that the SQT cache filled, causing large transactions to be removed from cache. When the commit was finally seen, the SQT had to re-read the entire transaction from disk – and therefore had to re-request those commands from the SQM. Any time the SQMR.CmdsRead counter is appreciably higher than SQM.CmdsWritten, you should look at the SQT metrics, as the SQM is re-scanning the disks. As you will see in some of the later metrics, this has an impact on system performance.

Cached Read % – this metric is derived by dividing BlocksReadCached by BlocksRead. Ideally, we would like this to be in the high percentages, with 100% being perfect and anything in the 90's acceptable. In this case we see rather dismal numbers – largely the fault of all the rescanning. Even when the reader appears to "catch up" (around samples 3, 4 & 5), the cache hit rate is low. The reason is simple: by the time the SQMR had to re-read, the SQM had already flushed those blocks to disk – resulting in physical reads most of the time.

Write Wait % – this metric is derived by dividing SleepsWriteQ by BlocksRead. It is actually an interesting metric. It is desirable that SleepsWriteQ be high – by definition, it counts the times the SQM read client slept while waiting for the SQM write client to write. While normally 100% would be considered "complete", in this case an SQM read client may have to wait more than once for the current block to be written. Consequently, the higher above 100% this value is, the stronger the indication that the SQM read client is caught up to the SQM writer. This will be more evident when looking at the insert stress test metrics. Numbers below 300%, however, seem to indicate latency.

SegsActive – this metric shows how much space is being consumed in the stable queue. Similar to (and in fact the same metric as) admin who, sqm, the number of active segments indicates latency. However, the latency may not be as large as the number of active segments suggests. For instance, between the first two sample periods, the number of active segments drops from 303 to 38. Likely, a large transaction began 300+ segments back – and once it had been successfully read out and distributed, the SQM could then drop those segments (a better description of this process is contained in the SQT processing section regarding the Truncate list). Ideally, low numbers are desirable here.

Now, let's take a look at the same counters from the insert stress test. The only caveat is that the insert stress test was a Warm Standby implementation, so these counters are from the SQM read client for the WS-DSI that was reading from the inbound queue.

SampleTime CmdsWritten(SQM) CmdsRead(SQMR) BlocksRead BlocksReadCached CachedRead% SleepsWriteQ WriteWait% SegsActive SegsAllocated SegsDeallocated
11:37:57 1,027 1,018 105 104 99.04 112 106.7 3 2 0

11:38:08 7,788 7,795 817 759 92.90 7,798 954.5 12 12 3

11:38:19 4,512 4,514 471 466 98.93 3,816 810.2 16 7 4

11:38:30 20,336 20,084 2,093 631 30.14 6,742 322.1 45 33 4

11:38:41 553 575 59 13 22.03 3,509 5,947.5 43 1 4

Comparing this to the above, we notice that the Cached Read % is in the high 90's initially (it drops off later because the DSI is not keeping up – so the Cached Read % is artificially high at the beginning while the DSI SQT cache is being filled). However, note that the Write Wait % is very high – which is desirable. The SegsActive is climbing as the DSI falls behind because the replicate ASE cannot receive the commands fast enough (most often this is the biggest source of latency). This last point is interesting. Nearly all customers who call Sybase Tech Support complaining about latency and think RS is the problem because the "backlog is in the inbound queue" forget that in a Warm Standby they only have an inbound queue – which also functions as the outbound queue.

SQT Processing

The Stable Queue Transaction (SQT) is responsible for sorting the transactions into commit order and then sending them to the Distributor to determine the distribution rules. The following diagram depicts the flow of data through the SQT starting with the inbound queue SQM and the Distributor to the outbound queue.

Figure 18 – Data Flow Through Inbound Queue and SQT to DIST and Outbound Queue

It is good to think of the SQT as just one step in the process between the two queues - and that performance of this ‘pipeline’ of data flowing between the queues depends on the performance of each thread along the path. For this section, we will focus strictly on the SQT thread on the left side of the above diagram.

In early releases of Replication Server, the SQT thread was a common cause of problems because the default SQT cache was only 128KB and DBAs would forget to tune it. Even today's default (1MB) may not be sufficient. Thankfully, this problem is very easy to address by adding cache. Unfortunately, that fix became almost a "silver bullet" that DBAs relied on, simply raising the SQT cache any time there was latency – and then complaining when it no longer helped. Today, if the SQT cache is already above 4-8MB, DBAs should resist raising it further without first confirming that the cache is actually being exceeded. Likely, the problem isn't here – and adding more cache will likely just contribute to the problem at the DSI.

Key Concept #11: SQT cache is dynamically allocated – for small transactions, large amounts of SQT cache will not even be utilized and will result in over-allocating DSI SQT cache if dsi_sqt_max_cache_size is still at the default.

As mentioned earlier, the SQT thread is responsible for sorting the transactions into commit order. In order to better understand the performance implications of this (and the output of admin who, sqt), it is best to understand how the SQT works.

SQT Sorting Process

The SQT sorts transactions by using 4 linked lists, often referred to (confusingly enough) as “queues”. These lists are:

Open – The first linked list that transactions are placed on, this queue is a list of transactions for which the commit record has not been processed or seen by the SQT thread yet.

Closed – Once the commit record has been seen, the transaction is moved from the “Open” list to the closed list and a standard OpenServer callback is issued to the Distributor thread (or DSI, although this is internal to the DSI as will be discussed later in the outbound section).

Read – Once the DIST or DSI threads have read the transaction from the SQT, the transaction is moved to the “Read” queue.

Truncate – Along with the Open queue, when a transaction is first read in to the system, the transaction structure record is placed on the Truncate queue. Only after all of the transactions on a block have had the commit statements read and been processed by the DIST and placed on the read queue can the SQT request the SQM to delete the block.

To get a better idea how this works, consider the following example of three transactions committed in the following order at the primary database:

Figure 19 – Example Transaction Execution Timeline (legend: BTn/CTn mark the begin/commit pair for transaction n; each statement is labeled with its DML operation – Insert, Update or Delete – its transaction id, and its statement id)

In this example, the transactions were committed in the order 2-3-1. Due to the commit order, however, the transactions might as well have been applied similar to:

Figure 20 – Example Transactions Ordered by Commit Time

However, the transaction log is not that neat. In fact, it would probably look more like the following:

Figure 21 – Transaction Log Sequence for Example Transaction Execution

After the Rep Agent has read the log into the RS, the transactions may be stored in the inbound queue in blocks similar to the following (assuming blocks were written due to timing and not due to being full):

Figure 22 – Inbound Queue from Example Transactions with Sample Row Ids

The following diagrams illustrate the transactions being read from the SQM by the SQT and sorted via the Open, Closed, Read and Truncate queues within the SQT. After reading the first block (0.0), these four queues will look like the below:

Figure 23 – SQT Queues After Reading Inbound Queue Block 0.0

Note that the transaction is given a transaction structure record (TX1 above) and that the statements read thus far, along with the begin transaction record, are linked in a list on the Open queue. Note also that immediately after the transaction is read from the SQM, its transaction id is recorded in the linked list for the Truncate queue. Continuing on and reading the next block from the SQM yields:

Figure 24 – SQT Queues After Reading Inbound Queue Block 0.1

Having read the second block from the SQM, we encounter the second transaction. So, we begin a second linked list for its statements, while continuing to build the first transaction's list with the statements belonging to it read from the second block. Additionally, we add the new transaction to the Truncate queue. Continuing on and reading the next block from the SQM yields:

Figure 25 – SQT Queues After Reading Inbound Queue Block 0.2

No new transactions were formed, so we are simply adding statements to the existing transaction linked lists. Continuing on yields the following SQT organization:

Figure 26 – SQT Queues After Reading Inbound Queue Block 0.3

At this point, we have all three transactions in progress. Continuing with the next block read from the SQM yields the first commit transaction (for TX2). Since we now have a commit, the transaction’s linked list of statements is simply moved to the “Closed” queue and the DIST thread notified of the completed transaction. This yields an SQT organization similar to:

Figure 27 – SQT Queues After Reading Inbound Queue Block 0.4

Continuing with the next read from the SQM, the DIST is able to read TX2 which causes it to get moved to the “Read” queue and the commit record for TX3 is read, which moves it to the “Closed” queue. This yields an SQT organization similar to:

Figure 28 – SQT Queues After Reading Inbound Queue Block 0.5

At this juncture, you might think that we could remove TX2 from the inbound queue. However, remember that all I/O is done at the block level. In addition, in order to free space, it must be freed contiguously from the front of the queue (block 0.0 in this case). Since the statements that make up TX2 are scattered among blocks that also contain statements for transactions whose commits have not been seen yet, the deletion of TX2 must wait. Continuing on with the last block to be read yields the following:

Figure 29 – SQT Queues After Reading Inbound Queue Block 0.6

At this stage, all transactions have been closed; however, we still cannot remove them from the inbound queue. Remember, this is strictly memory sorting (SQT cache), so if we removed them from the inbound queue now and a system failure occurred, we would lose TX1. We therefore have to wait until it has been read by the DIST. Once that is done, all three transactions would be in the "Read" queue and a contiguous set of blocks could be removed, since all of the transactions on those blocks have been read. If, however, block 0.6 also contained a begin statement for TX4, the deletes could still be done for blocks 0.0 through 0.5. How? The answer is that the SQM flags each row in the queue with a status flag that denotes whether it has been processed. Consequently, on restart after recovery, the SQT doesn't attempt to re-sort and resend transactions already processed. Instead, it simply starts with the first active segment/row and begins sorting from that point.

SQT Performance Analysis

Now that we see how the SQT works, this should help explain the output of the admin who, sqt command (example copied from Replication Server Reference Manual).

admin who, sqt

    Spid  State            Info
    ----  ---------------  -------------------------
    17    Awaiting Wakeup  101:1 TOKYO_DS.TOKYO_RSSD
    98    Awaiting Wakeup  103:1 DIST LDS.pubs2
    10    Awaiting Wakeup  101 TOKYO_DS.TOKYO_RSSD
    0     Awaiting Wakeup  106 SYDNEY_DS.pubs2sb

    Closed  Read  Open  Trunc
    ------  ----  ----  -----
    0       0     0     0
    0       0     0     0
    0       0     0     0
    0       0     0     0

    Removed  Full  SQM Blocked  First Trans  Parsed
    -------  ----  -----------  -----------  ------
    0        0     1            0            0
    0        0     1            0            0
    0        0     0            0            0
    0        0     0            0            0

    SQM Reader  Change Oqids  Detect Orphans
    ----------  ------------  --------------
    0           0             0
    0           0             0
    0           0             1
    0           0             1

The observant will notice that not all of the SQT threads are listed: the ones for the inbound queues (designated with qid:1) are present, but the ones for the outbound queues (designated qid:0) are missing. The reality is that there is no SQT thread for outbound queues. Instead, the DSI (Scheduler) calls the SQT routines. Consequently, spids 10 & 0 above represent DSI threads performing SQT library calls. For this section, we are going to concentrate on the SQT thread aspect – however, remember that it applies to the DSI SQT module as well. Differences will be discussed in the section on the DSI later. The columns in the output are described in the table below:

Column Meaning

Spid Process Id for each SQT thread

State State of the processing for each SQT thread

Info Queue being processed

Closed Number of transactions in the “Closed” queue waiting to be read by DIST or DSI. If a large number of transactions are “Closed”, then the next thread (DIST or DSI-Exec) is the bottleneck as the SQT is simply waiting for the reader to read the transactions.

Read Number of transactions in the "Read" queue. This is essentially the number of transactions processed but not yet deleted from the queue. A high number here may point to a long transaction that is still "Open" at the very front of the queue (i.e. a user went to lunch), as deleting queue space is otherwise fairly quick.

Open Number of transactions in the “Open” queue for which commit has not been seen by SQT yet (although SQM may have written it to disk already)

Trunc Number of transactions in the “Truncate” queue – essentially an ordered list of transactions to delete once processed in disk contiguous order. Trunc is the sum of the Closed, Read, and Open columns (due to reasons discussed above).

Removed Number of transactions removed from cache. Transactions are removed if the cache becomes full or the transaction is a large transaction (discussion later)

Full Denotes if the SQT cache is currently full. Since this is a transient counter, you may wish to monitor the "removed" counter to detect if transactions are getting removed due to cache being full.

SQM Blocked 1 if the SQT is waiting on SQM to read a message. This state should be transitory unless there are no closed transactions.

First Trans This column contains information about the first transaction in the queue and can be used to determine if it is an unterminated transaction. The column has three pieces of information: ·ST: Followed by O (open), C (closed), R (read), or D (deleted) ·Cmds: Followed by the number of commands in the first transaction ·qid: Followed by the segment, block, and row of the first transaction An example would be ST:O Cmds: 3245 qid: 103.5.23 – which basically tells you that at this stage, the first transaction in the queue is still “Open” (no commit read by SQT) and so far it has 3,245 commands in the transaction (probably a large one) and begins in the queue at segment 103 block 5 row 23. As we will see later, this is a very useful piece of information.

Parsed The number of transactions that have been parsed. This is the total of transactions including those already deleted from the queue. Along with statistics, this field can give you an idea of the transaction volume over time.

SQM Reader The index of the SQM reader handle. If multiple readers of an SQM, this designates which reader it is.

Change Oqids Indicates that the origin queue ID has changed. Typically this only happens in Warm Standby after a switch active.

Detect Orphans Indicates that it is doing orphan detection. This is largely only noticed on RSI queues. For normal database queues, if someone does not close their transaction when the system crashes, on recovery, the Rep Agent will see the recovery checkpoint and instruct the SQM to purge all the open transactions to that point.

Admin who, sqt is one of the key commands for diagnosing inbound queue performance problems. In addition to helping you identify the progress of transactions through the Open, Closed, Read and Truncate queues, it is extremely useful for determining when you have encountered a large transaction – or one that is being held open for a very long time. The column that assists in this is the "First Trans" column. Above we gave an example of one view of a large transaction (ST:O Cmds: 3245 qid: 103.5.23). Consider the following tips for this column:

ST: Cmds Qid Possible Cause

O increasing same Large transaction

O same same Rep Agent down or uncommitted transaction at primary

O changes increasing SQT processing normally

C changes slow SQT reader not keeping up (DIST or DSI)

C same same DIST down, outbound queue full

R same same Transaction on same block/queue still active

It is important to recognize that this is the first transaction in the queue – which, especially for the outbound queue, could have been delivered already. The inbound queue is even more confusing – the transaction may have already been processed, but the space has not yet been truncated from the queue by the SQM. This is especially true if sqt_init_read_delay and sqt_max_read_delay are not set down to 1000 milliseconds (1 second).

Common Performance Problems

The most common problems with the SQT are associated with 1) large transactions; and 2) slow SQT readers (i.e. the DIST or DSI). The first deals with the classic 10,000 row delete. If the SQT attempted to cache all of the statements for such a delete in its memory, it would quickly be exhausted. Consequently, when a large transaction is encountered, the SQT simply discards all of its statements and merely keeps the transaction structure in the appropriate list. However, this means that in order for the transaction to be passed to the SQT reader, the SQT must go back to the beginning of the transaction and physically rescan the disk. In addition to the slowdown of simply doing the physical I/O, it effectively pauses scanning at the point the SQT had reached until that transaction is fully read back off disk and sent to the DIST, etc. It also impacts Replication Agent performance, as this likely involves a large number of read requests to refetch all of the same blocks – adding to the workload of the SQM that is busy trying to write.

The second problem is common as well. In cases where the DIST or DSI threads cannot keep up, the Closed queue continues to grow until all available SQT cache is used. Once this begins to happen, the SQT has a decision to make. If there are transactions in the Closed or Read queue, the SQT simply halts reading from the SQM until those transactions are processed and the cache can be truncated. If there are no transactions in the Closed or Read queue, the SQT finds the largest transaction in the Open queue, discards its statements (keeping the transaction structure – similar to a large transaction) and then keeps processing. Should this continue for very long, a large number of transactions in the SQT cache may have to be rescanned – further slowing down the overall process.

SQT Performance Tuning

To control the behavior of the SQT, there are a couple of configuration parameters available:

Parameter RS Meaning

sqt_max_cache_size (Default: 1MB; Recommendation: 4MB)

11.x Memory allocated per connection for SQT cache. Note that this is a maximum – RS dynamically allocates it as needed by the connection and then deallocates it when no longer needed. Consequently, this is frequently oversized, and customers often don't understand why continuing to increase it has no effect. Values above 4MB need to be considered very cautiously and only when transactions are being removed and the cache has been exceeded. The reason is that oversizing this can drive the DSI to spend its time filling cache rather than issuing SQL, due to the default value of dsi_sqt_max_cache_size. See discussion below.

dist_sqt_max_cache_size (Default: 0??; Recommendation: 4MB)

12.6+ 15.0+

This new parameter was added in RS 12.6 ESD #7 as well as RS 15.0 ESD #1. In the past, all connections used sqt_max_cache_size for the inbound queue processing by the SQT regardless of requirement. By adding this parameter, individual inbound queue SQT cache sizes can be tuned similar to DSI SQT cache sizes.

dsi_sqt_max_cache_size (Default: 0; Recommendation: 1MB)

11.x If other than zero, the amount of memory used by the DSI thread for SQT cache. If zero, the memory used by DSI is the same as the sqt_max_cache_size setting. The default of 0 is clearly inappropriate if you start adjusting sqt_max_cache_size. See discussion below.

sqt_init_read_delay (Default: 2000; Min: 1000; Max: 86,400,000 (24 hrs); Recommendation: 10)

12.5+ The length of time in milliseconds that an SQT thread sleeps while waiting for an SQM read before checking to see if it has been given new instructions in its command queue. With each expiration, if the command queue is empty, SQT doubles its sleep time up to the value set for sqt_max_read_delay.

sqt_max_read_delay (Default: 10000; Min: 1000; Max: 86,400,000 (24 hrs); Recommendation: 50)

12.5+ The maximum length of time an SQT thread sleeps while waiting for an SQM read before checking to see if it has been given new instructions in its command queue.

There are two main ways of improving SQT performance. The first is rather obvious – increase the amount of memory that the SQT has by changing the value of sqt_max_cache_size. By default, the SQT has 1MB for each inbound and outbound queue. So, for a total of 2 source and 5 destination databases we would have 14 (2 source inbound/outbound and 5 destination inbound/outbound) 1MB memory segments for SQT cache. However, 1MB is typically too little. Most medium production systems need 2MB SQT caches, with high volume OLTP systems using anywhere from 4MB up. Obviously, the more connections you have, the more this impacts overall Replication Server memory settings. With a 4MB sqt_max_cache_size setting, the earlier example of 2 sources/5 destinations would require 56MB strictly for SQT cache – provided that all SQT caches are completely full. Earlier we had the following table:

Configuration Normal Mid Range OLTP High OLTP

sqt_max_cache_size 1-2MB 1-2MB 2-4MB 8-16MB

dsi_sqt_max_cache_size 512KB 512KB 1MB 2MB

memory_limit 32MB 64MB 128MB 256MB

In which these were defined by:

Normal – thousands to tens of thousands of transactions per day
Mid Range – tens to hundreds of thousands of transactions per day
OLTP – hundreds of thousands to millions of transactions per day
High OLTP – millions to tens of millions of transactions per day

By transactions, we are referring to DML based transactions (unfortunately sp_sysmon reports all). However, notice that for most OLTP systems, only a 2-4MB sqt_max_cache_size is truly all that is necessary. Higher than this is really only necessary in very high volume systems that have periodic/regular large transactions. The rationale is that normal OLTP transactions will cycle through the SQT cache so quickly that the SQT cache will likely not use very much memory. However, to avoid problems caused by rescanning, sizing the SQT cache to contain the periodic large transactions will allow the SQT to avoid the hit.
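To translate the sizing guidance above into commands: sqt_max_cache_size is a server-level setting, while dsi_sqt_max_cache_size can be tuned per connection. A minimal isql sketch with example values only – the connection name RDS.rdb is a placeholder, and connection-level parameters generally take effect only after the connection is suspended and resumed:

    -- server-wide SQT cache ceiling (example: 4MB)
    configure replication server set sqt_max_cache_size to '4194304'
    go
    -- per-connection DSI SQT cache (example: 1MB)
    suspend connection to RDS.rdb
    go
    alter connection to RDS.rdb set dsi_sqt_max_cache_size to '1048576'
    go
    resume connection to RDS.rdb
    go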

Even 2-4MB of SQT cache may be a bit excessive. If you think about it, if each source database is replicating to individual destination systems (one to 2 and the other to 3), the outbound queue will contain "sorted" transactions provided that no other DIST thread is replicating into the destination. As a result, the full SQT cache may not be needed by the DSI for transaction sorting – and it can be adjusted down on a connection-by-connection basis via dsi_sqt_max_cache_size. However, if using Parallel DSI, the DSI may need SQT cache to keep up with the parsing requirements of the multiple DSIs. The latter (Parallel DSI) case is best dealt with by adjusting dsi_sqt_max_cache_size separately from sqt_max_cache_size. The tendency to oversize SQT cache has led to some concern within Sybase Replication Server engineering, prompting the following statement:

Prior to RepServer 12.6, typical tuning advice was to increase sqt_max_cache_size so that there are plenty of closed transactions ready to be distributed or sent to the replicate database when RepServer resources handling those efforts became available. Starting with 12.6 that advice no longer applies. Due to SQT behavior modifications associated with the SMP feature, the best advice for correctly sizing SQT (for either the sqt_max_cache_size or the dsi_sqt_max_cache_size configuration) is to set it large enough so that transactions removed from SQT cache never occur or only infrequently, but not much larger than that. Transactions are removed from SQT cache forcing them to be re-read from the queue when needed, whenever SQT cache contains no closed or read transactions (that is, no transactions to be distributed or to be deleted after having been distributed) and cache is full. In these cases, SQT will remove the statements of undistributed transactions from cache in order to make room for more transactions until it is able to cache one that can be distributed or until some distributed transactions can be deleted. You can monitor the removed transaction count by watching counter 24009 - "TransRemoved". Typically, if this counter does not report more than 1 removed transaction in any 5-minute period, transaction removal rate may be considered acceptable. To help determine the proper setting of sqt_max_cache_size and dsi_sqt_max_cache_size, refer to counter 24019 - "SQTCacheLowBnd". This counter captures the minimum SQT cache size at any given moment, below which transactions would have been removed. Monitor this value frequently during a period of typical transaction flow, and configure SQT cache to be no more than about 20% greater than the largest value observed.

Arguably, this was true even prior to RS 12.6, as SQT cache was frequently oversized on many systems. The rationale for the above statement is that in implementing the SMP logic, the SQT processing logic was altered to favor filling the cache over providing cached transactions to clients such as the DIST and DSI threads. As a result, latency was sometimes introduced simply by the SQT thread waiting to fill huge caches allocated by the DBA instead of passing the transactions on. It became crucial, then, in RS 12.6 and RS 15.0 to "right-size" the SQT cache rather than oversize it.

One way to detect either of these two situations is to watch the system during periods of peak activity via the admin who, sqt command. As taught/mentioned in the manuals, if the "Full" column is set to 1, it is a possible indication that SQT cache is undersized – particularly on the inbound processing side. However, the best indication from admin who, sqt is the "Removed" column. If the "Removed" column is growing and the transactions are not large, then it is probable that the cache was filled to capacity several times and multiple transactions not normally considered large were removed to make room. However, the absolute best (and most accurate) way to determine cache sizing is to use the monitor counters.

SQT Counters

SQT Thread Monitor Counters

The following counters are available in RS 12.6 to monitor the SQT thread.

Counter Explanation

CacheExceeded (a useless counter)

Total number of times that the sqt_max_cache_size configuration parameter has been exceeded.

CacheMemUsed SQT thread memory use. Each command structure allocated by an SQT thread is freed when its transaction context is removed. For this reason, if no transactions are active in SQT, SQT cache usage is zero.

ClosedTransRmTotal Total transactions removed from the Closed queue.

ClosedTransTotal Total transactions added to the Closed queue.

CmdsAveTran Average number of commands in a transaction scanned by an SQT thread.

CmdsLastTran Total commands in the last transaction completely scanned by an SQT thread.

CmdsMaxTran Maximum number of commands in a transaction scanned by an SQT thread.

CmdsTotal Total commands read from SQM. Commands include XREC_BEGIN, XREC_COMMIT, XREC_CHECKPT.

EmptyTransRmTotal Total empty transactions removed from queues.

MemUsedAveTran Average memory consumed by one transaction.

MemUsedLastTran Total memory consumed by the last completely scanned transaction by an SQT thread.

MemUsedMaxTran Maximum memory consumed by one transaction.

OpenTransRmTotal Total transactions removed from the Open queue.

OpenTransTotal Total transactions added to the Open queue.

ReadTransRmTotal Total transactions removed from the Read queue.

ReadTransTotal Total transactions added to the Read queue.

TransRemoved Total transactions whose constituent messages have been removed from memory. Removal of transactions is most commonly caused by a single transaction exceeding the available cache.

TruncTransRmTotal Total transactions removed from the Truncation queue.

TruncTransTotal Total transactions added to the Truncation queue.

These changed in RS 15.0 to the following set:

Counter Explanation

CmdsRead Commands read from SQM. Commands include XREC_BEGIN, XREC_COMMIT, XREC_CHECKPT.

OpenTransAdd Transactions added to the Open queue.

CmdsTran Commands in transactions completely scanned by an SQT thread.

CacheMemUsed SQT thread memory use. Each command structure allocated by an SQT thread is freed when its transaction context is removed. For this reason, if no transactions are active in SQT, SQT cache usage is zero.

MemUsedTran Memory consumed by completely scanned transactions by an SQT thread.

TransRemoved Transactions whose constituent messages have been removed from memory. Removal of transactions is most commonly caused by a single transaction exceeding the available cache.

TruncTransAdd Transactions added to the Truncation queue.

ClosedTransAdd Transactions added to the Closed queue.

ReadTransAdd Transactions added to the Read queue.

OpenTransRm Transactions removed from the Open queue.

TruncTransRm Transactions removed from the Truncation queue.

ClosedTransRm Transactions removed from the Closed queue.

ReadTransRm Transactions removed from the Read queue.

EmptyTransRm Empty transactions removed from queues.

SQTCacheLowBnd The smallest size to which SQT cache could be configured before transactions start being removed from cache.

SQTWakeupRead An SQT client awakens the SQT thread who is waiting for a queue read to complete.

SQTReadSQMTime The time taken by an SQT thread (or the thread running the SQT library functions) to read messages from SQM.

SQTAddCacheTime The time taken by an SQT thread (or the thread running the SQT library functions) to add messages to SQT cache.

SQTDelCacheTime The time taken by an SQT thread (or the thread running the SQT library functions) to delete messages from SQT cache.

SQTOpenTrans Current open transaction count.

SQTClosedTrans Current closed transaction count.

SQTReadTrans Current read transaction count.

As mentioned earlier, the average, total and max counters are replaced in RS 15.0 with individual columns in rs_statdetail. However, the three new time tracking counters above (SQTReadSQMTime, SQTAddCacheTime, and SQTDelCacheTime) could be interesting if there is a latency within the SQT.

The most important SQT counters are:

CmdsPerSec = CmdsTotal/seconds
OpenTransTotal, ClosedTransTotal, ReadTransTotal
CmdsAveTran, CmdsMaxTran
CacheMemUsed, MemUsedAveTran
CachedTrans = CacheMemUsed/MemUsedAveTran
TransRemoved (vs. CacheExceeded)
EmptyTransRmTotal

Again, the first one (CmdsPerSec) establishes a rate – hopefully it compares well to the rate from the RA thread. The second set (OpenTransTotal, ClosedTransTotal, ReadTransTotal) refers to the Open, Closed, Read and Truncate transaction lists used by the SQT for sorting. The real goal is to see that all three are nearly identical. If ClosedTransTotal starts to lag behind OpenTransTotal, the most likely culprit is a series of large transactions. However, this is not as common as ReadTransTotal lagging behind Closed. In the latter case, either the DIST is not able to keep up (due to bad STS cache settings or a slow outbound queue), or a large number of large transactions were committed and, in order to pass them to the DIST (which is when a transaction moves from Closed to Read), the whole transaction had to be rescanned from disk. A third alternative is that the SQT cache is too big: since the SQT prioritizes reading over servicing the DIST (with freeing space from the SQM dead last), too much SQT cache can be a problem as well. If this happens, increasing sqt_init_read_delay slightly may help (as the SQT will be forced to find something else to do).

The way to find the cause is to look at the next set of counters. These report the average and maximum number of commands per transaction. They can be really useful for spotting those bcp's where someone is not using -b, as well as for getting a picture of the transaction profile at the origin from a sizing perspective (which will be useful for DSI tuning). If CmdsMaxTran is high, then it is likely a transaction was removed from cache, and that may be the cause of ReadTransTotal lagging (more on this later).

The next set of counters (CacheMemUsed, MemUsedAveTran) is also very interesting – especially when combined with the derived value 'CachedTrans'. From these, we can see how much SQT cache was actually used by this SQT and the average memory per transaction. From the inbound queue's perspective, we likely only care about CacheMemUsed – monitoring how much memory we are actually using and whether we need to increase it (if TransRemoved > 0). If we do need to increase it, MemUsedAveTran gives us a good multiple to increase by (i.e. to add cache for another 100 transactions, simply multiply MemUsedAveTran by 100). However, these counters are most useful for DSI tuning. For example, we cannot group transactions if they are not in cache – so if we are using 5 parallel DSIs and have dsi_max_xacts_in_group at 20, we will need enough cache for at least 100 transactions, and likely double that number (so if CachedTrans is <100, we will need to increase dsi_sqt_max_cache_size); a worked sizing sketch follows. Most often, you will find that sites have left dsi_sqt_max_cache_size at 0, which means it inherits the value of sqt_max_cache_size – which is likely oversized, and now the DSI/SQT module spends its time stuffing the cache instead of giving the DSIEXECs transactions to work on.
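As a rough illustration of that grouping arithmetic (all numbers below are hypothetical placeholders, not taken from the customer data – plug in your own MemUsedAveTran, parallel DSI count, and dsi_max_xacts_in_group), the target dsi_sqt_max_cache_size can be estimated with a simple calculation in SQL:

    -- illustrative arithmetic only; substitute values observed on your system
    declare @mem_per_tran int, @parallel_dsi int, @xacts_per_group int
    select @mem_per_tran    = 1500,  -- bytes, from the SQT counter MemUsedAveTran
           @parallel_dsi    = 5,     -- number of parallel DSI threads
           @xacts_per_group = 20     -- dsi_max_xacts_in_group
    -- aim for roughly 2x (parallel DSIs x transactions per group) cached transactions
    select suggested_dsi_sqt_max_cache_size =
           @mem_per_tran * @parallel_dsi * @xacts_per_group * 2
    go

The result (300,000 bytes in this made-up case) would then be applied per connection via alter connection ... set dsi_sqt_max_cache_size, as shown earlier.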

Of all of these, the TransRemoved counter is likely the most critical. If TransRemoved is 0, adding SQT cache is a useless exercise and may actually contribute to the problem. Additionally, if TransRemoved is occasionally > 0 but CmdsMaxTran is 1,000,000, you likely don't have enough memory to cache that transaction anyhow. However, if you frequently see TransRemoved > 0, you may want to add more SQT cache by increasing sqt_max_cache_size. The key here is that just about any non-zero value occurring often is a problem – so thinking that a low value (like a steady value < 10) means it is not a problem is just plain wrong. Additionally, sqt_max_cache_size is a server setting that applies to all connections – so before decreasing it, you may want to check all your connections, and do not decrease it if any show TransRemoved > 0 that is not attributable to the once-nightly batch job or some other obscenely large transaction.

Notice that we focused on TransRemoved. CacheExceeded is kind of like the admin who, sqt "Full" column – it merely indicates that the cache was full at some point (which is what the SQT is busy trying to do anyway). As transactions are read and then truncated from the SQT cache, this value changes rapidly because the newly available space is filled quickly by the SQT. If using admin who, sqt, the Full column likely blinked between 0 & 1 so fast that, like a 60Hz light bulb, it appears constantly on. This is so useless a metric that the counter was removed in RS 15.0.

The last counter (EmptyTransRmTotal) is a good indicator of poor application design. If you see a lot of empty transactions, it is usually because everything is being done in isolation level 3 or in chained mode. The latter is especially common (and often unplanned) with Java applications, as the default behavior is to execute all procedures in chained mode. Even if no rows were modified in the proc (selects only), since a commit was registered, the empty transaction is flushed to the transaction log (think of the performance implications there – and the log semaphore) and then replicated. Another common source of this prior to ASE 12.5.2 was system commands – such as reorgs – which used a plethora of small empty transactions to track progress. So if the RA and/or SQMR is lagging and you have a high number of EmptyTransRmTotal, it is time to either upgrade to ASE 12.5.3+ or hunt your developers down to see if they are running everything in chained mode or isolation level 3 for some reason.
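If chained-mode procedure executions are suspected, the transaction mode of a suspect procedure can be checked and relaxed at the primary ASE. A minimal sketch using the standard sp_procxmode and set chained commands; the procedure name upd_order_status is hypothetical:

-- report the current transaction mode of the procedure
sp_procxmode upd_order_status
go
-- a select-only procedure can be marked to run in either mode
sp_procxmode upd_order_status, 'anymode'
go
-- the real fix is usually at the client: run read-only work with chained mode off
set chained off
go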

SQT Thread Counter Usage

After the fairly lengthy discussion of how the SQT works, we don’t need a lot of detail here as the Open, Closed, Read and Trunc prefixes make the counters fairly intuitive. Instead, let’s skip to looking at the customer data:


Sample Time  CmdsWritten (SQM)  SQT CmdsTotal  CmdsMaxTran  OpenTransTotal  ClosedTransTotal  ReadTransTotal  CacheExceeded  CacheMemUsed  TransRemoved

0:29:33 268,187 268,502 215,196 21,031 21,031 21,131 733 324,608 6

0:34:34 364,705 336,528 215,196 29,787 29,790 29,661 4,438 1,430,016 2

0:39:37 253,283 280,586 9,462 65,767 65,766 65,941 10,892 632,320 7

0:44:38 266,334 266,528 9,462 65,192 65,193 65,257 10,382 857,600 3

0:49:40 253,684 253,246 3,448 59,014 59,014 59,035 13,297 1,379,840 7

0:54:43 164,566 165,535 3,442 38,943 38,944 38,933 10,222 1,498,880 3

0:59:45 376,184 347,213 1,723 81,818 81,817 81,678 21,159 2,091,776 10

1:04:47 450,809 432,871 72,313 83,471 83,469 83,465 27,029 1,944,832 5

1:09:50 326,750 374,994 3,442 84,597 84,597 84,806 24,705 2,103,040 15

1:14:52 325,340 327,038 1,723 73,442 73,443 73,213 15,644 1,967,104 17

1:19:54 317,674 318,111 93 76,525 76,525 76,441 5,240 1,750,528 0

Now then, this customer had sqt_max_cache_size set at 2,097,152 bytes (2MB) and dsi_sqt_max_cache_size at 0. Also, monitoring had been ongoing for more than 10 hours when this slice of the sampling was taken – and the system was busy the entire time. As a result, this represents a 'steady state' of the server. With this in mind, let's take a look at these metrics.

SQT CmdsTotal vs. SQM CmdsWritten – This represents the lag of the SQT in reading from the inbound queue as commands arrive. We said earlier that often the best starting point is to compare the "Cmds" counters in each counter module through the RS. In this case, the SQT is keeping up, reading the commands almost as soon as they arrive (when the SQM writes them). It did get behind in the 1:00am time frame when the cache filled, but then caught back up quickly. Any latency in the system is not due to the SQT; however, that does not mean that it is tuned properly.

CmdsMaxTran – This is a very interesting statistic as it indicates the largest transaction processed during that sample period. While it might be tempting to use CmdsAveTran, the problem is that a lot of small transactions could mask the effect of a large transaction when it hits. This metric is most useful in conjunction with TransRemoved to determine whether raising sqt_max_cache_size would be of benefit. Note especially the extremely large transaction at the beginning, the fairly consistently large transactions throughout, and the small transaction at the end.

OpenTransTotal, ClosedTransTotal, ReadTransTotal – It should be fairly obvious what these refer to – the "Open", "Closed" and "Read" transaction lists. The goal is that these should be nearly identical during the sample period – meaning that transactions are added to the SQT cache, the commit is found, and the transaction is passed to the DIST thread quickly enough that no discernible lag is evident. The problem is that the SQT gives priority to filling the cache over servicing the DIST; as a result, it is not unusual for ReadTransTotal to lag behind ClosedTransTotal until sqt_max_cache_size is reached. At that point, ReadTransTotal will start mimicking ClosedTransTotal. The reason is that the SQT can't put any more transactions into the cache until it removes one – so once the cache is full, a new transaction can't be read from the inbound queue until one is read by the DIST. This isn't obvious in the above statistics, as the stats were from RS 12.1, before the change in SQT processing that came with the SMP implementation in 12.6.

CacheMemUsed – This is a very interesting counter. Not only does it help in sizing sqt_max_cache_size by showing the high-water mark during each sample interval, it also shows the dynamic allocation and deallocation of memory within each SQT cache. In this case, we have 2MB configured – but at the beginning we are only using about 300K. This grows to 1.4MB and then drops back down to 600K before growing successively until the max is reached.


TransRemoved – This is one of the more important counters. Looking at the above, we note that nearly every sample interval has transactions removed, clearly indicating the SQT cache is undersized. If transactions were only removed during the first several sample intervals, this might not be true. Consider that a 200,000 row transaction averaging a 1K command size (SQM counter CmdSizeAverage) would need 200MB of SQT cache to be fully contained. This is impractical, as the next large transaction (likely a bcp, as it was in this case) may have 500,000 rows. Consequently, you don't tune sqt_max_cache_size to fully cache extraordinarily large transactions that occur periodically. In the above case, however, we see fairly consistent transaction sizes in the 3,000-9,000 row range (suggesting a 4-10MB cache). Additionally, the cache is completely full twice around 1:00am when the number of transactions peaks at ~80,000.

Consequently, this system would benefit from increasing sqt_max_cache_size to 16MB (16,777,216). This value is actually high, but is based on providing padding over the largest transaction that we really want to cache (the 9,000 command transactions, assuming a 1,500 byte command size). While an 8MB SQT cache may be usable, increasing it to 32MB is unlikely to have any benefit over 16MB. However, if we do raise this, we should make sure that dsi_sqt_max_cache_size is explicitly set to 1-2MB. Without doing this, we allocate 16MB of cache for the DSI thread – which really doesn't need it. As a result, the DSI Scheduler will spend its time filling the DSI SQT cache instead of yielding its time to the DSI EXEC threads to process the SQL statements. It has been shown that oversizing the SQT cache can lead to performance degradation as a result.
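In RCL terms, the recommendation above looks something like the following sketch (the connection name RDS.rdb is illustrative; values are in bytes, and depending on the RS version a suspend/resume of the connection – or a Replication Server restart – may be needed for the new sizes to take full effect):

-- 16MB SQT cache for the inbound SQT threads (server-wide)
configure replication server set sqt_max_cache_size to '16777216'
go
-- keep the DSI's private SQT cache small so it does not inherit the full 16MB
suspend connection to RDS.rdb
go
alter connection to RDS.rdb set dsi_sqt_max_cache_size to '2097152'
go
resume connection to RDS.rdb
go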

Distributor (DIST) Processing

Earlier we showed the inbound process flow from the inbound queue to the outbound queue using the following diagram:

Figure 30 – Data Flow Through Inbound Queue and SQT to DIST and Outbound Queue

This time, we will be focusing on the Distributor (DIST) thread. Of all the processes in the Replication Server, the DIST thread is probably the most CPU intensive. The reason for this is that the DIST thread is the “brains” behind the Rep Server – determining where the replicated data needs to go. In order to determine the routing of the messages, the DIST thread will call three library routines - the SRE, TD and MD as depicted above. These library routines are discussed below.

Subscription Resolution Engine (SRE)

The Subscription Resolution Engine (SRE) is responsible for determining whether there are any subscribers to each operation. Overall, the SRE performs the following functions:

• Unpacks each operation in the transaction.
• Checks for subscriptions to each operation.
• Checks for subscriptions to publications containing articles based on the repdef for each operation.
• Performs subscription migration where necessary.

For the most part, the SRE simply has to do a row-by-row comparison for each row in the transaction. A point to consider is that the begin/commit pairs in the transaction were effectively removed by the SQT thread, and the transaction information (transaction name, user, commit time, etc.) is all part of the transaction control block in the SQT cache. This is important as the TD module will make use of this information; for now, the SRE simply has to check for subscriptions on the individual operations. The reason the SRE looks at the individual operations is that not all tables may be subscribed to by all the sites – consequently, a transaction that affects multiple tables still needs to have the respective operations forwarded accordingly.

Subscription Conditions

To maintain performance, the SRE is a very lean/efficient set of library calls that only supports the following types of conditionals:

• Equality – for example col_name = constant. A special type of equality permitted with rs_address columns is a bit-wise comparison using the logical AND (&) operator.

• Range (unbounded and bounded) – for example col_name < constant or col_name > low_value and col_name < high_value

• Boolean AND conditionals

Note that several (sometimes disturbing to those new to Replication Server) forms of conditionals are not supported:

• Functions, formulas or operators (other than & with rs_address columns) are not supported.
• Boolean OR, NOT, XOR conditionals. Boolean OR conditionals are easily accomplished by simply creating two subscriptions – one for each side of the OR clause.
• Not equals (!=, <>) comparators. However, this is easily bypassed by treating the situation like a non-inclusive range. For example (col_name != "New York") becomes (col_name < "New York" OR col_name > "New York"), which is handled simply by using two subscriptions (see the sketch after this list). For "not null" comparisons, a single subscription based on col_name > '' (note the empty string and use of single quotation marks) is sufficient. Incidentally, this trick is borrowed from the SQL optimization trick of switching column!=null to column>char(0) – the ANSI C equivalent for NUL.
• Columns contained in the primary key cannot have rs_address datatypes.
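As a sketch of the two-subscription workaround, using the col_name != "New York" example above (repdef, subscription and connection names are illustrative):

-- col_name != 'New York' expressed as a non-inclusive range via two subscriptions
create subscription titles_sub_low
for titles_rep
with replicate at SYDNEY_DS.pubs2
where col_name < 'New York'
go
create subscription titles_sub_high
for titles_rep
with replicate at SYDNEY_DS.pubs2
where col_name > 'New York'
go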

It should also be pointed out that the SRE does not check to see if a site subscribes more than once. For example, a given replication definition could specify that last name, city, and state are subscribable columns. If a destination wants to subscribe to all authors in Dublin, CA or with a last name of 'Smith', care needs to be taken to avoid a duplicate row situation. Simply creating two subscriptions – one specifying last_name='Smith' and the other specifying city='Dublin' and state='CA' – will result in an overlapping subscription and cause the destination to receive duplicate rows.

It should be noted that the next discussion – while focusing on rs_address columns – has a secondary purpose in illustrating how subscription rules can impact implementation choices.

The biggest restriction is that for any subscription, each searchable column can only participate in a single conditional (a range condition constructed by two where clauses is considered a single conditional). A good example of how this impacts replication can be seen in the treatment of rs_address columns. Many Replication System Administrators complain that the rs_address column isn’t as useful as it could be for several reasons:

• It only supports 32 bits – restricting them to 32 sites in the organization.
• If the rs_address column is the only column changed, the update is not replicated – problematic for standby databases using repdefs and subscriptions vs. the Warm Standby feature.
• The bit-wise AND operation for the subscription behaves as col_name & value > 0 vs. col_name & value = value. This causes a problem described later in this section.

As a result, as their business grows, they have to add more rs_address columns, requiring considerable logic to be programmed into the application or database triggers to support replication. While one rs_address column is easily managed, they are reluctant to add more. This is a valid complaint if you think of the bits one-dimensionally as sites. Of course, using the rs_address column as an integer and subscribing with a normal equality (for example, subscribing where site_id = 123 vs. subscribing where site_id & 64) extends this nearly infinitely; however, if the data modification is intended for multiple sites, this could require multiple updates to the same rows and cause subscription migration issues. An alternative solution (but one that doesn't work, as we will see) might be to think of the bits in the rs_address column as components similar to class B and class C Internet addresses. High order bytes could be associated with countries or regions while the low order bits identify specific sites within those regions. Consider the following examples of bit-wise addressing:

Bit Addressing Total Sites Comments

4 – 28 112 Could be 4 World Regions – each with 28 locations

8 – 24 192 World Region – Location

16 – 16 256 Country/Region – Location

4 – 4 – 24 384 World Region – Country – Location

4 – 8 – 16 512 Hemisphere – Country/Region – Location

4 – 4 – 4 – 20 1280 Hemisphere – Country – Region – Location

4 – 4 – 8 – 16 2048 Hemisphere – Country – Region – Location

4 – 4 – 8 – 8 – 8 8192 Hemisphere – Country – Region – District – Office

While this does expand the number of conditions that must be checked, it logically fits with the distribution rules the application may be trying to implement and is therefore mentally easier to implement. Additionally, in the above, we treated each as a separate individual location. If the last bit address represented a region or "cell", then the number of sites addressable with each scheme extends another order of magnitude. However, it should be noted that this scheme (if it worked) would only work in cases where data is intended solely to be distributed to a single Region or District (the next-to-last division) or a single location. Otherwise, the same subscription migration issue would occur that plagues a single integer-based scheme – updates setting the value first to one value and then another, in an attempt to send to more than one location, migrate the data from one location to the other instead of sending it to both.

As mentioned earlier, using multiple rs_address columns or “dimensioning” the rs_address column will result in more conditionals for the SRE to process. For multiple columns, the reason should be obvious – a separate condition would be necessary for each column. However, the same is true for rs_address columns that have been dimensioned – a separate condition would be necessary for each “dimension” at a minimum. This is simply due to the fact that the original intent of the rs_address column was a single dimension of bits. Consequently, when a condition such as (column & 64) returns a non-zero number, the row is replicated. Combining several bits as in (column & 71) could have some unexpected results. Since “71” is 64+4+2+1 (bits 6,2,1, and 0), you might think that this would achieve the goal. However, the way rs_address columns are treated, any column which has bits 6, 2, 1 or 0 on would get replicated to that site – effectively a bitwise “OR”. This includes rs_address values of 3, 129, etc. Since we are allowed to AND conditions together, you might think the way to ensure that exactly the desired value is met is to use multiple conditions as in:

-- my_rsaddr_col is an rs_address column
create subscription titles_sub
for titles_rep
with replicate at SYDNEY_DS.pubs2
where my_rsaddr_col & 64
  and my_rsaddr_col & 4
  and my_rsaddr_col & 2
  and my_rsaddr_col & 1

BUT, we can’t do that!!! Unlike other columns (in a sense), rs_address columns may only appear once in the where clause of a subscription. It results in:

Msg 32027, Level 12, State 0: Server 'SYBASE_RS': Duplicate column named 'my_rsaddr_col' was detected.

The reason is that for any single subscription, a single column can only participate in a single rule (rs_rules table has a unique index on subscription and column number). Consequently, although other columns can appear more than once in a where clause, the union of the conditions must produce a single valid range (single pair of low & high values). For example:

-- Good subscription
create subscription titles_sub
for titles_rep
with replicate at SYDNEY_DS.pubs2
where int_col > 32 and int_col < 64

-- Good subscription (effectively !=32)
create subscription titles_sub
for titles_rep
with replicate at SYDNEY_DS.pubs2
where int_col < 32 and int_col > 32

-- Bad range subscription – should be an OR (two subscriptions)
create subscription titles_sub
for titles_rep
with replicate at SYDNEY_DS.pubs2
where int_col < 32 and int_col > 63

-- Bad range subscription – should be an OR (two subscriptions)
create subscription titles_sub
for titles_rep
with replicate at SYDNEY_DS.pubs2
where int_col = 30 and int_col = 31

Among other things, you can see that this restriction prevents Replication Server from supporting Boolean "OR" conditionals and forces designers to implement multiple rs_address columns. Even if attempting to use a second rs_address column as the Region/District dimension, as depicted above in the two-dimensional break-out, you could incur problems. There is a work-around for the 'OR' problem, of course: use articles/publications overlaying replication definitions/subscriptions. Introduced in RS 11.5, articles allow Boolean ORs as well as references to the same column multiple times in the same where clause. However, the references to the same column must use an OR clause, since within the RSSD an 'AND' clause behaves the same as a normal subscription, while an OR clause constructs multiple where-clause conditions in the RSSD. Consider the following:

create publication rollup_pub
with primary at HQ.db
go
-- illegal article definition
create article titles_art for rollup_pub
with primary at HQ.db
with replication definition titles_rep
where my_rsaddr_col & 64
  and my_rsaddr_col & 8
go
-- legal article definition
create article titles_art for rollup_pub
with primary at HQ.db
with replication definition titles_rep
where my_rsaddr_col & 64
or where my_rsaddr_col & 8
go

It is frustrating that there seems to be no way to bypass the 32 site limit with a single rs_address column. While a theoretical 1,024 sites could be addressed if each dimension supported an even 32 locations, remember that only a single Region/District or location could be the intended target. Additionally, if you think about it for a second, the most common method for updating rs_address columns to set the desired bits is a trigger. Consequently, the original row modification plus the modification in which the bits are set are both processed by Replication Server. As a result, a single replicated change requires 2 updates to the same row – the first being the regular update and the second setting the appropriate bits for distribution. Additional destinations would require additional updates. This leads to n+1 DML operations at the primary for every intended operation – not a good choice, then, if performance is a consideration. Additionally, if a WS system is involved, it ignores updates in which the only changes were to rs_address columns – consequently, after a failover, you may not have an accurate reflection of which sites the last updates were distributed to.

SRE Performance

Performance of the SRE depends on a number of issues that should be fairly obvious:

• Number of replication definitions per table
• Number of subscriptions per replication definition
• Number of conditions per subscription

In order to reduce the number of physical RSSD lookups to retrieve replication definitions, subscriptions and where clauses, the SRE makes use of the System Table Services (STS) cache. Configurable through the replication server configuration sts_cachesize, the STS caches rows from each RSSD table/key combination in a separate hash table. The sts_cachesize parameter refers to the number of rows cached for each RSSD table. For most systems, the default sts_cachesize of 100 is far too low, as it would restrict the system to retaining only the most recently used 100 rows of subscription where clauses, etc. A better starting point is to set sts_cachesize to the greater of the number of columns in repdefs managed by the current Rep Server or the number of subscriptions on the repdefs managed by the current Rep Server. One way to determine how effective the STS cache is, is to turn on the cache statistics trace flag:

trace “on”, STS, STS_CACHESTATS - Collects STS Statistics

This trace flag works prior to RS 12.1; with RS 12.1 and later you can simply use the provided monitor counters. As you can imagine, the largest impact you can have is increasing sts_cachesize to reduce the physical lookups.
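A hedged sizing sketch following the rule above: count the repdef columns and subscriptions in the RSSD and set sts_cachesize to the larger figure (the value below is purely illustrative). On RS 12.6 and later, rs_objects and rs_columns – the tables the DIST hits hardest – can also be fully cached:

-- in the RSSD: rough row counts to pick a starting sts_cachesize
select count(*) from rs_columns
select count(*) from rs_subscriptions
go
-- in the Replication Server
configure replication server set sts_cachesize to '2000'
go
-- RS 12.6+ only: fully cache the hottest tables
configure replication server set sts_full_cache_rs_objects to 'on'
go
configure replication server set sts_full_cache_rs_columns to 'on'
go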

Key Concept #12: The single largest tuning parameter to improve Distributor thread performance is increasing the sts_cachesize parameter in order to reduce physical RSSD lookups.

The biggest bottleneck of the SRE will actually be getting the transactions from the SQT fast enough. Consequently, the sqt_max_cache_size setting is crucial to overall inbound processing. For example, at one customer, a sqt_max_cache_size of 4MB was resulting in considerable latency in processing large transactions being distributed to two different reporting system destinations. Setting the sqt_max_cache_size to 16MB resulted in the inbound queue draining at over 100MB/min. This speed is even more notable when considering that the DIST thread had to write each transaction from the inbound queue to two different outbound queues.

Transaction Delivery

The Transaction Delivery (TD) library is used to determine how the transactions will be delivered to the destinations. The best way to think of this is that while the SRE decides who gets which individual modifications, the TD is responsible for “packaging” these modifications into a transaction and requesting the writes to the outbound queue. For example, consider the following transaction:

begin transaction web_order
insert into orders (customer, order_num, ship_addr, ship_city, ship_state, ship_zip)
    values (1122334, 123456789, "123 Main St", "Anytown", "NY", 21100)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "31245Q", "Chamois Shirt", $25.00, 2, 0, $50.00)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "987652W", "Leather Jacket", $250.00, 1, 0, $250.00)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "54783L", "Welcome Mat", $12.00, 1, 0, $12.00)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "732189H", "Bed Spread Set", $129.00, 1, 0, $129.00)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "30345S", "Volley Ball Set", $79.00, 1, 0, $79.00)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "889213T", "6 Man Tent", $494.00, 1, $49.40, $444.60)
update orders set order_subtotal=$964.60, order_shipcost=$20, order_total=$984.60
commit transaction

Now, picture what happens in a normal replication environment if the source system was replicating to three destinations – each concerned with its own set of rules. For example, Replicate Database 1 (RDB1) might be concerned with clothing transactions (shipping warehouse for clothing), while RDB2 with transactions for household goods, and RDB3 focusing on sporting items. This would result in the following replicate database transactions:

-- replicate database 1 (clothing items)
begin transaction web_order
insert into orders (customer, order_num, ship_addr, ship_city, ship_state, ship_zip)
    values (1122334, 123456789, "123 Main St", "Anytown", "NY", 21100)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "31245Q", "Chamois Shirt", $25.00, 2, 0, $50.00)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "987652W", "Leather Jacket", $250.00, 1, 0, $250.00)
update orders set order_subtotal=$964.60, order_shipcost=$20, order_total=$984.60
commit transaction

-- replicate database 2 (household goods)
begin transaction web_order
insert into orders (customer, order_num, ship_addr, ship_city, ship_state, ship_zip)
    values (1122334, 123456789, "123 Main St", "Anytown", "NY", 21100)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "54783L", "Welcome Mat", $12.00, 1, 0, $12.00)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "732189H", "Bed Spread Set", $129.00, 1, 0, $129.00)
update orders set order_subtotal=$964.60, order_shipcost=$20, order_total=$984.60
commit transaction

-- replicate database 3 (sporting goods)
begin transaction web_order
insert into orders (customer, order_num, ship_addr, ship_city, ship_state, ship_zip)
    values (1122334, 123456789, "123 Main St", "Anytown", "NY", 21100)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "30345S", "Volley Ball Set", $79.00, 1, 0, $79.00)
insert into order_items (order_num, item_num, desc, qty, price, discount, total)
    values (123456789, "889213T", "6 Man Tent", $494.00, 1, $49.40, $444.60)
update orders set order_subtotal=$964.60, order_shipcost=$20, order_total=$984.60
commit transaction

The SRE physically determines what DML rows go to which of the replicates, however, it is the TD that “remembers” that each is within the scope of the outer transaction “web_order” and requests the rows to be written to each of the outbound queues. It accomplishes this through the following steps:

• Looks up the correct queue for each of the destination databases – it is passed a bitmap of the destination databases from the DIST thread (based on the SRE).
• Writes a begin record for each transaction to the destination queue (using the commit OQID).
• For each operation received, adds two bytes to the commit OQID and replaces the operation's OQID with the new OQID based off of the commit record.
• Packs the command into packed ASCII format and writes the command to each of the destination queues (via the MD module).
• Writes a commit record to each of the queues once the entire list of operations has been processed.

Earlier, in the discussion of the makeup of the OQID, we mentioned that the TD module adds two bytes for uniqueness. A frequent question is "Why?". The answer lies in the simple fact that transactions can overlap begin/commit times, and since the original OQIDs are generated in order, sending them through unchanged would de-sort all the work done by the SQT thread. Consider the following points:

• When the Rep Agent forwards commands to the Replication Server, it generates unique 32 byte monotonically increasing OQIDs.
• The job of the SQT thread is to pass transactions to the DIST thread in COMMIT order; therefore the commands the DIST forwards to the TD module may not have increasing OQIDs.
• The SQM thread relies on the increasing OQIDs to perform its duplicate detection.
• In order to prevent the outbound SQM rejecting the commands, the TD library appends a 2 byte counter to the COMMIT record's OQID for all the commands distributed by the TD. Only the DIST thread calls the TD.
  o Why the commit record? Because if your transaction began before someone else's that committed before yours, your begin tran (and other rows) would have lower OQIDs and would really mess things up.
  o So we use the commit record's OQID and add 0001-ffff to each row in the transaction.
• The counter is reset when a NEW begin record is passed to the TD.

Consequently, as each transaction is processed, the TD uses the commit record’s OQID and simply adds a sequential number in the last two bytes. Consider the following scenario in which transaction T1 begins prior to transaction T2, yet commits after:

OQID        Operation
0x04010000  begin t1
0x04020000  insert t1
0x04030000  begin t2
0x04040000  delete t2
0x04050000  insert t1
0x04060000  update t2
0x04070000  insert t2
0x04080000  insert t1
0x04090000  commit t2
0x040A0000  insert t1
0x040B0000  commit t1

The TD would receive T2 first and then T1 and would renumber the OQID’s as follows:


OQID        Operation
0x04090001  begin t2
0x04090002  delete t2
0x04090003  update t2
0x04090004  insert t2
0x04090005  commit t2
0x040B0001  begin t1
0x040B0002  insert t1
0x040B0003  insert t1
0x040B0004  insert t1
0x040B0005  insert t1
0x040B0006  commit t1

As a result, the destination queues now have transactions in commit order with increasing OQIDs to facilitate recovery. This should also explain why some people have a difficult time matching a transaction in the outbound queue to one in the inbound queue when attempting to verify that it is indeed there. You first need to find the commit record for that transaction in the inbound queue – a feat that is not made simple by the fact that it is not always evident which transaction a commit record belongs to. As a result, it is almost always easier to search by the values in each record (i.e. the primary key values).
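When you do need to hunt for a transaction by its key values, the queue contents can be dumped to the Replication Server error log and searched. A rough sketch – the queue number, segment, block and block count are placeholders taken from admin who, sqm output, and the dump_queue argument order should be verified against your version's reference manual:

-- find the queue number and current segment/block (Info column, e.g. 106:0 = outbound)
admin who, sqm
go
-- dump 10 blocks of outbound queue 106 (q_type 0) starting at segment 4, block 1
-- to the errorlog, then search the log for the primary key values
sysadmin dump_queue, 106, 0, 4, 1, 10
go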

Message Delivery

The Message Delivery (MD) module is called by the DIST thread to optimize routing of transactions to data servers or other Replication Servers. The DIST thread passes the transaction row and the destination ID to the MD module. Using this information and routing information in the RSSD, the module determines where to send the transaction:

• If the current Replication Server manages the destination connection, the message is written to the outbound queue via the SQM for the outbound connection.

• If the destination is managed by another Replication Server (via an entry in rs_repdbs), the MD module checks to see if it is already sending the exact same message to another database via the same route. If so, the new destination is simply appended to the existing message. If not, the message is written to the outbound queue via the SQM for the RSI connection to the Replicate Replication Server.

MD & Routing

This last point is crucial to understanding a major performance benefit to routing data – consider the following architecture

Figure 31 – Example World Wide Topology

In the above diagram, if a transaction needs to be replicated to all of the European sites, the NY system only needs to send a single message with all of the European destinations in the header to the London system. Further, due to the multi-tiered aspects of the Pacific arena above, NY would only have to send a single message to cover Chicago, Dallas, Mexico City, San Francisco, Tokyo, Taiwan, Hong Kong, Peking, Sydney Australia, New Delhi. In the past, this has often been touted as a means to save expensive trans-oceanic bandwidth. While this may be true, from a technical perspective, the biggest savings is in the workload required of any one node – allowing unparalleled scalability.

In addition, this performance advantage gained by distributing the outbound workload may make it feasible to implement replication routing even to Replication Servers that may reside on the same host. Take, for example, the following scheme.


Figure 32 – Example Retail Sales Data Distribution

In this scenario, if the Replication System begins to lag, the POS system may be impacted due to the effect the Replication Server could have on the primary transaction log if the Replication System's stable devices are full. While none of the systems are very remote from the POS system, in this case it may make sense to implement an MP Rep Server configuration by using multiple Replication Servers to balance the load.

Figure 33 – Retail Sales Data Distribution using Multiple Replication Servers

Note that in the above example solution, the RS that manages the POS connection does not manage any other database connections. Consequently, that RS can concentrate strictly on inbound queue processing and subscription resolution, while the other three can concentrate strictly on applying the transactions at the replicates. With a 6-way SMP box, all four Replication Servers, along with a single ASE implementation for the RSSD databases, could start making more effective use of the larger server systems they may be installed on.

Key Concept #13: While replication routes offer network bandwidth efficiency, they offer a tremendous performance benefit to Replication Server by reducing the workload on the primary Replication Server. This can be used to effectively create a MP Replication scenario for load balancing in local topologies.

An additional performance advantage of routing in environments with inconsistent network connectivity is that it avoids having the Replication Server apply transactions at the replicate across the unreliable link, where network problems degrade performance through frequent rollbacks and retries caused by lost connections.


MD Tuning

Other than the sts_cachesize and replication routing, the other performance tuning parameter that directly affects the distributor thread is md_sqm_write_request_limit (formerly known as md_source_memory_pool prior to RS 12.1). This is a memory pool specifically for the MD to cache the writes to the SQM for the outbound queues. With previous versions of RS (i.e. 11.x & 12.0), this parameter was frequently missed as the only way to set it was through using the rs_configure stored procedure in the RSSD database. Fortunately, with RS 12.1+, the md_sqm_write_request_limit can be set through the standard alter connection command.

While md_sqm_write_request_limit is a connection scope tuning parameter, it is often misunderstood: it does not apply to destination connections, but rather to the source connection. The reason for this is that we are still discussing the Distributor thread, which is part of the inbound side of Replication Server internal processing. By adjusting md_sqm_write_request_limit/md_source_memory_pool, you allow the source connection's Distributor thread to cache its writes when the outbound SQM is busy, and you enable more efficient outbound queue space utilization. This is especially useful when a source system is replicating to multiple destinations without routing, when a replicate database has more than one source database (i.e. a corporate rollup), or for the remote Replication Server when multiple destinations exist for the same source system. The problem is that it is a single pool and the blocks (if you will) are for a single connection each. Consequently, even with 60 blocks available for caching, if replicating to 5 different destinations, only 12 blocks of cache will be available for each destination's SQM (assuming each is experiencing the same performance traits). Note that similar to exec_sqm_write_request_limit, in RS 12.6 ESD #7 and RS 15.0 ESD #1 the limit for md_sqm_write_request_limit was raised from 983040 (60 blocks) to 2GB (the recommendation is 2-4MB).
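Since RS 12.1, this can be set with a normal alter connection – remembering that it is applied to the source (primary) connection, not the destinations. A minimal sketch, assuming a primary connection named PDS.pdb and a server at or beyond the 12.6 ESD #7 / 15.0 ESD #1 limits:

suspend connection to PDS.pdb
go
-- 2MB of MD write-request cache for this source connection's DIST
alter connection to PDS.pdb set md_sqm_write_request_limit to '2097152'
go
resume connection to PDS.pdb
go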

Prior to RS 12.1, the only visibility into this memory was via the admin statistics, md command as illustrated below:

admin statistics, md
Source     Pending_Messages  Memory_Currently_Used
---------  ----------------  ---------------------
SYDNEY_DS  0                 0
TOKYO_DS   0                 0
TOKYO_DS   0                 0

Messages_Delivered  SQM_Writes  Destinations_Delivered_To
------------------  ----------  -------------------------
34                  34          34
551                 551         551
1452                1452        1452

Max_Memory_Hit  Is_RSI_Source?
--------------  --------------
0               0
0               0
0               0

Each of these values are described below:

Column Meaning

Source The Replication Server or data server where the message originated.

Pending_Messages The number of messages sent to the SQM without acknowledgment. Usually, this occurs because Replication Server is processing the messages before writing them to disk.

Memory_Currently_Used Memory used by pending messages.

Messages_Delivered Number of messages delivered.

SQM_Writes Number of messages received and processed.

Destinations_Delivered_To Total number of destinations.

Max_Memory_Hit Not yet implemented.

Is_RSI_Source? Indicates whether the current Replication Server can send messages: 0 - This Replication Server cannot send messages 1 - This Replication Server can send messages

Beyond tuning md_sqm_write_request_limit and sts_cachesize, not much tuning is needed. Frequently, customers have noted that when the inbound queue experiences a backlog, once the SQT cache is resized the inbound queue drains quite dramatically – at a rate exceeding 8GB/hr. This is a testament to the performance and efficiency of the DIST thread.

DIST Performance and Tuning

Within each of the Distributor module discussions above, we covered tuning issues specific to that module. Overall, to monitor the performance or throughput of the Distributor thread, you can use the admin who, dist command

admin who, dist
Spid   State    Info
-----  -------  ---------------------
21     Active   102 SYDNEY_DS.SYDNEY_RSSD
22     Active   106 SYDNEY_DS.pubs2

PrimarySite  Type  Status  PendingCmds  SqtBlocked
-----------  ----  ------  -----------  ----------
102          P     Normal  0            1
106          P     Normal  0            1

Duplicates  TransProcessed  CmdsProcessed  MaintUserCmds
----------  --------------  -------------  -------------
0           715             1430           0
290         1               293            0

NoRepdefCmds  CmdsIgnored  CmdMarkers
------------  -----------  ----------
0             0            0
0             0            1

The meaning for each of the columns is described below.

Column Meaning

PrimarySite The ID of the primary database for the SQT thread.

Type The thread is a physical or logical connection.

Status The thread has a status of "normal" or "ignoring." You should only see “ignoring” during initial startup of the Replication Server.

PendingCmds The number of commands that are pending for the thread. If the number of pending commands is high, then the DIST could be a bottleneck as it is not reading commands from the SQT in a timely manner. The likely culprit is either the STS cache is not large enough and repeated accesses to the RSSD is slowing processing – or the outbound queue is slow, delaying message writes.

SqtBlocked Whether or not the thread is waiting for the SQT. This is the opposite of the above (PendingCmds). This essentially certifies that the DIST is not a cause for performance problems.

Duplicates The number of duplicate commands the thread has seen and dropped. This should stop climbing once the Replication Server has fully recovered and the Status (above) changed from “ignoring” to “normal”.

TransProcessed The number of transactions that have been processed by the thread.

CmdsProcessed The number of commands that have been processed by the thread.

MaintUserCmds The number of commands belonging to the maintenance user. This should be 0 unless the Rep Agent was started with the “send_maint_xacts_to_replicate” option.


NoRepdefCmds The number of commands dropped because no corresponding replication definitions were defined – or in RS 12.6 and higher, it could include commands replicated using database repdefs (MSA) for which no table level repdef exists. In either case, this is an indication that a table/procedure is marked for replication but lacks a replication definition (as table level repdefs should be created even for MSA implementations). If a procedure, this can be a key insight into why there may be database inconsistencies between a primary and replicate system.

CmdsIgnored The number of commands dropped before the status became "normal."

CmdMarkers The number of special markers (rs_marker) that have been processed. Normally only noticed during replication system implementation such as adding a subscription or a new database.

As noted from the above command output, the DIST thread is responsible for matching LTL log rows against existing replication definitions to determine which columns should be ignored, etc. If the replication definition does not exist, it discards the log row at this stage. This is also when request functions are identified. The way this is detected is described in more detail later; however, if you remember from classes you have taken (or from reading the manual), request functions have a replication definition specifying the real primary database, which would not be the current connection processing the logged procedure execution. In any case, a large number of occurrences of NoRepdefCmds can mean one of several things:

• Database replication definition was created (for MSA implementation possibly) for a specific source system, but individual table-level replication definitions were not created (a performance issue)

• A replication definition was mistakenly dropped or never created. In either case, this means that the databases are probably suspect as they are definitely out of synch. Or…

• Tables or procedures were needlessly marked for replication. If this is the case, then a good, cheap performance improvement is to simply unmark the tables or procedure for replication. This will reduce Rep Agent processing, SQM disk i/o, SQT and DIST CPU time.

DIST Thread Monitor Counters

The Distributor thread counters added in RS 12.1 are listed below:

Counter Explanation

CmdsDump Total dump database commands read from an inbound queue by a DIST thread.

CmdsIgnored Total commands ignored by a DIST thread.

CmdsMaintUser Total commands executed by the maintenance user encountered by a DIST thread.

CmdsMarker Total rs_markers placed in an inbound queue. rs_markers are enable replication, activate, validate, and dump markers.

CmdsNoRepdef Total commands encountered by a DIST thread for which no replication definition exists.

CmdsTotal Total commands read from an inbound queue by a DIST thread.

Duplicates Total commands rejected as duplicates by a DIST thread.

RSTicket Total rs_ticket markers processed by a DIST thread.

SREcreate Total SRE creation requests performed by a DIST thread. This counter is incremented for each new SUB.

SREdestroy Total SRE destroy requests performed by a DIST thread. This counter is incremented each time a new SUB is dropped.


SREget Total SRE requests performed by a DIST thread to fetch an SRE row. This counter is incremented each time a DIST thread fetches an rs_subscriptions row from RSSD.

SRErebuild Total SRE rebuild requests performed by a DIST thread.

SREstmtsDelete Total deletes commands encountered by a DIST thread and resolved by SRE.

SREstmtsDiscard Total DIST commands with no subscription resolution that are discarded by a DIST thread. This implies either there is no subscription or the 'where' clause associated with the subscription does not result in row qualification.

SREstmtsInsert Total insert commands encountered by a DIST thread and resolved by SRE.

SREstmtsUpdate Total update commands encountered by a DIST thread and resolved by SRE.

TDbegin Total Begin transaction commands propagated by a DIST thread.

TDclose Total Commit or Rollback commands processed by a DIST thread.

TransProcessed Total transactions read from an inbound queue by a DIST thread.

UpdsRslocater Total updates to RSSD..rs_locater table by a DIST thread. A DIST thread performs an explicit synchronization each time a SUB RCL command is executed.

The counters in RS 15.0 are:

Counter Explanation

CmdsRead Commands read from an inbound queue by a DIST thread.

TransProcessed Transactions read from an inbound queue by a DIST thread.

Duplicates Commands rejected as duplicates by a DIST thread.

CmdsIgnored Commands ignored by a DIST thread while it awaits an enable marker.

CmdsMaintUser Commands executed by the maintenance user encountered by a DIST thread.

CmdsDump Dump database commands read from an inbound queue by a DIST thread.

CmdsMarker rs_markers placed in an inbound queue. rs_markers are enable replication, activate, validate, and dump markers.

CmdsNoRepdef Commands encountered by a DIST thread for which no replication definition exists.

UpdsRslocater Updates to RSSD..rs_locater table by a DIST thread. A DIST thread performs an explicit synchronization each time a SUB RCL command is executed.

SREcreate SRE creation requests performed by a DIST thread. This counter is incremented for each new SUB.

SREdestroy SRE destroy requests performed by a DIST thread. This counter is incremented each time a SUB is dropped.

SREget SRE requests performed by a DIST thread to fetch a SRE object. This counter is incremented each time a DIST thread fetches an SRE object from SRE cache.

SRErebuild SRE rebuild requests performed by a DIST thread.

SREstmtsInsert Insert commands encountered by a DIST thread and resolved by SRE.

SREstmtsUpdate Update commands encountered by a DIST thread and resolved by SRE.


SREstmtsDelete Deletes commands encountered by a DIST thread and resolved by SRE.

SREstmtsDiscard DIST commands with no subscription resolution that are discarded by a DIST thread. This implies either there is no subscription or the 'where' clause associated with the subscription does not result in row qualification.

TDbegin Begin transaction commands propagated by a DIST thread.

TDclose Commit or Rollback commands processed by a DIST thread.

RSTicket rs_ticket markers processed by a DIST thread.

dist_stop_unsupported_cmd dist_stop_unsupported_cmd config parameter.

DISTReadTime The amount of time taken by a Distributor to read a command from SQT cache.

DISTParseTime The amount of time taken by a Distributor to parse commands read from SQT.

As with the other modules, the average, total and max counters have been combined into a single counter with the different columns in rs_statdetail. However, the last two counters are new and can be helpful in determining why a latency might occur between the DIST and the SQT - other than the obvious problem of the SQM outbound slowing things down.

The DIST thread will generally have two sources of problems. The first is that either not enough STS cache was provided, or the sts_full_cache settings are not enabled for rs_objects and rs_columns. The second source (and the most common) is that the outbound queue is not keeping up (or we are writing to too many outbound queues in a fan-out – time to add routes and spread the load a smidgen). Either way, the DIST counters are also fairly handy for finding application problems. Key counters include:

CmdsTotal, CmdsPerSec = CmdsTotal/seconds
TransProcessed, TranPerSec = TransProcessed/seconds
CmdsNoRepdef
UpdsRslocater (again!!!)
SREstmtsInsert, SREstmtsUpdate, SREstmtsDelete
DISTReadTime, DISTParseTime (RS 15.0 only)

Again, the first one helps us identify the rate and compare this back to the SQT and RA modules to see if we are running up to speed. The second set is useful as now we can get a glimpse as to how many transactions vs. just commands are flowing through – which can then be compared to the DSI transaction rate later.

CmdsNoRepdef is a bit interesting. If using RS 12.6 and a database replication definition (MSA) with no table level repdefs, a high value here is to be expected. However, this in itself should also point out that it is ALWAYS a good idea to use repdefs from a performance perspective – even when not necessary (MSA or WS). In all other cases, it points to a table marked for replication for which there is no repdef.
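Even under MSA, adding a table-level replication definition is usually worth it. A hedged sketch using the pubs2 titles table from the minimal-column discussion later in this chapter; the primary connection name PDS.pubs2 is illustrative and the datatypes mirror the underlying base types:

create replication definition titles_rep
with primary at PDS.pubs2
with all tables named 'titles'
(title_id varchar(6), title varchar(80), type char(12), pub_id char(4),
 price money, advance money, total_sales int, notes varchar(200),
 pubdate datetime, contract bit)
primary key (title_id)
replicate minimal columns
go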

This time, there is no real way to control UpdsRslocater – but by reducing everything else, it shouldn't inflict much damage; besides, this is lower than the updates to the OQID – typically less than 1 per second in any case. The next three are useful for learning how many inserts/updates/deletes are flowing through the system. However, these counters are only incremented if using standard table repdefs – a database repdef without table repdefs will cause these to be skipped. This is also a good place to find application-driven problems. For instance, if you see that the number of inserts and deletes are nearly identical, it is possible that either autocorrection is turned on, or the application developers used a delete followed by an insert instead of an update.
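Autocorrection is easy to turn back off once the initial materialization is complete. A sketch using the repdef and connection names from the earlier subscription examples:

-- turn autocorrection off once the subscription is fully materialized
set autocorrection off
for titles_rep
with replicate at SYDNEY_DS.pubs2
go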

The last two are new counters added in RS 15.0 to help track how much time the DIST spends on these activities. Typically, this should be minimal, but if DISTReadTime is high, it may point to a problem with the SQT. After the DIST thread, of course, we have the SQM for the outbound queue(s), which has the same counters as the inbound queue – the only difference is that the DIST does not have a WriteWaits style counter like the RA thread. However, it does have a similar cache configuration – md_sqm_write_request_limit (which replaces the deprecated md_source_memory_pool) – which should be increased to the maximum of 983,040 on pre-12.6 ESD #7 and pre-15.0 ESD #1 servers as well.


DIST Thread Counter Usage

Again, let’s take a look at some of these counters in action using the customer data we’ve been discussing:

Sample Time  CmdsWritten (SQM)  SQMR CmdsRead  DIST CmdsTotal  Cmds/Sec  CmdsNoRepDef  SREstmtsInsert  SREstmtsUpdate  SREstmtsDelete

0:29:33 268,187 587,860 286,280 951 243,481 0 299 0

0:34:34 364,705 947,808 459,313 1,520 393,577 3,757 2 3,753

0:39:37 253,283 318,611 280,677 932 95,050 26,698 9 26,662

0:44:38 266,334 282,958 266,409 882 84,076 25,847 87 25,687

0:49:40 253,684 277,054 250,152 828 83,607 24,250 4 24,238

0:54:43 164,566 194,386 165,375 549 57,013 15,432 3 15,432

0:59:45 376,184 365,435 344,168 1,139 110,949 35,926 14 33,965

1:04:47 450,809 522,844 430,077 1,424 203,934 29,710 4 29,707

1:09:50 326,750 400,065 373,714 1,241 157,915 22,554 469 22,540

1:14:52 325,340 352,656 325,586 1,078 136,768 21,726 7 20,247

1:19:54 317,674 317,683 317,470 1,054 125,408 19,261 44 19,261

This one sample period was actually quite useful, as it illustrated two different problems at this customer site. This will become apparent as we look at these counters.

SQM CmdsWritten vs. DIST CmdsTotal – The best way to identify latency in the SQT DIST pipeline is to compare the DIST.CmdsTotal counter to the SQM.CmdsWritten counter. Note that not exactly all commands will be distributed, so a precise match is likely not possible. However, if instead you tried to compare with SQMR CmdsRead, you would have a negative influence based on the re-scanning of removed transactions (as illustrated above) – plus if there was any latency, you could not compare it to the previous stage. Note that in this case, despite all the rescanning for large transactions, the DIST thread is keeping pace with the SQM Writer. This does not mean that the SQT cache does not need to be resized – it suggests that if any latency is observed, increasing the SQT cache size is not likely to have a significant impact on throughput or reduce the latency as not much exists at this stage.

Cmds/Sec – Much like other derived rate fields, this value is derived by dividing the CmdsTotal by the number of seconds between sample intervals. This value is useful in observing the impact of tuning on the overall processing by the DIST – particularly if adjustments are made to the STS cache (in addition to observing the STS counters as well).

CmdsNoRepDef – Here is where we begin to see the first problem – we have significantly large values for this counter where logically we should expect none. There are two possible causes. First, a database replication definition being used for a standby database implementation via the Multiple Standby Architecture (MSA) method is similar to a Warm Standby implementation in that table-level replication definitions are not required. While not required, table-level replication definitions ought to be used if database consistency (think float datatype problems) and DSI performance are of any consideration. The second possible cause is that the table is marked for replication – or the database is marked for standby replication – but the tables involved at this point don't have corresponding subscriptions. Without subscriptions, and lacking a database repdef/subscription, the DIST has no choice but to discard these statements. However, it does indicate that overall system performance could be improved by not replicating this data in the first place – either by unmarking the tables for replication, using the 'set replication off' command prior to the batch submission (a sketch follows), or some other technique of ensuring that the Replication Agent doesn't process the rows. In this case, it would significantly reduce the workload of the SQM (inbound) and the SQT.
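A sketch of the session-level approach for such batch jobs at the primary ASE (the login needs replication_role; the batch itself is a placeholder):

-- run as a login granted replication_role
set replication off
go
-- ... run the large purge/load/batch job whose rows have no subscribers ...
set replication on
go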

SREstmtsInsert/Update/Delete – This is the first location within the monitor counters where you begin to get a picture of what the source transaction profile looked like – especially when combined with DIST.TransProcessed. In this case, however, a very curious phenomenon was observed that led to the identification of the second problem. If you notice, from the second sample interval on, the inserts and deletes are nearly identical while the number of updates is at noise level. This could be legitimate – for example, when working off of a job queue, new jobs could be added as old jobs are removed. However, this is unlikely. That leaves two other possible explanations. The most likely is that the 'autocorrection' setting has been accidentally left enabled for a replication definition. In that mode, a replicated update is submitted as a delete followed by an insert. The other possibility is that the application itself is doing delete/insert pairs instead of performing an update. While this sounds illogical, earlier versions of some GUI application development tools such as PowerBuilder used to do this by default. The issue is that this not only doubles the workload in Replication Server, which has to process twice the number of commands, but it also causes slower performance at the DSI, as rows are removed not only from the table but also from the indices – and then re-added. At the primary, this workload is not as apparent thanks to user concurrency; with Replication Server by default using a single DSI, this workload delays replication as a whole. It turned out that this indeed was the application logic – and while not a simple fix, rewriting the application to use updates instead would immediately reduce the replication latency.

In addition to the DIST counters, the STS counters and SQM (outbound) counters may also need to be looked at to determine what may be driving DIST thread performance.

Minimal Column Replication

Unfortunately, appending the clause "replicate minimal columns" to replication definitions is often forgotten. A common misconception is that minimal column replication chiefly benefits RS throughput by reducing the amount of space consumed in the inbound (and outbound) queues. While it does reduce the space – and tighter row densities allow more rows to be processed by the SQM/SQT per I/O, which can improve performance – the biggest benefit of minimal column replication is the performance gain from reducing the workload at the replicate DBMS, aiding DSI performance (which is typically where the problem is). While it does not reduce the workload of the DIST thread by much, it can dramatically reduce the workload of the DSI thread because it tremendously reduces the work at the replicate dataserver. Specifically, this workload reduction comes from avoiding unnecessary index maintenance at the replicate, as well as from reducing the contention caused by index maintenance when parallel DSI's are used and the dsi_serialization_method is set to isolation_level_3.
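The clause can also be added to an existing replication definition without recreating it; a minimal sketch, reusing the repdef name from the example later in this section:

-- enable minimal column replication on an existing replication definition
alter replication definition CHINOOK_titles_rd
    replicate minimal columns
go

-- and, should autocorrection ever be needed temporarily, revert with:
alter replication definition CHINOOK_titles_rd
    replicate all columns
go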

To understand the impact of this, you first have to understand what happens normally.

Normal Replication Behavior

Under normal (non-minimal column) replication, the DIST thread does not perform any checking of what columns have been changed for an update statement. As a result, if an update of only 2 columns of a 10 column table occurs, Replication Server constructs a default function string containing an update for all 10 columns of the table, setting the column values equal to the new values with a where clause of the primary key old values. For example, consider the following table (from the pubs2 sample database shipped with Sybase ASE) and associated indexes.

create table titles (
    title_id     tid           not null,
    title        varchar(80)   not null,
    type         char(12)      not null,
    pub_id       char(4)       null,
    price        money         null,
    advance      money         null,
    total_sales  int           null,
    notes        varchar(200)  null,
    pubdate      datetime      not null,
    contract     bit           not null
)

go
create unique clustered index titleidind on titles (title_id)
go
create nonclustered index titleind on titles (title)
go

For further fun, note that the salesdetail table has a trigger that updates the titles.total_sales column:

create trigger totalsales_trig
on salesdetail
for insert, update, delete
as
/* Save processing: return if there are no rows affected */
if @@rowcount = 0
begin
    return

end
/* add all the new values */
/* use isnull: a null value in the titles table means
** "no sales yet" not "sales unknown" */
update titles
    set total_sales = isnull(total_sales, 0) +
        (select sum(qty) from inserted where titles.title_id = inserted.title_id)
    where title_id in (select title_id from inserted)
/* remove all values being deleted or updated */
update titles
    set total_sales = isnull(total_sales, 0) -
        (select sum(qty) from deleted where titles.title_id = deleted.title_id)
    where title_id in (select title_id from deleted)
go

By now some of you may be already seeing the problem. As mentioned previously, for an update statement, RS will generate a full update of every column. Consider a mythical replication definition like:

create replication definition CHINOOK_titles_rd
with primary at CHINOOK.pubs2
with all tables named 'titles'
(
    "title_id"     varchar(6),
    "title"        varchar(80),
    "type"         char(12),
    "pub_id"       char(4),
    "price"        money,
    "advance"      money,
    "total_sales"  int,
    "notes"        varchar(200),
    "pubdate"      datetime,
    "contract"     bit
)
-- Primary key determination based on: Primary Key Definition
primary key ("title_id")
searchable columns ("title_id")

This means the function string (if you were to mimic it by altering the function string) would resemble:

alter function string CHINOOK_titles_rd.rs_update
for rs_sqlserver_function_class
output language
'
update titles
    set title_id    = ?title_id!new?,
        title       = ?title!new?,
        type        = ?type!new?,
        pub_id      = ?pub_id!new?,
        price       = ?price!new?,
        advance     = ?advance!new?,
        total_sales = ?total_sales!new?,
        notes       = ?notes!new?,
        pubdate     = ?pubdate!new?,
        contract    = ?contract!new?
    where title_id = ?title_id!old?
'

The result is rather drastic. The first problem is, of course, that the outbound queue will contain significantly more data than was actually updated – assuming the notes column was filled out. But this is minor compared to what really impacts DSI delivery speed.

For those of you familiar with database server performance issues, any time a row is updated, any index columns included in the update automatically cause the index to be treated as "unsafe" and therefore also needing to be updated. In this example, every time a new order is inserted into the salesdetail table, the corresponding update at the replicate not only updates the entire row – it also performs index maintenance. Worse yet, if ANSI constraints were used, the related foreign key tables would have holdlocks placed on the related rows, increasing the probability of contention.

Clearly, this is not desirable behavior. Unfortunately, it occurs much more often than you would think. Consider:

Aggregate columns – such as the titles example.

Auditing columns – this includes such columns as last_update_user, last_updated_date, etc. – similar to the trigger issue mentioned previously.

Status columns – shipping/order status information for order entry or any workflow system.

Dynamic values – product prices (sale prices, etc.). Consider a regional chain store that wants to replicate price changes to 60+ stores for hundreds of products. Now add in the overhead of changing every column plus the index maintenance – and the associated impact that could have on store operations.

Undoubtedly, there are others you could think of as well.

Minimal Column Replication

When the replication definition includes the “replicate minimal columns” phrase, the behavior is much different. With minimal column replication, only the columns with different before and after images – as well as primary key values – are written to the inbound & consequently outbound queue. Consequently, most of the updates to the titles table would be executing a function string similar to:

alter function string CHINOOK_titles_rd.rs_update
for rs_sqlserver_function_class
output language
'
update titles
    set total_sales = ?total_sales!new?
    where title_id = ?title_id!old?
'

Which more than likely will execute much quicker in high volume environments. An interesting aspect to minimal column replication is what happens if the only columns updated were columns not included in the replication definition. Under normal replication rules, if a column is updated, the rs_update function is processed and sent to the RS. The RepAgent User thread simply strips out any columns not being replicated as part of the normalization process and the resulting functions are generated as appropriate. For example, in the above titles table, let’s assume that the contract column was excluded from the replication definition as in:

create replication definition CHINOOK_titles_rd
with primary at CHINOOK.pubs2
with all tables named 'titles'
(
    "title_id"     varchar(6),
    "title"        varchar(80),
    "type"         char(12),
    "pub_id"       char(4),
    "price"        money,
    "advance"      money,
    "total_sales"  int,
    "notes"        varchar(200),
    "pubdate"      datetime
)
-- Primary key determination based on: Primary Key Definition
primary key ("title_id")
searchable columns ("title_id")

Of course, the full update function string would now be:

alter function string CHINOOK_titles_rd.rs_update
for rs_sqlserver_function_class
output language
'
update titles
    set title_id    = ?title_id!new?,
        title       = ?title!new?,
        type        = ?type!new?,
        pub_id      = ?pub_id!new?,
        price       = ?price!new?,
        advance     = ?advance!new?,
        total_sales = ?total_sales!new?,
        notes       = ?notes!new?,
        pubdate     = ?pubdate!new?
    where title_id = ?title_id!old?
'

Now, consider the following update statement:

update titles set contract = 1 where title_id = "BU1234"

If this statement were executed at the primary, the replicate would receive a full update statement of all columns in the replication definition (excluding the contract column, of course), setting them to the same values they already hold.

As you can guess, under minimal columns this behaves differently. Obviously, if the only column(s) updated were columns excluded from the replication definition, RS would otherwise attempt to generate an empty "set" clause. One option would be for RS to ignore any update for which only non-replicated columns were updated. Instead, what happens is that RS submits an update setting the primary key values to their after-image values – essentially a no-op.

This can be confusing and lead to a quick call to TS demanding an explanation. Before you pick up the phone, one little consideration – what if a custom function string was simply counting the number of updates to a table? By excluding the update from replication whenever only non-replicated columns were updated, those functions would never get invoked. While this is more easily handled today with a cleaner approach using multiple replication definitions, this implementation no doubt dates back to the earliest implementations of RS, in which assurance of replicated transactions held sway over performance (and rightfully so).

Keep in mind that this does impose a number of restrictions:

• Autocorrection cannot be used while minimal column replication is enabled.

• Custom function strings containing columns other than the primary keys may not work properly or may generate errors.

Regarding the first restriction, autocorrection should not normally be on. If left on, performance could be seriously degraded as each update translates into a delete/insert pair. Even if the values haven't changed, this can carry a greater penalty than not using minimal columns, since the index maintenance load could be higher due to first removing the index keys (and any corresponding page shrinkage) and then re-adding them (which could cause splits). Consequently, minimal column replication should be enabled by default, and when autocorrection is necessary due to inconsistencies, the replication definition can be altered to remove minimal column replication (temporarily).

Note that minimal column replication really only applies to updates. In the case of insert statements, all of the values are new and therefore need replication. While the minimal column replication documentation does include comments about both update and delete operations, for most users only the rs_update function will be impacted. For delete statements, this translates to only the primary key values being placed into the outbound queue (vs. the full before image without minimal column replication) – which means any custom function strings (such as auditing) that record the values being deleted into a history table will run into problems. Again, if you are not using custom function strings on the table, minimal column replication will not have a negative impact on RS functionality. If you are using custom function strings, multiple repdefs may alleviate the pain of not being able to use minimal column replication. For example, if you have a Warm Standby and a reporting system and the reporting system uses custom function strings (to perform aggregates), then you may want to use two repdefs for the table(s) in question – one for the Warm Standby, supporting minimal column replication, and one for the reporting server. Note that for Warm Standby, minimal column replication is enabled by default, as is also true of MSA implementations.
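As a hedged sketch of the "temporarily remove minimal columns" sequence just described (the repdef is from this section's examples; the replicate connection name is illustrative):

-- 1. temporarily drop minimal column replication so autocorrection is allowed
alter replication definition CHINOOK_titles_rd replicate all columns
go
-- 2. enable autocorrection against the inconsistent replicate
set autocorrection on for CHINOOK_titles_rd with replicate at REP_DS.pubs2
go
-- ... rematerialize or otherwise repair the data ...
-- 3. turn autocorrection back off and restore minimal column replication
set autocorrection off for CHINOOK_titles_rd with replicate at REP_DS.pubs2
go
alter replication definition CHINOOK_titles_rd replicate minimal columns
go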

Key Concept #14: Unless custom function strings exist for update and delete functions for a specific table, minimal column replication should be considered. By using minimal columns, update operations at the replicate will proceed much quicker by avoiding unnecessary index maintenance and possibly avoiding updates altogether if the only columns updated at the primary are excluded from the replication definition.

Outbound Queue Processing

…must come out. The single biggest bottleneck in the Replication System is the outbound queue processing. As hard as this may be to believe, the main reason for this is that the rate at which transactions are applied at the replicate will often be considerably slower than the rate at which they were originally applied at the primary. While some of this is due to replicate database tuning issues, a considerable part of it is also due to the processing of the outbound queue.

A key point to remember is that when discussing the outbound processing of Replication Server internals, you are discussing threads and queues that belong to the replicate database connection and not the primary.

If you remember from the earlier internals diagram, the outbound processing basically includes the SQM for the outbound queue, the DSI thread group and the RSI thread for replication routes. These are illustrated below, with the exception of the RSI thread.

Figure 34 – Replication Server Internals: Inbound and Outbound Processing

As you can imagine, the outbound queue SQM processing is extremely similar to the SQM processing for an inbound queue – it basically manages stable device space allocation and performs all outbound queue write activity via the dAIO daemon. Consequently, we will begin by looking at the Data Server Interface (DSI) thread group in detail. A closer-in diagram would look like the following:

Figure 35 - Close up of DSI Processing Internals

Many of the concepts illustrated above - DSI SQT processing, transaction grouping, command batching, etc. will be discussed in this section, while the Parallel DSI features will be discussed later. In any case, you can think of the flow through the DSI as having the following stages:

1. Read from Queue (DSI SQM Processing)
2. Sort Transactions (due to multiple sources) (DSI SQT Processing)
3. Group Transactions (DSI Transaction Grouping)
4. Convert to SQL (DSIEXEC Function String Generation)
5. Generate Command Batches for Execution (DSIEXEC Command Batching)
6. Submit SQL to RDB (DSIEXEC Batch Execution)

We will use this list as a starting point to discuss DSI processing. We will look at the most appropriate counters during each section. Because of the number of DSI & DSIEXEC module counters, we will not necessarily look at each one. First, however, it might be a good idea to take a closer walk-through of the DSI/DSIEXEC processing.

1. The DSI thread reads from the outbound queue SQM.

2. As the DSI reads each command, it uses SQT logic to sort the commands into their original transactions and also into commit order (when multiple sources are replicating to a single destination).

3. When the DSI/SQT sees a closed transaction, it determines whether it can group it with already closed transactions it has in cache, according to the transaction grouping rules and the various connection configurations.

4. Once it can't add it to an existing group, it checks to see which of the DSIEXEC's are available and submits the existing transaction group to the DSIEXEC via message queues.

5. The DSIEXEC takes the transaction group commands and converts the structures to SQL statements.

6. As the DSIEXEC converts the transaction group to SQL statements, it attempts to batch the commands into command batches for execution efficiency (similar to multiple statements in an isql script before the 'go').

7. When the batch limit is hit (50 commands) or when the batching is terminated due to batching rules/configuration parameters, the DSIEXEC notifies the DSI that it is ready to submit the first batch.

8. The DSI checks the dsi_serialization_method, and if the serialization method is wait_for_commit, the batch is held until the previous thread is ready to commit. Otherwise, the DSI notifies the DSIEXEC to send the batch to the replicate DBMS for execution.

9. When the first batch is sent to the replicate database, the DSIEXEC notifies the DSI so that the DSI can allow parallel DSI's to work if the dsi_serialization_method is not wait_for_commit (i.e. wait_for_start).

10. The DSIEXEC then processes the results from each of the commands within the command batch. When all the results have been processed, it submits the next command batch until the entire transaction has been submitted (but not yet committed).

11. When all the SQL commands have been submitted, the DSIEXEC notifies the DSI that it is ready to commit via message queue.

12. The DSI checks the commit order and notifies the DSIEXEC’s when they can commit. In addition, if the DSI serialization method is wait_for_commit, it notifies other DSIEXEC’s that they can send their batch.

13. As each DSIEXEC receives commit notification, it sends the commit to the replicate DBMS and notifies the DSI that it has committed and is available for another transaction group.

This is illustrated in the diagram below (showing only the communications between the DSI and one DSIEXEC – the others are implied).

Figure 36 – Logical View of DSI & DSIEXEC Intercommunications

As you can tell, there is quite a bit of back-and-forth communication between the various DSIEXEC's and the DSI thread to ensure proper commit sequencing and to also ensure that the command execution sequencing is maintained. A few items of interest relating to the monitor counters from the above diagram:

Batch Sequencing Time – (Steps #4 through #7) This is the time between when the first command batch is ready (#4, Batch Ready) and when the DSIEXEC receives the Begin Batch message (#5). This gap is used to control when parallel DSI's can start sending their respective SQL batches according to the dsi_serialization_method. For example, if the dsi_serialization_method was 'wait_for_commit' and the bottom thread sent a 'Batch Ready' message, the DSI would not respond with a 'Begin Batch' until it got the 'Commit Ready' (#10) from the top thread. If instead the dsi_serialization_method was 'wait_for_start', the bottom thread would get a 'Begin Batch' response when the top thread sent the 'Batch Began' message (#7).

Commit Sequencing Time – (Steps #9 through #13) This is the time between the 'Commit Ready' (#10) and the 'Commit' (#11) response. Any time lag is likely due to the DSI waiting for a previous thread to respond back with 'Committed' (#13), which means that it has committed successfully. The reason we say it begins at rs_get_threadseq (#9) is that with parallel DSI's, when not using commit control, the rs_threads table is used for serialization – and it is in this step that it occurs (as will be discussed later).

Note that only the first command batch is coordinated with the DSI. Subsequent command batches are simply applied, except in the case of large transactions, in which an rs_get_thread_seq is sent every dsi_large_xact_size commands. Note that in the above diagram, when the thread is ready to commit (rs_get_threadseq returns), the seq number from the rs_get_threadseq is passed to the DSI for comparison. If the seq number is less than expected, the implication is that the previous thread rolled back (due to error or contention) and that this thread needs to roll back as well – in which case step #11 becomes a 'Rollback' command (currently implemented as a disconnect, which causes an implicit rollback).

DSI SQM Processing

Much like the SQT interaction with the inbound queue SQM, the DSI reads from the outbound queue SQM. As far as the SQM itself, it is identical to the inbound queue SQM. While many of the SQM/SQM-R related counters are the same, there is at least one major difference. If you remember from the inbound discussion, the primary goal is to be reading the blocks from cache – using BlocksReadCached as the indicator. While this is a desirable goal for the outbound queue as well, the likelihood is that the latency in executing the SQL at the replicate will result in the cache hit quickly dropping to zero once the DSI SQT cache fills. Consider the following:

Sample Time  SQM.CmdsWritten  SQMR.CmdsRead  BlocksRead  BlocksReadCached  Cache Hit %  SegsActive  Allocated  Deallocated  CacheMemUsed
19:02:07               6              6           2              2              100          1          0           0                0
19:07:08           6,312          6,293         189            189              100          1          3           3            1,792
19:12:10           7,711          7,689         308            307            99.67          1          4           4            3,328
19:17:12           4,075          4,046         185            185              100          1          3           3                0
19:22:13           6,963          6,987         270            269            99.62          1          5           5                0
19:27:14           7,499          7,496         291            291              100          1          4           4          143,104
19:32:16          25,533         18,058         530            401            75.66          3         10           8        2,098,432
19:37:18          48,468         41,405         715              0                0          5         13          11        2,097,920
19:42:19          29,238         42,331         744              0                0          2          9          12        2,098,432
19:47:21          40,042         21,570         405            240            59.25          7         11           6        2,097,920
19:52:22          19,140         22,807         403              0                0          9          8           6        2,098,432
19:57:45          31,727          9,876         266              0                0         15         10           4        2,098,432
20:02:48          93,539         12,270         418              0                0         31         23           7        2,098,432
20:07:49          67,564         18,803         298              0                0         44         17           5        2,098,432
20:12:51          52,751         29,352         470              0                0         50         13           7        2,098,432

As you can see from the above, once the DSI SQT cache fills, the BlocksReadCached quickly hits bottom. This also points out a bit of a fallacy. Earlier we stated that one way to determine the amount of latency was to subtract the Next.Read value from the Last Seg.Block in the admin who, sqm command. For the outbound queue, this does represent a "rough" estimate – what it is lacking is the amount held in the DSI SQT cache. Consequently, the most accurate measurement would be Last Seg.Block – Next.Read + CacheMemUsed. The number of active segments above is a good estimate as well – however, these are not reported in any easily obtained admin who statistics. The First Seg.Block includes segments still allocated simply because they have not been deallocated yet, as well as segments preserved by the save interval – so subtracting First Seg.Block from Last Seg.Block is even more inaccurate than using Next.Read. One aspect to consider is that if there is any latency, then you can be sure that the DSI SQT cache is probably full, which means that the most accurate estimate for latency in the outbound queue is:

Latency = Last Seg.Block – Next.Read + (DSI SQT cache)

If Next.Read is higher than Last Seg.Block, it is very likely that the DSI is caught up or nearly so. But this may explain to some why, when the connection appears to be fully caught up and you suspend the connection, there is suddenly 1MB of backlog in the outbound queue – despite the source being quiescent.
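A rough worked example of the estimate above, with all values invented for illustration (stable queue segments are 1MB, made up of 64 blocks of 16KB each):

-- From admin who, sqm for the outbound queue (illustrative values only):
--   Last Seg.Block = 1250.42        Next.Read = 1244.10.0
--   dsi_sqt_max_cache_size = 2MB (assume it is full, since a backlog exists)
--
-- Backlog still sitting in the queue:
--   (1250 - 1244) segments * 1MB  +  (42 - 10) blocks * 16KB  =  6MB + 512KB
--
-- Estimated latency = queue backlog + DSI SQT cache contents
--                   = 6.5MB + 2MB  =  roughly 8.5MB of undelivered commands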

DSI SQT Processing

If you notice in the internals diagram above, unlike the inbound processing, the outbound processing does not have a separate SQT thread. This is largely due to a very simple reason – transactions in the outbound queue are more than likely already in commit order. For example, if a source database is replicating to a single destination, the inbound SQT effectively sorts the transactions into commit sequence. Since this ordering is not overridden anywhere within the rest of the inbound processing, the outbound queue is automatically in sorted order. This does not change if the primary has multiple replicates, since each replicate will have its own independent outbound queue into which the single DIST thread is writing commit-ordered transactions. The only time this is not true is when multiple primary databases are replicating into the same replicate database – such as corporate rollup topologies. However, even in this latter case, due to MD caching of writes, providing that the transactions are small enough, the SQT will still encounter complete and contiguous transactions from each source system. If the transactions are not contiguous (replicated rows from the various sources interspersed in the stable queue), the SQT will still only have a single transaction per origin in the Open/Closed/Read linked lists, as the transactions are still in commit order relative to their source database.

As a result, the main DSI thread queue manager (normally called the DSI scheduler or DSI-S) simply calls the SQT functions when reading from the outbound queue via the SQM. This lack of workload was the primary driver behind simply including the SQT module logic in the DSI rather than having a separate SQT thread for the outbound queue.

One notable exception to this is the Warm Standby DSI. In a Warm Standby, the WS-DSI threads read straight off the inbound queue – effectively duplicating the sorting process carried out by the SQT thread. If your only connection within the Replication Server is a Warm Standby, you should consider the 'alter logical connection logical_DS.logical_DB set distribution off' command (a sketch appears after the list below). This command shuts down the DIST thread for the logical connection. The DIST is more than just a client of the SQT thread – it actually controls it. During startup, the RS first starts the SQM threads, then the DSI and DIST threads. The DIST in turn starts the appropriate SQT thread. Consequently, by disabling distribution for a logical connection, you not only shut down the DIST thread, but you also shut down the SQT thread. This can save CPU time – especially in pre-12.6 non-SMP RS implementations – by:

• Eliminating CPU consumed by the DIST thread unnecessarily checking for subscriptions, etc.

• Eliminating CPU and memory consumed by the SQT thread in sorting the transactions
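A minimal sketch of disabling distribution for a Warm Standby-only Replication Server, using the logical connection name that appears in the admin who output later in this section (the exact RCL syntax may vary slightly between RS versions):

alter logical connection to LDS.pubs2
    set distribution off
go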

So, with the exception of the SQT cache in a WS DSI thread, if the SQT module is so little used, what is the SQT cache used for by the DSI thread? Remember, the SQT cache contains the actual commands that comprise the transaction – consequently, the SQT cache is where the DSI EXEC threads read the list of commands to generate SQL for and apply to the replicate database. This is illustrated in the above drawing in which the DSI EXEC threads read from the SQT cache “Closed” queue and after applying the SQL, notify the DSI of the success, causing the transaction to be moved to the “Read” queue.

DSI SQT Performance Monitoring

This does not mean that you cannot monitor the SQT processing within the outbound queue processing. If you remember from earlier, the admin who, sqt command reports both the inbound and outbound SQT processing statistics.

admin who, sqt

Spid  State            Info
----  ---------------  ---------------------------
17    Awaiting Wakeup  101:1 TOKYO_DS.TOKYO_RSSD
98    Awaiting Wakeup  103:1 DIST LDS.pubs2
10    Awaiting Wakeup  101 TOKYO_DS.TOKYO_RSSD
0     Awaiting Wakeup  106 SYDNEY_DS.pubs2sb

Closed  Read  Open  Trunc   Removed  Full  SQM Blocked  First Trans  Parsed   SQM Reader  Change Oqids  Detect Orphans
0       0     0     0       0        0     1            0            0        0           0             0
0       0     0     0       0        0     1            0            0        0           0             0
0       0     0     0       0        0     0            0            0        0           0             1
0       0     0     0       0        0     0            0            0        0           0             1

In the above example output, the DSI SQT processing is reported in the last two rows, whose Info column lacks the queue designator (:1 or :0). This can easily be verified by issuing a normal admin who command and comparing the spids (10 and 0 above) with the type of thread reported for those processes in the process list returned by admin who.

From a performance perspective, if you have (hopefully) tuned the Replication Server's sqt_max_cache_size parameter (e.g. to 2-4MB), you may want to adjust the SQT cache for the outbound queue downward or upward depending on the status of the Removed and Full columns in the admin who, sqt output and careful monitoring of the monitor counters. This can (and must) be done on a per-connection basis by setting dsi_sqt_max_cache_size to a number differing from sqt_max_cache_size. In the following sections we will take a look at why you might want to do either.
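A minimal sketch of setting the per-connection value (the connection name and size are illustrative; most dsi_ parameters only take effect after the connection is suspended and resumed):

suspend connection to REP_DS.pubs2
go
alter connection to REP_DS.pubs2
    set dsi_sqt_max_cache_size to '2097152'
go
resume connection to REP_DS.pubs2
go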

dsi_sqt_max_cache_size < sqt_max_cache_size

In most systems, the default dsi_sqt_max_cache_size setting is 0 – which means the DSI inherits the same cache size as the SQT cache limit (sqt_max_cache_size). This is extremely unfortunate, as DBA's tend to over-allocate sqt_max_cache_size – setting it well above the 4-8MB that is likely all that is necessary even in high volume systems. As a result, the DSI-S thread will continuously be trying to fill the available DSI SQT cache from the outbound queue – often at the expense of yielding the CPU to the DSI EXEC. Consequently, in most common systems the default dsi_sqt_max_cache_size causes performance degradation. The proper sizing for dsi_sqt_max_cache_size is likely 1-2MB at most, and it can be more accurately determined for parallel DSI configurations by reviewing the monitor counter information (discussed below).

dsi_sqt_max_cache_size >= sqt_max_cache_size

A notable exception to this is the Warm Standby implementation. As mentioned earlier, in a WS topology, it is the DSI SQT thread that is actually sorting the transactions into commit order. In this case, you will probably want to set the DSI SQT cache equal to the SQT cache – or possibly even higher.

A second exception concerns the use of parallel DSI's. When parallel DSI's are used, the DSI thread can effectively process large amounts of row modifications as the load can be distributed among the several available DSI's. This could result in a situation where the DSI transaction rate is higher than the rate at which rows are read from the outbound queue. In such situations, raising the DSI SQT cache allows the DSI to "read ahead" into the queue and begin preparing transactions before they are needed. This is especially true in high volume replication environments in which the rate of changes requires more than the default number of parallel DSI threads. In fact, consider the default dsi_max_xacts_in_group setting of 20. If the number of parallel DSI's was set to 5, then you would need a dsi_sqt_max_cache_size large enough to hold 100 closed transactions at a minimum, and probably some number of open transactions that the DSI executor could be working on. However, even in these cases, unless the system only experienced short transactions allowing the primary sqt_max_cache_size setting to remain low at 1-2MB, the dsi_sqt_max_cache_size setting for parallel DSI's will still likely be less than sqt_max_cache_size. How to size this will be illustrated in the next section.
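A rough sizing sketch following the guideline used later in this section (roughly two full transaction groups cached per DSI EXEC thread; all figures are illustrative):

-- Assumed configuration (illustrative):
--   parallel DSI threads (dsi_num_threads)  = 5
--   dsi_max_xacts_in_group                  = 20
--   average cached transaction size         = ~2KB   (from the MemUsedAveTran counter)
--
--   cached transactions needed  =  2 * 5 threads * 20 xacts/group  =  200 transactions
--   dsi_sqt_max_cache_size      =  200 * 2KB  =  ~400KB  (round up, e.g. 512KB - 1MB)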

DSI SQT Monitor Counters

Although the DSI SQT is not a separate threaded module, the standard SQT monitor counters apply. These are repeated here; the counters most relevant to the DSI are discussed afterwards.

Counter Explanation

CacheExceeded Total number of times that the sqt_max_cache_size configuration parameter has been exceeded.

CacheMemUsed SQT thread memory use. Each command structure allocated by an SQT thread is freed when its transaction context is removed. For this reason, if no transactions are active in SQT, SQT cache usage is zero.

ClosedTransRmTotal Total transactions removed from the Closed queue.

ClosedTransTotal Total transactions added to the Closed queue.

CmdsAveTran Average number of commands in a transaction scanned by an SQT thread.

CmdsLastTran Total commands in the last transaction completely scanned by an SQT thread.

CmdsMaxTran Maximum number of commands in a transaction scanned by an SQT thread.

CmdsTotal Total commands read from SQM. Commands include XREC_BEGIN, XREC_COMMIT, XREC_CHECKPT.

EmptyTransRmTotal Total empty transactions removed from queues.

MemUsedAveTran Average memory consumed by one transaction.

MemUsedLastTran Total memory consumed by the last completely scanned transaction by an SQT thread.

MemUsedMaxTran Maximum memory consumed by one transaction.

OpenTransRmTotal Total transactions removed from the Open queue.

OpenTransTotal Total transactions added to the Open queue.

ReadTransRmTotal Total transactions removed from the Read queue.

ReadTransTotal Total transactions added to the Read queue.

TransRemoved Total transactions whose constituent messages have been removed from memory. Removal of transactions is most commonly caused by a single transaction exceeding the available cache.

TruncTransRmTotal Total transactions removed from the Truncation queue.

TruncTransTotal Total transactions added to the Truncation queue.

Let's take a look at some of these counters and how they can be used from the outbound queue/DSI perspective.

Counters Performance Indicator

CacheExceeded TransRemoved

Normally, we would associate these values with needing to raise the SQT cache setting (i.e. dsi_sqt_max_cache_size). However, what we are likely to see is that the CacheMemUsed grows until dsi_sqt_max_cache_size is reached – at which point the CacheExceeded will jump to substantially large values. The only transactions likely to be removed will be large transactions too large to fit into the DSI SQT max cache size. Unless this happens frequently due to larger transactions, DBAs should avoid raising the DSI SQT cache as the latency in processing transactions ahead of them will likely result in their being removed in any case.

OpenTransTotal ClosedTransTotal ReadTransTotal

These counters take on a different perspective. Since the transactions are nearly all presorted, these counters may differ until the cache fills. Once the cache fills, these values will be identical, as each group of transactions committed by the DSI makes room for the same number of transactions to be read into the DSI SQT cache.

CacheMemUsed MemUsedAveTran

These counters are the most appropriate ones to use to size dsi_sqt_max_cache_size. Ideally, you want the DSI SQT cache to contain double the dsi_max_xacts_in_group transactions for each DSI EXEC thread. Consequently, for 5 DSIEXECs and the default of 20 dsi_max_xacts_in_group, you would like to see 2 * 5 DSIs * 20 xacts/group, or 200 transactions. The number of cached transactions can be derived by dividing CacheMemUsed by MemUsedAveTran. If divided by dsi_max_xacts_in_group, this tells you how many transaction groups could be in cache at most (excluding partitioning rules, different origins, etc.). If we have 200 or more transactions in cache, raising dsi_sqt_max_cache_size is likely of no benefit.

CmdsAveTran This is useful for helping to size dsi_max_xacts_in_group when using parallel DSI’s. If the number of commands per transaction is fairly high, large transaction groups only will compound any contention between the parallel DSI’s.

Let’s take a look at how these might work by looking at the earlier insert stress test.

Sample Time  CacheMemUsed  ClosedTransTotal  MemUsedAveTran  CachedTrans  CacheExceeded  TransRemoved  ReadTransTotal  DSI.TransTotal  DSI.NgTransTotal  DSIXactInGrp  MaxCachedGroups
11:37:47              0           0                0              0             0             0              0               0                0              0.0             0.0
11:37:57      2,097,408          75           10,729            195             1             0             54              21               58              2.7            72.2
11:38:08      2,099,712         289           12,223            171            47             0            296              62              287              4.6            37.1
11:38:19      2,099,200         327           12,223            171            54             0            331              68              322              4.7            36.3
11:38:30      2,097,920         347           12,223            171            42             0            339              75              334              4.4            38.8
11:38:41      2,098,432         319           12,223            171            56             0            311              67              315              4.7            36.3
11:38:52      2,101,504         345           12,223            171            64             0            336              64              310              4.8            35.6
11:39:03      2,100,224         319           12,223            171            61             0            333              68              319              4.6            37.1
11:39:14      2,099,968         345           12,223            171            61             0            326              68              316              4.6            37.1
11:39:25      2,100,224         295           12,223            171            45             0            307              67              291              4.3            39.7

To evaluate this, it helps to know that there were 10 parallel DSI's; dsi_xact_group_size was set to 262,144; dsi_max_xacts_in_group was set to 20; and dsi_sqt_max_cache_size was set to 2,097,152. Again, the derived statistics in the above table (CachedTrans, DSIXactInGrp and MaxCachedGroups) are calculated rather than reported directly by the counters. Let's take a look at what these counters are telling us.

CacheMemUsed, CacheExceeded & TransRemoved – As you can see from the above, as soon as transactions arrive, the DSI SQT cache was quickly filled by the DSI-S – filled in about 10 seconds. From that point, as long as there were transactions in the queue to be delivered, the cache remained full and the cache was “exceeded” frequently. However, notice that there were 0 transactions removed – implying that this 2MB DSI SQT cache is likely oversized or is correctly sized.

ClosedTransTotal & ReadTransTotal – During the first period of activity when the cache was filled (CacheExceeded=1), we see that the DSI SQT cache had 75 "Closed" transactions and only 54 "Read" transactions – demonstrating that the DSIEXEC's were lagging right from the start. However, as the cache became full, new transactions could only be read from the queue into the SQT cache at the same rate that the DSIEXEC's could deliver them – resulting in the situation we described before in which Closed ≈ Read. When looking at these numbers, you also need to realize that the Closed and Read transaction counts are taken over the full sample period, so these values do not reflect the number of transactions in cache – but rather the number of transactions that are in cache plus the number of transactions that have been moved to the next stage of the cache (Open → Closed → Read → Truncate). For example, let's say we were delivering transactions at a rate of one per second – if the cache quickly filled with 50 transactions, then each second one would be moved from Closed to Read, making room for one more – and at the end of the 10 second sample interval we would show a total of 60 transactions having been "Closed" – the original 50 plus 10 due to processing.

CachedTrans – The actual number of transactions in the cache can be roughly derived by dividing the CacheMemUsed by the MemUsedAveTran. This is the first indication that the DSI SQT cache is possibly oversized from the system performance perspective as we see about 170 transactions in the cache on a regular basis but the DSIEXEC’s are only processing ~30 transactions per second (loosely extrapolating from the NgTransTotal over the time period – NgTransTotal to be discussed later – but it represents the number of original transactions prior to the DSI-S grouping them together). However, the cache may be undersized according to our desired target! With 10 DSIEXEC’s active and a dsi_max_xacts_in_group of 20, we would need 200 cached transactions to meet the full need.

DSIXactInGrp – This is the effective dsi_max_xacts_in_group derived by dividing the number of “ungrouped” transactions as submitted by the source system by the number of transaction groups that the DSI-S created. As you can see, we are not getting anything close to our desired setting of 20 – likely some other DSI configuration value is affecting this.

MaxCachedGroups – This metric is derived by dividing the CachedTrans by the number of transactions being grouped (DSIXactInGrp), which yields the number of transaction groups at the current grouping that are in the DSI SQT cache. If we were getting our maximum dsi_max_xacts_in_group, this would be a good indication that our SQT cache is oversized, as we have nearly twice the number of transaction groups in memory as our effective dsi_max_xacts_in_group.

However, since we are only averaging about 4 transactions per group, if we succeed in raising this effective value to even 10 (half of the target dsi_max_xacts_in_group), the number of cached groups drops to 17 (still higher than dsi_num_threads=10, though) – and if we reach our target of 20, the number of cached groups would be between 8 and 9.

So, the DSI SQT cache is slightly undersized for the target performance, but is oversized for the way the system is actually performing – consequently it is some other setting that is restricting processing. Now, let's take a look at the customer example we were looking at earlier:

Sample Time  CacheMemUsed  ClosedTransTotal  MemUsedAveTran  CachedTrans  CacheExceeded  TransRemoved  ReadTransTotal  DSI.TransTotal  DSI.NgTransTotal  DSIXactInGrp  MaxCachedGroups
19:02:07              0           2            1,142              0             0             0              2               2                2              1.0             0.0
19:07:08          1,792       1,574            2,109              0             0             0          1,574           1,574            1,574              1.0             0.0
19:12:10          3,328       1,922            2,477              1             0             0          1,926           1,920            1,920              1.0             1.0
19:17:12              0       1,012            2,483              0             0             0          1,030           1,030            1,030              1.0             0.0
19:22:13              0       1,747            2,493              0             0             0          1,747           1,746            1,746              1.0             0.0
19:27:14        143,104       1,906            2,490             57             0             0          1,881           1,873            1,873              1.0            57.0
19:32:16      2,098,432       4,530            2,273            923         2,413             0          3,922           3,899            3,899              1.0           923.0
19:37:18      2,097,920      10,379            1,579          1,328        17,820             0         10,385          10,348           10,348              1.0         1,328.0
19:42:19      2,098,432      10,605            1,561          1,344        19,378             0         10,599          10,578           10,578              1.0         1,344.0
19:47:21      2,097,920       5,400            1,573          1,333         3,069             0          5,442           5,430            5,430              1.0         1,333.0

Then the next day, it looks like the following:

Sample Time  CacheMemUsed  ClosedTransTotal  MemUsedAveTran  CachedTrans  CacheExceeded  TransRemoved  ReadTransTotal  DSI.TransTotal  DSI.NgTransTotal  DSIXactInGrp  MaxCachedGroups
19:18:31              0           3            1,123              0             0             0              3               3                3              1.0             0.0
19:23:32      1,725,696       2,023            2,179            791            65             0          1,708             148            1,702             11.5            68.7
19:28:34      1,023,232       1,738            2,468            414            84             0          1,860             115            1,849             16.0            25.8
19:33:36      1,166,592       1,081            2,478            470             2             0          1,060              69            1,034             14.9            31.5
19:38:38      2,098,432       1,598            2,482            845           102             0          1,417             101            1,405             13.9            60.7
19:43:40      2,098,432       3,760            2,481            845           357             0          3,748             187            3,740             20.0            42.2
19:48:42      2,098,944       5,800            1,760          1,192           480             0          5,574             276            5,520             20.0            59.6
19:53:44      2,098,432      13,120            1,567          1,339         1,120             0         13,100             652           13,040             20.0            66.9
19:58:46      2,097,408      11,547            1,573          1,333           996             0         11,580             579           11,580             20.0            66.6
20:03:48      2,097,664       6,593            1,844          1,137           456             0          6,772             339            6,780             20.0            56.8

Ouch!!! In the first sample (day 1), we can see we aren't doing any transaction grouping whatsoever – DSI.NgTransTotal ≈ DSI.TransTotal – despite the fact that dsi_max_xacts_in_group=20 and dsi_xact_group_size=65,536 (default), which should allow grouping. As a result, any DSI SQT cache above the bare minimum is excessive. In the second sample (day 2), however, we can see we are grouping transactions – so perhaps the configuration was changed, or the transaction profile differs enough to change how transactions are grouped.

But rather than reducing the DSI SQT cache, we should probably start by figuring out why transaction grouping was not happening in the first sample – as well as see whether we can't increase the transaction rate to something above 33 transactions per second (~10,000 xacts/5 min). The last may seem like a strange comment (how could we know this is attainable?) – but consider that the insert stress test target system above was a laptop processing 30 transactions per second (and then barely working), while the customer system is likely a server of considerably more capacity.

Now, let's take a look at what is probably a more normal sample, one that illustrates the point we were making earlier about the SQT cache and DSI cache being oversized. This sample comes to us courtesy of an RS 12.1 customer – who unfortunately was only collecting a few modules of their RS 12.1 system, and RS 12.1 lacks some of the more granular details around the SQT Open, Closed, Read and Truncate lists.

Sample Time  Source SQM CmdsWritten  Source SQT CmdsTotal  Source CmdsMaxTran  Source SQT CacheExceeded  Source SQT CacheMemUsed  Source SQT TransRemoved  Dest SQM CmdsWritten  Dest SQM CmdsRead  DSI CmdsRead  DSI SQT Cache
21:40:46             5,524                 5,524                  19                    0                     57,088                      0                    5,510               7,776            7,866        13,632,512
21:42:47             7,868                 7,867                  19                    0                     59,648                      0                    7,866               8,225            8,180        13,632,000
21:44:48             5,797                 5,795                  19                    0                     59,648                      0                    5,795              14,008           13,999        13,632,256
21:46:49               324                   324                  19                    0                          0                      0                      342              18,962           18,794        13,632,000
21:48:50                 1                     0                   0                    0                          0                      0                        0              18,615           18,205        13,632,256
21:50:50                 2                     0                   0                    0                          0                      0                        0              27,125           26,564        13,632,512
21:52:51                 2                     0                   0                    0                          0                      0                        0               8,684           18,078                 0
21:54:52                 3                     0                   0                    0                          0                      0                        0                   0                0                 0
21:56:53                 0                     0                   0                    0                          0                      0                        0                   0                0                 0
22:02:21                 6                     3                   3                    0                          0                      0                        3                   3                3                 0
22:04:22                 0                     0                   0                    0                          0                      0                        0                   0                0                 0
22:06:22               844                   842                 132                    0                    531,200                      0                      747                 747              741                 0
22:08:23             3,192                 3,191                 104                    0                    481,024                      0                    3,187               3,187            2,873           638,720
22:10:24             8,688                 8,683                 105                    0                    172,288                      0                    8,744               8,744            5,359         8,424,960
22:12:25             9,411                 9,407                 105                    0                    406,784                      0                    9,357               6,873            4,298        13,632,256
22:14:26             1,366                 1,364                 106                    0                     40,192                      0                    1,442               3,837            4,326        12,682,240
22:16:26             3,075                 2,869                 105                    0                    442,112                      0                    2,999               2,999            3,516        13,632,768
22:18:27             6,845                     0                   0                    0                    442,112                      0                    6,871               6,322            3,664        13,632,768

In the above system, the sqt_max_cache_size was raised from 10MB to 13MB to attempt to get better throughput. The problem was the SQT was never using more than about 500KB of cache! Now, that doesn’t mean only 500KB is necessary – it means that setting it higher actually wouldn’t help. In fact, as you can see all it did was allow the DSI-S to fill up 13MB of cache waiting for the DSIEXEC to catch up. The real problem is the latency at the DSIEXEC in delivering and executing the SQL at the replicate DBMS – as can be seen by the lag between the destination SQM.CmdsWritten or SQM.CmdsRead and DSI.CmdsRead. Likely, the same throughput could be achieved by setting sqt_max_cache_size to 4MB and dsi_sqt_max_cache_size to 2MB.
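As a hedged sketch of the tuning suggested in the last sentence (the connection name is illustrative):

-- lower the server-wide (inbound) SQT cache back to a sane value
configure replication server
    set sqt_max_cache_size to '4194304'
go

-- and give the DSI its own, smaller SQT cache
suspend connection to REP_DS.repdb
go
alter connection to REP_DS.repdb
    set dsi_sqt_max_cache_size to '2097152'
go
resume connection to REP_DS.repdb
go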

DSI Transaction Grouping

Why Group Transactions

One function of the main DSI thread is to group multiple independent transactions from the primary into a single transaction group at the replicate. Consider the following illustration of the difference between the primary database transaction and the DSI transaction grouping:

Primary Database Transactions:

begin tran order_tran
insert into orders values (…)
insert into order_items values (…)
insert into order_items values (…)
update orders set total=…
commit tran order_tran
begin tran ship_tran
insert into ship_history values (…)
update orders set status=…
commit tran ship_tran
begin tran order_tran
insert into orders values (…)
insert into order_items values (…)
insert into order_items values (…)
update orders set total=…
commit tran order_tran
begin tran order_tran
insert into orders values (…)
insert into order_items values (…)
insert into order_items values (…)
update orders set total=…
commit tran order_tran

DSI Transaction Grouping:

begin tran
insert into orders values (…)
insert into order_items values (…)
insert into order_items values (…)
update orders set total=…
insert into ship_history values (…)
update orders set status=…
insert into orders values (…)
insert into order_items values (…)
insert into order_items values (…)
update orders set total=…
insert into orders values (…)
insert into order_items values (…)
insert into order_items values (…)
update orders set total=…
commit tran

Figure 37 – Primary vs. Replicate Transaction Nesting Impact of DSI Transaction Grouping

In the second (grouped) example, Replication Server's DSI thread has consolidated the individual transactions into a single enclosing transaction (the outer begin/commit pair), grouping the transactions together. The obvious question is "Why bother doing this?" The answer, simply, is to decrease the amount of logging imposed on the replicate system by replication and to improve the transaction delivery rate. Consider the worst-case scenario of several atomic transactions such as:

insert into checking_acct values (123456789,000001,"Sep 1 2000 14:20:36.321",$125.00,Chk,101)
insert into checking_acct values (123456789,000002,"Sep 1 2000 14:20:36.322",$250.00,Chk,102)
insert into checking_acct values (123456789,000003,"Sep 1 2000 14:20:36.323",$395.00,Chk,103)
insert into checking_acct values (123456789,000004,"Sep 1 2000 14:20:36.324",$12.00,Chk,104)
insert into checking_acct values (123456789,000005,"Sep 1 2000 14:20:36.325",$99.00,Chk,105)
insert into checking_acct values (123456789,000006,"Sep 1 2000 14:20:36.326",$5.32,Chk,106)
insert into checking_acct values (123456789,000007,"Sep 1 2000 14:20:36.327",$119.00,Chk,107)
insert into checking_acct values (123456789,000008,"Sep 1 2000 14:20:36.328",$1132.00,Chk,108)

As you notice, these fictitious transactions all were applied during an extremely small window of time. Now the question is, without transaction grouping, what would Replication Server do? The answer is, each of the above would get turned into separate individual transactions and submitted as follows (RS functions listed vs. SQL):

rs_begin
rs_insert – insert for check 101
rs_commit
rs_begin
rs_insert – insert for check 102
rs_commit
rs_begin
rs_insert – insert for check 103
rs_commit
rs_begin
rs_insert – insert for check 104
rs_commit
rs_begin
rs_insert – insert for check 105
rs_commit
rs_begin
rs_insert – insert for check 106

rs_commit
rs_begin
rs_insert – insert for check 107
rs_commit
rs_begin
rs_insert – insert for check 108
rs_commit

This does not look that bad until you realize two very interesting facts: 1) the contents of the rs_commit function; and 2) how rs_commit is sent as compared to other functions. In regard to the former, rs_commit calls a stored procedure, rs_update_lastcommit, which updates the corresponding row in the replication system table rs_lastcommit. As for the second point, while this will be discussed in more detail in the next section, Replication Server does not batch the outer commit statements with the transaction batch, even when batching is enabled. Consequently, the replicate database would actually be executing something similar to:

begin tran
insert into checking_acct (…,101)    -- wait for success
update rs_lastcommit …
commit transaction                   -- wait for success
begin tran
insert into checking_acct (…,102)    -- wait for success
update rs_lastcommit …
commit transaction                   -- wait for success
begin tran
insert into checking_acct (…,103)    -- wait for success
update rs_lastcommit …
commit transaction                   -- wait for success
begin tran
insert into checking_acct (…,104)    -- wait for success
update rs_lastcommit …
commit transaction                   -- wait for success
begin tran
insert into checking_acct (…,105)    -- wait for success
update rs_lastcommit …
commit transaction                   -- wait for success
begin tran
insert into checking_acct (…,106)    -- wait for success
update rs_lastcommit …
commit transaction                   -- wait for success
begin tran
insert into checking_acct (…,107)    -- wait for success
update rs_lastcommit …
commit transaction                   -- wait for success
begin tran
insert into checking_acct (…,108)    -- wait for success
update rs_lastcommit …
commit transaction                   -- wait for success

Why is this a problem? First, the amount of I/O has clearly doubled. Consequently, if the replicate system was already experiencing I/O problems, this would add to the problem. Secondly, the delivered transaction rate would not match that at the primary system. Consider each of the following primary database transaction scenarios:

Concurrent User – Concurrent users applied each transaction at the primary. At the replicate, only a single user is applying the transactions. So while the primary system can take full advantage of multiple CPU’s, group commits for the transaction log and every other feature of ASE to improve concurrency, the replicate simply has no concurrency.

Single User/Batch – In this scenario, a single user applies all the transactions at the primary in a large SQL batch. At the replicate, the batching is essentially undone as each of the atomic commits results in 2 network operations per transaction. This could be significant as anyone familiar with the performance penalties of not batching SQL can attest.

Single User/Atomic – A single user performs each of the original inserts using a single atomic transaction per network call. While the replicate might appear to be similar, consider the following. As ASE performs each I/O the user process is put to sleep. As a result, the replicate system – with twice the i/o’s – will spend twice as much time “sleeping”, consequently halving its ability to process transactions.

Simply put, transaction grouping is critical to replication performance – although it can be an issue with parallel or multiple DSI's, as discussed later.

Key Concept #15: Transaction grouping reduces I/O caused by updating replication system tables and the corresponding logging overhead at the replicate system. This also improves throughput as the replication process within the replicate database server spends less time waiting for I/O completion.
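The grouping behavior is bounded by two connection-level parameters; a hedged sketch of adjusting them (connection name and values illustrative; the effective group size is limited by whichever limit is hit first):

suspend connection to REP_DS.repdb
go
alter connection to REP_DS.repdb
    set dsi_max_xacts_in_group to '20'
go
alter connection to REP_DS.repdb
    set dsi_xact_group_size to '262144'
go
resume connection to REP_DS.repdb
go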

While we can see the benefits of this, some may have been quick to notice that the individual transactions "seem" to have gotten lost. Actually, they are still there and are still tracked. One reason for this is that if any individual statement in the above group of transactions fails, the entire group is rolled back and the individual transactions are then resubmitted one at a time until the point of failure is reached (again). So why didn't RS engineering simply submit it as nested transactions? Several reasons:

• The nested commits would have prevented parallel DSI’s from working at all as it would have guaranteed contention on rs_lastcommit

• Not all DBMS’s support nested transactions (e.g., ODBC interfaces to flat files)

• Rolling back a nested transaction is not possible (read the ASE docs carefully – you can roll back to a savepoint, but not a nested transaction – described later in procedure replication).
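To illustrate the savepoint distinction, consider a minimal T-SQL sketch (the table name t1 is hypothetical):

    begin transaction
        insert into t1 values (1)
        save transaction sp1          -- establish a savepoint
        insert into t1 values (2)
        rollback transaction sp1      -- legal: undoes only the work done after the savepoint
        -- there is no way to roll back an "inner" begin tran by itself; a plain rollback
        -- here would undo the entire outermost transaction
    commit transaction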

DSI Transaction Grouping Rules

Unfortunately, not every transaction can be grouped together. A transaction group will end any time one of the following conditions is met:

1. There are no more transactions in the DSI queue.
2. The predefined maximum number of transactions allowed in a group has been reached.
3. The current or the next transaction will make the total size of the transactions (in bytes) exceed the configured group size.
4. The next transaction is from a different origin.
5. The current or the next transaction is on disk.
6. The current or the next transaction is an orphan transaction.
7. The current or the next transaction is a rollback.
8. The current or the next transaction is a subscription (de)materialization transaction marker.
9. The current or the next transaction is a subscription (de)materialization transaction queue end marker.
10. The current or the next transaction is a dump/load transaction.
11. The current or the next transaction is a routing transaction.
12. The current or the next transaction has no begin command (i.e., it is a special RS-to-RS transaction).
13. The next transaction has a different user/password.
14. The first transaction has IGNORE_DUP_M mask on.
15. A transaction partitioning rule determines that the next transaction cannot be grouped with the existing group.
16. A timeout expires.

While this appears to be quite a long list, the rules for grouping can be paraphrased simply: in order for transactions to be grouped together, all of the following six conditions must be met.

1. Transactions cached in the DSI/SQT closed queue.
2. Transactions from the same origin.
3. Transactions will be applied at the replicate with the same username and password.
4. The transaction group size is limited by the lesser of dsi_xact_group_size and dsi_max_xacts_in_group.
5. Aborted, database/log dump, orphan, routing, and subscription transactions cannot be grouped.


6. A transaction partitioning rule determines that the next transaction cannot be grouped with the existing group.

The fourth condition will be discussed in the next section on tuning transaction grouping. The fifth condition is due to system level reprocessing or ensuring integrity of the replicate system during materialization of subscriptions or routes and is rare – consequently not discussed. The last condition will be discussed in the section on parallel DSI’s later in this document. This leaves only the first three conditions that apply to most transactions. While the first condition makes sense simply from a performance aspect, the second condition requires some thought, while the third is fairly easy.

Earlier, one of the conditions which causes transactions not to be grouped was stated as “The next transaction has a different user/password”, which was summarized above that transactions grouped together must use the same user/password combination. Some find this confusing, assuming that it refers to the user who committed the transaction at the primary system. It does not. It refers instead to the user that will apply the transaction at the replicate. At this juncture, many might say “Wait a minute, I thought the maintenance user applies all the transactions?” This is mostly true. During normal operations, the maintenance user will be the login used to apply transactions at the replicate – thereby allowing full transaction grouping capabilities. However, some transactions are not applied by the maintenance user. For example, in Warm Standby systems, DDL transactions that are replicated are executed at the standby system by the same user who executed the DDL at the primary. This assures that the object ownership is identical. Additionally, Asynchronous Request Functions (discussed later) are also applied by the same user as executed at the originating system. In this latter case, it has less to do with the specific user and more to do with ensuring that the transaction is recorded using a different user login than the maintenance user – thereby allowing the changes to be re-replicated back to the originating or other systems without requiring the RepAgent to be configured for “send_maint_xacts_to_replicate”. In short, it should be extremely rare – and possibly not at all – that a transaction group is closed early due to a different user/password.

Now that we understand this, the next question might be “Why can’t we group transactions from different source databases?” The reason that the transactions have to be from the same origin is due to the management of the rs_lastcommit table and how the DSI controls assigning the OQID for the grouped transaction. When the DSI groups transactions together, it uses the last grouped transaction’s begin record to determine the OQID for the grouped transaction. The reason is that on recovery, not using the last transaction’s OQID could result in duplicate row errors or an inconsistent database.

Consider a default grouping of 20 transactions into a single group that are applied to the replicate database server and then immediately the replicate database shuts down. On recovery, as most people are aware, the Replication Server will issue a call to rs_get_lastcommit to determine the last transaction that was applied. Remember, the transactions are grouped in memory – not in the stable queue. Consequently, if the OQID of the first transaction was used, then the first 19 transactions would all be duplicates – and not detected as such by the Replication Server as that was the whole reason for the comparison of the OQID in the first place!! As a result, the first 19 transactions would either cause duplicate key errors (if you are lucky) or database inconsistencies if using function strings. For that reason, when transactions are grouped together, the OQID of the last transaction’s begin record is used for the entire group.
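As a notional illustration of why the last transaction’s OQID is the one recorded (OQID values abbreviated):

    -- transaction group built in memory by the DSI (default grouping of 20):
    --    T1 (OQID …01), T2 (OQID …02), ... , T20 (OQID …14)
    -- rs_lastcommit is updated with T20's OQID (…14); after a crash, rs_get_lastcommit
    -- returns …14 and T1..T20 are all correctly recognized as already applied.
    -- Had T1's OQID (…01) been recorded instead, T2..T20 would be re-delivered on recovery
    -- as apparently "new" work – duplicate rows at best, silent inconsistency at worst.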

Now then, following that logically along, since the rs_commit function updates only a single row in the rs_lastcommit table for the source database of the transaction, all of the transactions grouped together must be from the same source. Note that currently, the DSI does not simply collect all of the closed transactions from the same source. If the third transaction in a series is from a different source database, then the group will end at two – even if the next four transactions are from the same source database as the first two. As you can imagine, with a fragmented queue containing considerable interspersed transactions from different databases, the DSI will be applying transactions in very small groups. As mentioned earlier, the smaller the group size, the less efficient the replication mechanism due to rs_lastcommit and processing overhead, which leads us to the following concept:

Key Concept #16: Outbound queues that are heavily fragmented with interspersed transactions from different source databases will not be able to effectively use transaction grouping.

This may or may not be an issue. As you will see later, if using parallel DSI’s and a low dsi_max_xacts_in_group to control concurrency, this mix of transactions may not be an issue - especially if dsi_serialization_method is set to ‘single_transaction_per_origin’. For non-parallel DSI implementations, it does suggest that increasing dsi_max_xacts_in_group and similar parameters in such cases may prove fruitless.


Tuning DSI Transaction Grouping

Prior to Replication Server 12.0, however, there really wasn’t a good way to control the number of transactions in a batch. The reason was that the only tuning parameter available attempted to control the transaction batching by controlling the transaction batch size in bytes – a difficult task with tables containing variable width columns and considering the varying row sizes of different tables. With version 12.0 came the ability to explicitly specify the number of original transactions that could be grouped into a larger transaction. These connection level configuration parameters are listed below.

Parameter (Default) Explanation

dsi_xact_group_size Default: 65,536; Recommended: 2,147,483,647 (max)

The maximum number of bytes, including stable queue overhead, to place into one grouped transaction. A grouped transaction is multiple transactions that the DSI applies as a single transaction. A value of "-1" means no grouping.

dsi_max_xacts_in_group Default: 20; Max: 100; Recommended: see text

Specifies the maximum number of transactions in a group, allowing a larger transaction group size, which may improve data latency at the replicate database. The default value is a good starting point – lower generally should be considered if primarily updates are replicated and using parallel DSI’s and contention is an issue.

dsi_sqt_max_cache_size Default: 0; Recommended: see text

The number of bytes available for managing the SQT open, closed, read and truncate queues. This impacts DSI SQT processing by also being a limiter on the transaction batches that are cached in memory waiting for the DSIEXEC’s. For example, if the DSI SQT cache is too small, the DSIEXEC’s may not be able to group transactions up to the number specified in dsi_max_xacts_in_group.

dsi_partitioning_rule Default: none; Valid Values: origin, origin_sessid, time, user, name, and none

Specifies the partitioning rules (one or more) the DSI uses to partition transactions among available parallel DSI threads. Valid values are: origin, origin_sessid (if source is ASE 12.5.2+), time, user, name and none. This setting will be described in detail in the section on parallel DSI’s.

At first, the dsi_xact_group_size may appear to be fairly large. Remember, however, this includes stable queue overhead – which can be significant as the queue may require 4 times the storage space as the transaction log space. Additionally, it can be a bit difficult controlling the number of transactions with this parameter due to the varying row widths of different database tables, etc. As a result, Sybase added the dsi_max_xacts_in_group parameter and suggests that you set dsi_xact_group_size to the maximum and control transaction grouping using dsi_max_xacts_in_group. If you don’t adjust dsi_xact_group_size, the lesser of the two limits will cause the transaction grouping to terminate.

On the other hand, dsi_max_xacts_in_group can be raised from the default of 20 if using a single DSI – and perhaps should be if the system is performing a lot of small transactions. However, in parallel or multiple DSI situations, this parameter may need to be lowered to reduce inter-thread contention. While this will be discussed later in the section on parallel DSI’s, contention is likely to occur in update-heavy environments, or with inserts at isolation level three due to next-key (range) or infinity locks.

A good starting point for dsi_sqt_max_cache_size is to figure on 500-750KB per DSIEXEC thread in use, with a minimum of 1MB. This may seem like an awfully small amount, but remember from the earlier example that 2MB was enough to cache ~30 transaction groups for one customer. As mentioned though, from this starting point, you will need to monitor the approximate number of transactions and transaction groups in cache and increase dsi_sqt_max_cache_size only when it can no longer hold 2 * dsi_max_xacts_in_group * num_dsi_threads transactions.
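As a sketch of how these grouping parameters might be applied (the connection name RDS.rdb is a placeholder; connection-level parameters typically take effect once the connection is suspended and resumed):

    alter connection to RDS.rdb set dsi_xact_group_size to '2147483647'
    alter connection to RDS.rdb set dsi_max_xacts_in_group to '20'
    alter connection to RDS.rdb set dsi_sqt_max_cache_size to '2097152'
    suspend connection to RDS.rdb
    resume connection to RDS.rdb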

DSI Grouping Monitor Counters

To help determine the efficiency of DSI transaction grouping, the following monitor counters are available.

Counter Explanation

CmdGroups Total transaction groups sent to the target by a DSI thread. A transaction group can contain at most dsi_max_xacts_in_group transactions. This counter is incremented each time a 'begin' for a grouped transaction is executed.


CmdGroupsCommit Total command groups committed successfully by a DSI thread.

CommitsInCmdGroup Total transactions in groups sent by a DSI thread that committed successfully.

GroupsClosedBytes Total transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_xact_group_size.

GroupsClosedLarge Total transaction groups closed by a DSI thread due to the next transaction satisfying the criteria of being large.

GroupsClosedMixedMode Total transaction groups closed by a DSI thread because the current group contains asynchronous stored procedures and the next tran does not or the current group does *not* contain asynchronous stored procedures and the next transaction does.

GroupsClosedMixedUser Total asynchronous stored procedure transaction groups closed by a DSI thread due to the next tran user ID or password being different from the ones for the current group.

GroupsClosedNoneOrig Total trxn groups closed by a DSI due to no open group from the origin of the next transaction (i.e., we have a new origin (source db) in the next trxn), or because the RS scheduler forced a flush of the current group from the origin, leaving no open group from that origin. Note that the latter condition (a scheduler-forced flush) could cause transaction groups to be flushed prior to reaching dsi_max_xacts_in_group – and likely will be the most common cause of groups closed under this metric.

GroupsClosedResume Total transaction groups closed by a DSI thread due to the next transaction following the execution of the 'resume' command - whether 'skip', 'display' or execute option chosen.

GroupsClosedSpecial Total transaction groups closed by a DSI thread due to the next transaction being qualified as special – orphan, rollback, marker, duplicate, ddl, etc.

GroupsClosedTranPartRule Total transaction groups closed by a DSI thread because of a Transaction Partitioning rule.

GroupsClosedTrans Total transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_max_xacts_in_group.

GroupsClosedWSBSpec Total transaction groups closed by a DSI thread for a Warm Standby due to the next transaction being special – empty, an enable replication marker, a subscription materialization marker, ignored due to duplicate detection, etc.

NgTransTotal Total non-grouped transactions read by a DSI Scheduler thread from an outbound queue.

PartitioningWaits Total transaction groups forced to wait for another group to complete (processed serially based on Transaction Partitioning rule).

TransInCmdGroups Total transactions contained in transaction groups sent by a DSI thread. The number of trxns in a group is added to this counter each time a 'begin' for a grouped transaction is executed.

TransSucceeded Total transactions applied successfully to a target database by a DSI thread. This includes transactions that were committed or rolled back successfully.

TransTotal Total transaction groups generated by a DSI Scheduler while reading the outbound queue. This counter is incremented each time a new transaction group is started. If grouping is disabled, this is total transactions in queue.

YieldsScheduler This counter is incremented each time the main DSI Scheduler body yields following the dispatch of closed transaction groups to DSI Executor threads.


In RS 15, the counters change slightly, mainly with the addition of more timing counters:

Counter Explanation

DSIReadTranGroups Transaction groups read by the DSI. If grouping is disabled, grouped and ungrouped transaction counts are the same.

DSIReadTransUngrouped Ungrouped transactions read by the DSI. If grouping is disabled, grouped and ungrouped transaction counts are the same.

DSITranGroupsSucceeded Transaction groups applied successfully to a target database by a DSI thread. This includes transactions that were successfully committed or rolled back according to their final disposition.

DSITransFailed Grouped transactions failed by a DSI thread. Depending on error mapping, some transactions may be written into the exceptions log.

DSITransRetried Grouped transactions retried to a target server by a DSI thread.

DSIAttemptsTranRetry When a command fails due to data server errors, the DSI thread performs post-processing for the failed command. This counter records the number of retry attempts.

DSITranGroupsSent Transaction groups sent to the target by a DSI thread. A transaction group can contain at most dsi_max_xacts_in_group transactions. This counter is incremented each time a 'begin' for a grouped transaction is executed.

DSITransUngroupedSent Transactions contained in transaction groups sent by a DSI thread.

DSITranGroupsCommit Transactions committed successfully by a DSI thread.

DSITransUngroupedCommit Transactions in groups sent by a DSI thread that committed successfully.

DSICmdsSucceed Commands successfully applied to the target database by a DSI.

DSICmdsRead Commands read from an outbound queue by a DSI.

GroupsClosedBytes Transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_xact_group_size.

GroupsClosedNoneOrig Trxn groups closed by a DSI due to no open group from the origin of the next trxn. I.e. We have a new origin in the next trxn, or the Sched forced a flush of the current group from the origin leaving no open group from that origin.

GroupsClosedMixedUser Asynchronous stored procedure transaction groups closed by a DSI thread due to the next tran user ID or password being different from the ones for the current group.

GroupsClosedMixedMode Transaction groups closed by a DSI thread because the current group contains asynchronous stored procedures and the next tran does not or the current group does *not* contain asynchronous stored procedures and the next transaction does.

GroupsClosedTranPartRule Transaction groups closed by a DSI thread because of a Transaction Partitioning rule.

GroupsClosedTrans Transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_max_xacts_in_group.

CmdGroupsRollback Command groups rolled back successfully by a DSI thread.

RollbacksInCmdGroup Transactions in groups sent by a DSI thread that rolled back successfully.

GroupsClosedLarge Transaction groups closed by a DSI thread due to the next transaction satisfying the criteria of being large.


GroupsClosedWSBSpec Transaction groups closed by a DSI thread for a Warm Standby due to the next transaction being special – empty, an enable replication marker, a subscription materialization marker, ignored due to duplicate detection, etc.

GroupsClosedResume Transaction groups closed by a DSI thread due to the next transaction following the execution of the 'resume' command - whether 'skip', 'display' or execute option chosen.

GroupsClosedSpecial Transaction groups closed by a DSI thread due to the next transaction being qualified as special - orphan, rollback, marker, duplicate, ddl, etc.

DSIFindRGrpTime Time spent by the DSI/S finding a group to dispatch.

DSIDisptchRegTime Time spent by the DSI/S dispatching a regular transaction group to a DSI/E.

DSIDisptchLrgTime Time spent by the DSI/S dispatching a large transaction group to a DSI/E. This includes time spent finding a large group to dispatch.

DSIPutToSleep Number of DSI/E threads put to sleep by the DSI/S prior to loading SQT cache. These DSI/E threads have just completed their transaction.

DSIPutToSleepTime Time spent by the DSI/S putting free DSI/E threads to sleep.

DSILoadCacheTime Time spent by the DSI/S loading SQT cache.

Let’s take a look at some of these counters and how they can be used from the outbound queue/DSI perspective, as well as clarify some that appear to be confusing. Other than the SQT aspects, the most commonly used counters in the DSI include (15.0 formulas/names in parentheses):

CmdsRead, TransSucceeded (DSICmdsRead, DSITranGroupsSucceeded)
XactsInGrp = NgTransTotal / TransTotal (DSIReadTransUngrouped / DSIReadTranGroups)
GroupsClosedBytes, GroupsClosedLarge
GroupsClosedNoneOrig, GroupsClosedTrans
GroupsClosedMixedUser, GroupsClosedMixedMode

While there are others, these are the most common. The first set is mostly (again) monitoring-type counters – CmdsRead should match SQM CmdsWritten (for the outbound queue) but likely won’t, as the most frequent source of latency is the DSIEXEC due to the replicate database. XactsInGrp, on the other hand, is clearly tied to configuration settings – specifically dsi_max_xacts_in_group. By comparing the number of ungrouped transactions (NgTransTotal) to the number of grouped transactions (TransTotal) we can observe how much transaction grouping is going on. One of the keys to parallel DSI use is to increase this parameter as much as possible (until contention starts) – at lower settings, it is not likely that many threads will actually be used. Even without parallel DSI, considering the overhead during the commit phase (updating rs_lastcommit, etc.), the more the merrier.

The next sets of counters explain why a group of transactions was closed. The first set points to likely configuration issues. If you see very many GroupsClosedBytes, it is likely because you have not adjusted dsi_xact_group_size from its default of 64K to something more realistic such as 256K. As a result, no matter what you have dsi_max_xacts_in_group set to, a low value here will prevent the grouping. Similarly, the default value for dsi_large_xact_size of 100 is simply too small – and in fact, arguably large transactions are not effective in any case, so you should set this to the upper limit of 2 billion and forget about it.

GroupsClosedNoneOrig and GroupsClosedTrans will be the most common causes, so they can be ignored if the system is tuned properly. The first – while it may indicate that the next transaction is from a different origin (corporate rollup) – most often indicates that the scheduler forced a flush. The second is incremented whenever a group is closed due to reaching dsi_max_xacts_in_group. A lot of these may indicate that dsi_max_xacts_in_group is too low (the default of 20 is typically plenty, but someone may have decreased it). However, if the next set appears, it may provide a reason why, even though you have a well-defined dsi_max_xacts_in_group, it isn’t being used. The first (GroupsClosedMixedUser) happens whenever the DSI has to connect as another user rather than the maintenance user – typically DDL commands. The second (GroupsClosedMixedMode) refers to asynchronous request functions. There are other ‘GroupsClosed’ counters, but the point is to avoid GroupsClosedBytes, and if GroupsClosedNoneOrig or GroupsClosedTrans are not where expected, you may have to look to the others for the explanation.


Let’s take a look at how these might work by looking at the earlier insert stress test.

Sample     CacheMem    Cached  DSIXact  DSICmd  TransIn    Groups       Groups       Groups       Groups      Groups        Yields
Time       Used        Trans   InGrp    Groups  CmdGroups  ClosedBytes  ClosedTrans  ClosedLarge  ClosedOrig  ClosedResume  Scheduler
11:37:47   0           0       0.0      0       0          0.0          0.0          0.0          0.0         0.0           0
11:37:57   2,097,408   195     2.7      17      51         0.0          0.0          0.0          104.8       0.0           103
11:38:08   2,099,712   171     4.6      63      289        0.0          4.8          0.0          93.5        0.0           389
11:38:19   2,099,200   171     4.7      68      322        0.0          1.5          0.0          98.5        0.0           418
11:38:30   2,097,920   171     4.4      75      334        0.0          6.7          0.0          93.3        0.0           433
11:38:41   2,098,432   171     4.7      67      315        0.0          3.0          0.0          98.5        0.0           414
11:38:52   2,101,504   171     4.8      64      310        0.0          1.6          0.0          100.0       0.0           436
11:39:03   2,100,224   171     4.6      68      319        0.0          2.9          0.0          97.1        0.0           416
11:39:14   2,099,968   171     4.6      68      316        0.0          2.9          0.0          97.1        0.0           421
11:39:25   2,100,224   171     4.3      67      291        0.0          6.0          0.0          95.5        0.0           396

The only derived columns above are the same as in the previous example from the SQT – in fact the first four columns are repeated, partially to put some of the others in context. As you may remember, dsi_max_xacts_in_group was 20 – and we are hoping to determine (if we can) why the actual value is more in the 4-5 range than close to 20. While there are additional DSI metrics for GroupsClosed______ not listed above, some of the more common reasons are listed in the above table. Note especially that the GroupsClosed______ metrics are presented as a percentage (of 100%) and not as the actual values (the rationale is that it is easier to recognize the primary reasons this way).

CachedTrans & DSIXactInGrp – Repeated from the DSI SQT cache metrics, these derived values are calculations of the number of transactions in the DSI SQT cache (based on average memory used per transaction) and the average number of transactions grouped together by the DSI thread respectively.

DSI.CmdGroups & TransInCmdGroups – These metrics report the actual number of transaction groups sent by the DSI to the DSI EXEC – and operate very similarly to the metrics DSI.TransTotal and NgTransTotal. Slight differences may occur, however, as variable substitution may cause the original grouping to exceed the byte limit on the transaction group. One way to think of the differences between TransTotal/NgTransTotal and CmdGroups/TransInCmdGroups is that TransTotal/NgTransTotal represents the planned transaction grouping whereas CmdGroups/TransInCmdGroups represents the actual. To that extent, DSIXactInGrp (a derived statistic based on dividing NgTransTotal by TransTotal) represents a planned transaction grouping ratio vs. actual – while the actual may deviate slightly, it is well within a margin of error.
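As a quick worked example, using the 11:38:08 sample from the table above and the actual counters (CmdGroups and TransInCmdGroups):

    XactsInGrp ≈ TransInCmdGroups / CmdGroups = 289 / 63 ≈ 4.6 transactions per group

which matches the DSIXactInGrp column – well short of the configured dsi_max_xacts_in_group of 20.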

GroupsClosedBytes – This counter is incremented any time the transaction group is closed because the number of bytes in the transaction group exceed dsi_xact_group_size. In the case above, the dsi_xact_group_size was 262,144 (256KB) – which although much smaller than the suggested maximum setting, did not contribute to the reason the transaction grouping was less than desired.

GroupsClosedTrans – Similar to above, this counter is incremented anytime a group is closed due to the number of transactions exceeding dsi_max_xacts_in_group. Interestingly, we see that 1.5-7% of the groups reached the maximum of 20 – so despite the computed average of 4 transactions per group, there are some (few) that do reach the maximum and likely many in between.

GroupsClosedLarge – This counter is incremented any time a group of transactions is closed due to the fact that the next transaction is considered large – either because it exceeds dsi_large_xact_size or because it involves text/image data (which automatically qualifies it as a large transaction).

GroupsClosedOrig – This counter is incremented any time a group is closed because the next transaction to be delivered to the destination comes from a different source database (think corp rollup). In addition – and a more common cause in WS systems - this counter is incremented when the DSI-S can’t find an open transaction group from the same origin – a situation usually caused when the scheduler forces the DSI to close pending transaction groups and send them to the DSIEXEC’s. That is the case here as the system in


question was a WS implementation in isolation – so no other connection existed to cause this counter to be incremented. We just need to determine what is driving the scheduler…

GroupsClosedResume – This counter is incremented any time a group is closed due to the next transaction following a resume command. The reason for this is that often times a transaction group needs to be rolled back and applied as individual transactions up to the point of error – and then the DSI is suspended. As a result, when the DSI is resumed, the DSI rebuilds transaction groups from that point.

YieldsScheduler – This metric is illustrated here to show how often the DSI is yielding after a group has been submitted to a DSI EXEC. However, we see that the number of yields is 4-6x the number of transaction groups, which suggests that the DSI was repeatedly checking to see if the DSI EXEC was finished with the current group and ready for the next.

From the above, it looks like the scheduler is closing transaction groups prior to reaching dsi_max_xacts_in_group – but otherwise there is no real indication of what the cause may be. Perhaps other DSI or DSI EXEC counters will help us learn why the scheduler is doing this – but we will look at them later. For now, let’s take a look at the customer examples from the two different days.

The first day’s counter values for DSI grouping are illustrated below:

Sample     CacheMem    Cached  DSIXact  DSICmd  TransIn    Groups       Groups       Groups       Groups      Groups        Yields
Time       Used        Trans   InGrp    Groups  CmdGroups  ClosedBytes  ClosedTrans  ClosedLarge  ClosedOrig  ClosedResume  Scheduler
19:02:07   0           0       1.0      2       2          0.0          0.0          0.0          0.0         100.0         6
19:07:08   1,792       0       1.0      1,574   1,574      0.0          0.0          0.0          0.0         100.0         1,951
19:12:10   3,328       1       1.0      1,920   1,920      0.0          0.0          0.0          0.0         100.1         2,528
19:17:12   0           0       1.0      1,030   1,030      0.0          0.0          0.0          0.0         98.3          1,397
19:22:13   0           0       1.0      1,746   1,746      0.0          0.0          0.0          0.0         100.1         2,281
19:27:14   143,104     57      1.0      1,873   1,873      0.0          0.0          0.0          0.0         100.1         2,452
19:32:16   2,098,432   923     1.0      3,899   3,899      0.0          0.0          0.0          0.0         115.6         6,055
19:37:18   2,097,920   1,328   1.0      10,348  10,348     0.0          0.0          0.0          0.0         99.9          21,396
19:42:19   2,098,432   1,344   1.0      10,578  10,578     0.0          0.0          0.0          0.0         100.1         21,794
19:47:21   2,097,920   1,333   1.0      5,430   5,430      0.0          0.0          0.0          0.0         99.3          8,281

Almost instantly we see that most of the transaction groups were closed because the next transaction followed a ‘resume’ command – rather odd, and suggestive of a significant number of errors. Observant readers might have noted that some of these percentages are above 100% – remember, as mentioned earlier, failed transaction groups are automatically retried as individual transactions until the individual transaction with the problem re-occurs. It also could simply be due to calculating the percentage based on DSI.TransTotal vs. DSI.CmdGroups. Note as well that the ratio of YieldsScheduler to transaction groups ranges from slightly more than 1 to 2.

Now, let’s look at the next day:

Sample     CacheMem    Cached  DSIXact  DSICmd  TransIn    Groups       Groups       Groups       Groups      Groups        Yields
Time       Used        Trans   InGrp    Groups  CmdGroups  ClosedBytes  ClosedTrans  ClosedLarge  ClosedOrig  ClosedResume  Scheduler
19:18:31   0           0       1.0      3       3          0.0          0.0          0.0          100.0       0.0           9
19:23:32   1,725,696   791     11.5     148     1,702      0.0          58.8         0.0          52.7        0.0           702
19:28:34   1,023,232   414     16.0     115     1,849      0.0          69.6         0.0          25.2        0.0           638
19:33:36   1,166,592   470     14.9     69      1,034      0.0          69.6         0.0          33.3        0.0           372
19:38:38   2,098,432   845     13.9     101     1,405      0.0          72.3         0.0          36.6        0.0           588
19:43:40   2,098,432   845     20.0     187     3,740      0.0          100.0        0.0          0.0         0.0           1,271
19:48:42   2,098,944   1,192   20.0     276     5,520      0.0          104.3        0.0          0.0         0.0           1,418
19:53:44   2,098,432   1,339   20.0     652     13,040     0.0          100.2        0.0          0.0         0.0           2,952
19:58:46   2,097,408   1,333   20.0     579     11,580     0.0          99.7         0.0          0.0         0.0           2,702
20:03:48   2,097,664   1,137   20.0     339     6,780      0.0          97.3         0.0          0.0         0.0           1,790

Note that in this case, the transaction groups at the beginning are largely closed due to GroupsClosedOrig – likely due to the same scheduler-driven reasons as in the insert test. However, very quickly the reasons shift to GroupsClosedTrans as the DSIXactInGrp climbs and eventually reaches the dsi_max_xacts_in_group of 20.

DSIEXEC Function String Generation

DSI Executer Processing

While the DSI is responsible for SQT functions and transaction grouping, it is the responsibility of the DSI Executer (DSI-E) threads to actually perform the SQL string generation, command batching and exception handling. The key to the DSI-E is that the DSI-S simply passes the list of transaction id’s in the group to it. The DSI-E then reads the actual transaction commands from the DSI SQT cache region.

If you remember from the earlier discussion on LTL, the replicated functions (rs_insert, rs_update, rs_delete, etc.) actually are identified by the Replication Agent. This helps the rest of the Replication Server as it does not have to perform SQL language parsing (which is not in the transaction log anyhow – something many people have a hard time understanding – the transaction log NEVER logs the SQL). However, we need to send ASCII language commands to the replicate system (or RPC’s). As a result, the DSI-E thread execution looks like the following flow diagram.


[Flow diagram: a transaction group arrives from the DSI; the replicated functions are translated into SQL via function string (fstring) definitions; the transaction is broken into dsi_cmd_batch_size batches of SQL; each batch is sent to the replicate database; if a “stop” error occurs, the transaction is rolled back and the connection is suspended; otherwise batches continue until done, and the transaction is committed.]

Figure 38 – DSI Executer SQL Generation and Execution Logic

Note that in the above diagram, only “stop” errors cause the DSI to suspend. If you remember, some error actions such as ignore (commonly set to handle database change, print and other information messages), retry, etc. allow the DSI to continue uninterrupted.
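As a notional example of the translation step in the diagram (the table is the checking_acct example from earlier; the column names shown are hypothetical, and the SQL is only a sketch of what a default rs_insert function string would produce):

    -- replicated function received from the outbound queue (conceptually):
    --    rs_insert into checking_acct (acct_num = 101, balance = …)
    -- language command generated by the DSI-E via the rs_insert function string:
    insert into checking_acct (acct_num, balance) values (101, …)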

DSI Executer Performance

Beyond DSI command batching (next section), the tuning parameters available for the DSI Executer are listed in the following table (other parameters are available, but they do not specifically address performance throughput). Note that parameters specific to parallel DSI performance are not listed here.

Parameter (Default) Explanation

Replication Server scope

fstr_cachesize (obsolete/deprecated)

Obsolete and deprecated. In RS 12.0, it was decided that this was not necessary (possibly viewed as duplicative, as function string RSSD rows would be in STS cache as well) and the parameter was made obsolete (although still in the documentation). Mentioned here as questions are often asked whether changing this would help – the short answer is “No”. The longer answer is that it was essentially superseded by the sts_full_cache_xxxxx settings.

sts_cachesize Default: 100; Suggested: 1000

The total number of rows cached for each cached RSSD system table. Increasing this number to the number of active replication definitions prevents Replication Server from executing expensive table lookups. From a DSI Executer performance perspective, the STS cache could be used to hold RSSD tables such as rs_systext that hold the function string definitions. Of all the parameters below, this one is probably the most critical as insufficient STS cache would result in network and potentially disk i/o in accessing the RSSD.


sts_full_cache_xxxxx For DSI performance, the list of tables that should be fully cached includes rs_objects, rs_columns, and rs_functions.

Connection scope

batch Default: on; Recommended: on

Specifies how Replication Server sends commands to data servers. When batch is "on," Replication Server may send multiple commands to the data server as a single command batch. When batch is "off," Replication Server sends commands to the data server one at a time. This is “on” for ASE and should be on for any system that supports command batching due to performance improvements of batching. Some heterogeneous replicate systems – such as Oracle – do not support command batching, and consequently this parameter needs to be set to “off”. Note that for Oracle, we are referring to the actual DBMS engine – as of 9i and 10g, batch SQL is handled outside the DBMS engine by the PL/SQL engine.

batch_begin Default: on; Recommended: see text

Indicates whether a begin transaction can be sent in the same batch as other commands (such as insert, delete, and so on). For single DSI systems, this value should be ‘on’ (the default). If using parallel DSI’s and ‘wait_for_commit’, the value should be ‘on’ as well. For most other parallel DSI serialization methods (i.e. wait_for_start) this value should be ‘off’. The rationale for ‘off’ is that the DSIEXEC will post the ‘Batch Began’ message quicker to the DSI allowing the other parallel threads to begin quicker than waiting for the begin and the first command batch (and possibly only command batch) to execute before the message is sent.

db_packet_size Default: 512; Recommended: 8192 or 16384

The maximum size of a network packet. During database communication, the network packet value must be within the range accepted by the database. You may change this value if the Adaptive Server has been reconfigured with “max network packet size” at least as large as the desired value. A packet size of 16,384 on high-speed networks, or one tuned to the network MTU on lower-speed networks, is appropriate. Values less than 2,048 are suspect and should only be used if the target system does not support larger packet sizes. On ASE 15 systems, the connection will automatically be bumped to 2048 as the minimum packet size.

dsi_cmd_batch_size Default: 8192; Recommended: 32768

The maximum number of bytes that Replication Server places into a command batch. You need to be careful with this setting as too high of a setting may exceed the stack space in the replicate database engine. However, it should be at least the same as the db_packet_size if not doubled.

dsi_keep_triggers Default: “on” for most – “off” for WS; Recommended: “off”

Specifies whether triggers should fire for replicated transactions in the database. Set to "off" to cause Replication Server to set triggers off in the Adaptive Server database, so that triggers do not fire when transactions are executed on the connection. By default, this is set to "on" for all databases except standby databases. Arguably should be off for all databases, although caution should be exercised when replicating procedures. “On” is the default as it is the typical “safe” approach that Replication Server defaults assume, however, there should be compelling reasons not to have this turned “off” – including security as the replication maintenance user could be viewed as a “trusted agent” fully supportable in Bell-Lapadula and other NCSC endorsed security policies. Additionally, having it on is no guarantee of database consistency as will be illustrated later in the discussion on triggers. Simply put – if you leave this “on” – you WILL have RS latency & performance problems.


dsi_replication Default: “off” for most – “on” for WS

Specifies whether or not transactions applied by the DSI are marked in the transaction log as being replicated. When dsi_replication is set to "off," the DSI executes set replication off in the Adaptive Server database, preventing Adaptive Server from adding replication information to log records for transactions that the DSI executes. Since these transactions are executed by the maintenance user and, therefore, not usually replicated further (except if there is a standby database), setting this parameter to "off" avoids writing unnecessary information into the transaction log. dsi_replication must be set to "on" for the active database in a warm standby application for a replicate database, and for applications that use the replicated consolidated replicate application model. The reason this is mentioned as a possible performance enhancement is its applicability in multiple DSI situations discussed later.

Some of these, such as the STS and other server level configurations, have been discussed before and have been included here simply for completeness. Additionally, several have to do with command batching which is discussed in the next section. Those that are highlighted are specifically applicable to DSI Executer performance.
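As a sketch of how a couple of the highlighted parameters might be applied (RDS.rdb is a placeholder connection name; the sp_configure value is only an example and, as noted above, must be at least as large as db_packet_size):

    -- in the replicate ASE:
    sp_configure 'max network packet size', 16384

    -- in the Replication Server:
    alter connection to RDS.rdb set db_packet_size to '16384'
    alter connection to RDS.rdb set dsi_keep_triggers to 'off'
    suspend connection to RDS.rdb
    resume connection to RDS.rdb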

DSI EXEC DML Monitor Counters

Several monitor counters in the DSIEXEC module help analyze throughput, transaction characteristics and general function string generation issues.

Counter Explanation

Command (DML or DDL Related)

CmdsApplied Total commands applied by a DSIEXEC thread.

CmdsSQLDDLRead Total SQLDDL commands processed by a DSI DSIEXEC thread.

DeletesRead Total rs_delete commands processed by a DSIEXEC thread.

ExecsGetTextPtr Total invocations of function rs_get_textptr by a DSIEXEC thread. This function is executed each time the thread processes a writetext command.

ExecsWritetext Total rs_writetext commands processed by a DSIEXEC thread.

InsertsRead Total rs_insert commands processed by a DSIEXEC thread.

UpdatesRead Total rs_update commands processed by a DSIEXEC thread.

Function String Generation

DSIEFSMapTimeAve Average time taken, in 100ths of a second, to perform function string mapping on a command.

DSIEFSMapTimeLast Time, in 100ths of a second, to perform function string mapping on the last command.

DSIEFSMapTimeMax The maximum time taken, in 100ths of a second, to perform function string mapping on a command.

The RS 15.0 equivalent counters are:

Counter Explanation

Read From SQT Cache

DSIEReadTime The amount of time taken by a DSI/E to read a command from SQT cache.

DSIEWaitSQT The number of times DSI/E must wait for the command it needs next to be loaded into SQT cache.

DSIEGetTranTime The amount of time taken by a DSI/E to obtain control of the next logical transaction.


DSIERelTranTime The amount of time taken by a DSI/E to release control of the current logical transaction.

DSIEParseTime The amount of time taken by a DSI/E to parse commands read from SQT.

Command (DML or DDL Related)

TransSched Transactions groups scheduled to a DSIEXEC thread.

UnGroupedTransSched Transactions in transaction groups scheduled to a DSIEXEC thread.

DSIECmdsRead Commands read from an outbound queue by a DSIEXEC thread.

DSIECmdsSucceed Commands successfully applied to the target database by a DSI/E.

BeginsRead 'begin' transaction records processed by a DSIEXEC thread.

CommitsRead 'commit' transaction records processed by a DSIEXEC thread.

SysTransRead Internal system transactions processed by a DSI DSIEXEC thread.

CmdsSQLDDLRead SQLDDL commands processed by a DSI DSIEXEC thread.

InsertsRead rs_insert commands processed by a DSIEXEC thread.

UpdatesRead rs_update commands processed by a DSIEXEC thread.

DeletesRead rs_delete commands processed by a DSIEXEC thread.

ExecsWritetext rs_writetext commands processed by a DSIEXEC thread.

ExecsGetTextPtr Invocations of function rs_get_textptr by a DSIEXEC thread. This function is executed each time the thread processes a writetext command.

Function String Generation

DSIEFSMapTime Time, in 100ths of a second, to perform function string mapping on commands.

As you can see, the largest change is that the DSIEXEC has more counters tracking the time spent retrieving the commands/command groups from the SQT cache in the DSI thread.

An important aspect to these counters is to remember that they are per DSI EXEC thread – so with parallel DSI enabled, more than one value will be recorded. As mentioned earlier in the general discussion about the RS M&C feature, the rs_statdetail.instance_id column corresponds to the thread number for each value – allowing us to also track how efficiently each thread is utilized. For now, we will focus on just the function generation and DML aspects – later we will take a look at the parallel DSI aspect of the problem. However, it does mean that if looking across all the DSIEXEC’s, we will need to aggregate the counter values per sample period.

Some of the more useful general counters include:

CmdsApplied (DSICmdsSucceeded), CmdsPerSec = CmdsApplied / seconds
InsertsRead, UpdatesRead, DeletesRead
ExecsWritetext, ExecsGetTextPtr

These are fairly obvious as they help us establish rate information for throughput as well as which commands were being executed. The last set refers more to text/image processing and can be used to develop profiles (e.g., a relative indication of the size of the text/image data is WritesPerBlob = ExecsWritetext / ExecsGetTextPtr). While these are interesting to monitor (and the number of updates may give a clue to how effective minimal column replication might be), the real effort at this stage is command batching.
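For example (hypothetical numbers – the tests below involve no text/image data):

    WritesPerBlob = ExecsWritetext / ExecsGetTextPtr
    -- e.g., 5,000 writetext executions against 50 text pointers ≈ 100 writetext calls per
    -- column value, implying fairly substantial text/image values (values under ~16KB need
    -- only a single writetext), which may help explain a slow delivery rate at the replicate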

Let’s take a look at how these counters can be used. First, let’s consider the insert stress test:


Sample     Cmds     Cmds    Inserts  Updates  Deletes  ExecsGet  Execs      Msg      MsgChks
Time       Applied  PerSec  Read     Read     Read     TextPtr   Writetext  Checks   PerCmd
11:37:57   305      27      200      0        0        0         0          94       0.3
11:38:08   2,030    203     1,450    0        0        0         0          541      0.2
11:38:19   2,234    203     1,595    0        0        0         0          567      0.2
11:38:30   2,267    226     1,620    0        0        0         0          640      0.2
11:38:41   2,150    195     1,536    0        0        0         0          571      0.2
11:38:52   2,235    203     1,595    0        0        0         0          556      0.2
11:39:03   2,253    204     1,609    0        0        0         0          580      0.2
11:39:14   2,212    201     1,580    0        0        0         0          587      0.2
11:39:25   2,107    191     1,504    0        0        0         0          584      0.2
11:39:36   2,414    219     1,725    0        0        0         0          654      0.2

As you can see, the cumulative throughput was ~200 commands/sec across all the DSI’s, and it was all inserts (no surprise). The disparity between CmdsApplied and InsertsRead is simple – the begin tran/commit tran commands are counted as well. An interesting statistic is the message checks per command – which averages close to 25%. Note that the test machine can easily hit 900 inserts/sec using RPC calls and 200 inserts/sec using language commands – consequently the 200 inserts/sec rate may be the maximum we can get out of the replicate ASE using a Warm Standby configuration. Later when we look at the timing information, we will see statistics that help confirm that it is the replicate ASE that is the bottleneck.

Because the insert stress test is rather simplistic, let’s next take a quick look at the first day of the customer’s data that we have been looking at before we discuss the counters:

Sample     Cmds     Cmds    Inserts  Updates  Deletes  ExecsGet  Execs      Msg      MsgChks
Time       Applied  PerSec  Read     Read     Read     TextPtr   Writetext  Checks   PerCmd
19:02:07   6        0       0        2        0        0         0          8        1.3
19:07:08   6,292    20      615      1,914    615      0         0          4,802    0.7
19:12:10   7,679    25      0        3,839    0        0         0          5,909    0.7
19:17:12   4,119    13      0        2,059    0        0         0          3,210    0.7
19:22:13   6,983    23      0        3,491    0        0         0          5,357    0.7
19:27:14   7,491    24      0        3,745    0        0         0          5,738    0.7
19:32:16   15,595   51      526      6,841    430      0         0          11,711   0.7
19:37:18   41,343   137     10,347   1        10,299   0         0          31,044   0.7
19:42:19   42,255   140     10,469   270      10,360   0         0          31,734   0.7
19:47:21   21,711   72      5,431    9        5,411    0         0          16,299   0.7

Now, let’s take a look at some of these metrics – for the most part the description will concentrate on the customer numbers and only refer back to the insert test when necessary.


CmdsApplied – CmdsApplied reports the number of SQL statements issued to the replicated database. As you can see in the above, the system is nearly idle at the beginning and then builds to executing tens of thousands of SQL commands per sample period.

CmdsPerSec – This metric is derived by dividing the CmdsApplied by the number of seconds in the sample interval. This can be used to gauge the real performance of the DSI threads vs. CmdsApplied as it gives an execution rate. Note that it peaks at ~140/sec – which really is not all that good (compared to the insert test steadily achieving 200 inserts/sec on a laptop and even that is not ideal) – but then we are dealing with a single DSI thread as well.

Inserts/Updates/DeletesRead – Much like the DIST counters, these counters track the number of inserts, updates and deletes read out of the outbound queue and sent to the replicate database. Again, we see the curious pattern of inserts/deletes mimicking each other. However, the number of updates also suggests that minimal column replication should be considered as well. One thing that is interesting is that the sum of the DML commands is only ½ of the CmdsApplied value. The reason is that the counters for the begin transaction & commit transaction are not shown above. For example, if the delete/insert were a pair in a single transaction, then at time 19:37, we would have ~10,000 deletes + 10,000 inserts + 10,000 begin tran + 10,000 commit trans – which does work out to 40,000 commands.

ExecsGetTextPtr/ExecsWritetext – These counters are related to text/image processing. The first metric refers to the number of text/image columns that are involved. The reason this can be deduced is that each text/image column per replicated row will require an execution of rs_get_textptr (see section on text replication later). The second counter is incremented for each writetext operation. While there are counters available, these two also give you a fairly good indication of the amount of text/image data flowing. For example, if they were equal, then you would know that the amount of text/image data is fairly small (<16KB) and can be issued with a single writetext call. If you see a ratio of 100 or more writetext commands per rs_get_textptr, you can be fairly confident that the text/image data is fairly substantial – which may be contributing to the slow delivery rate at the replicate database server.

MsgChecks – This metric tracks how often the DSI EXEC threads check for pending commands via the OpenServer message structures referenced in the internals section at the beginning of this document – specifically the batch sequencing and commit sequencing messages (but also the actual transaction commands are posted here as well).

MsgChksPerCmd – This metric is derived by dividing the number of MsgChecks by CmdsApplied to get a ratio of how autonomous the DSIEXEC is. For example, with large groups allowed, once it can start, the DSIEXEC can do a lot of processing without having to continuously check transaction group/batch sequencing with the parent DSI thread. In this case, we see that we are checking with the parent DSI thread for nearly every command – but then remember, the computed transactions per group (DSIXactInGrp) was 1 – which means with only 1 transaction per group, we are going to have more coordination between the DSI and the DSIEXEC with each new transaction. A key element here is that we are doing roughly 2 message checks for every 3 commands – which makes sense if these are atomic transactions, as each transaction would be 3-4 commands (begin, insert/delete, commit) and we would need to check group sequencing and batch sequencing (two message checks).
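Working the arithmetic for the first day as an approximation:

    MsgChksPerCmd ≈ 2 message checks per transaction / 3 commands per transaction ≈ 0.7

which matches the observed value in the table above.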

Now, let’s look at the second day’s metrics:

Sample     Cmds     Cmds    Inserts  Updates  Deletes  ExecsGet  Execs      Msg      MsgChks
Time       Applied  PerSec  Read     Read     Read     TextPtr   Writetext  Checks   PerCmd
19:02:07   9        0       0        3        0        0         0          12       1.3
19:07:08   6,758    22      627      2,133    615      0         0          2,058    0.3
19:12:10   7,371    24      0        3,685    0        0         0          2,095    0.2
19:17:12   4,171    13      6        2,079    0        0         0          1,198    0.2
19:22:13   5,603    18      0        2,802    0        0         0          1,638    0.2
19:27:14   14,959   49      0        7,479    0        0         0          4,114    0.2
19:32:16   22,128   73      4,012    3,071    3,978    0         0          6,085    0.2
19:37:18   52,085   172     13,047   73       12,885   0         0          14,344   0.2
19:42:19   46,301   153     11,584   37       11,520   0         0          12,738   0.2
19:47:21   26,980   89      5,382    2,797    5,268    0         0          7,445    0.2

Notice that although the CmdsPerSec values are in the same order of magnitude, the MsgChksPerCmd is considerably less. If you remember, it was during this time frame that the transaction grouping was much more effective – hitting the goal of 20 consistently for the last several samples. So, if transaction grouping is much more efficient (and notice that we are now only coordinating with the DSI approximately every fifth command – but with 2 checks (batch & transaction sequence) we really are checking about every 10 commands), yet we have not really improved the delivery rate, then something else is the bottleneck now – and may have been the primary limiting factor the day before.

This points out an interesting perspective. So many P&T sessions follow the same flawed logic:

1. Change one setting and run the test (this is already flawed as sometimes settings work together cooperatively).

2. If it didn’t improve anything, reset it to the original setting and try something else

The question is, how often – after finding something that helps – do DBA’s go back and retry something that didn’t help previously?? Answer: Not very often. In this case, we know that in the first day, transaction grouping was not occurring. Whatever the customer did to change the picture for day 2 helped the transaction grouping, but did not help the overall throughput. The tendency might be to reset whatever was changed – and look at something else. However, a better way of looking at P&T issues is to think of the system as a pipeline – with at least one or more bottlenecks. Removing the second or third one will not necessarily improve the throughput as the first one is still constricting the flow. Putting it back and then removing the first one doesn’t help either as the re-introduced second bottleneck restricts the flow making the removal of the first bottleneck appear to be without benefit as well. As a result, when using the M&C, it is best sometimes to look at each and if possible, remove each bottleneck as noticed – and leave it removed. In this case, they should leave the changes they made that affected transaction grouping intact – as larger groupings of transactions are much more efficient in non-Parallel DSI systems.

DSIEXEC Command Batching

In addition to transaction grouping, another DSI feature that is critical for replication throughput is DSI command batching. While some database systems, such as older Oracle (pre-9i), do not allow this feature or impose limitations, those that do support it gain a tremendous advantage in reducing network I/O time by batching all available SQL and sending it to the database server in a single structure. This is analogous to executing the following from isql:

-- isql script
insert into orders values (…)
insert into order_items values (…)
insert into order_items values (…)
insert into order_items values (…)
insert into order_items values (…)
insert into order_items values (…)
insert into order_items values (…)
insert into order_items values (…)
go

vs. without command batching, the same isql script would look like:

-- isql script
insert into orders values (…)
go
insert into order_items values (…)
go
insert into order_items values (…)
go
insert into order_items values (…)
go
insert into order_items values (…)
go
insert into order_items values (…)
go
insert into order_items values (…)
go
insert into order_items values (…)
go

Anyone with basic performance and tuning knowledge will be able to tell that the first example will execute an order of magnitude faster from the client application perspective. How does this apply to Replication Server? Believe it or not, it does NOT mean that multiple transaction groups can be lumped into a large SQL structure and sent in a single batch. It does mean, however, that all of the members of a single transaction group may be sent as a single command batch – with the exception of the final commit, for recovery reasons: the commit is withheld until all transaction statements have executed without error. If there are no errors, the commit is sent separately; if errors occurred, either a rollback is issued or the DSI connection is suspended (most common), which implicitly rolls back the transaction. The way this works is as follows:

1. The DSI groups a series of transactions until one of the group termination messages is hit (for example, the maximum of 65,536 bytes).

2. The DSI passes the entire transaction group to the next DSIEXEC that is ready.

3. The DSIEXEC executes the grouped transaction by sending dsi_cmd_batch_size-sized command batches (8,192 bytes by default) until the group is completed.

In the example above, if the transaction group was terminated due to 65,536 byte limitation and we still had the default of 8,192 bytes per command batch, the entire transaction would be sent to the replicate database in ~8 batches of 8,192 bytes (depending on command boundaries as ASE requires that commands must be complete and not split across batches). Consequently, the effect of command batching – which is on by default for ASE replicates – is that performance of each of the transaction groups is maximized by reducing the network overhead of sending a large number of statements within a single transaction. In this way, lock time on the replicate is minimized, reducing contention between parallel DSI’s as well as contention with normal replicate database users.

Command batching is critical for large transaction performance. As you could imagine, a large transaction – especially one that gets removed from SQT due to cache limitations – will force the previous transaction group to end. As the large transaction begins, each dsi_cmd_batch_size bytes will be sent to the replicate database.

The common wisdom has been to set this to the same size as db_packet_size or some small multiple (i.e. 2x) of db_packet_size. However, as you will see when we look at the counters, this is actually not optimal – the optimum is to set it large enough that a single transaction group is sent as one command batch. It might be tempting then to assume that it would be best to set dsi_cmd_batch_size to the same as dsi_xact_group_size, or 65,536 by default. There is one problem with this that many people who have coded large stored procedures might remember – each user connection has a limited stack size in ASE. Issuing too large a batch of SQL results in a stack overflow – while later releases of ASE can easily handle large batches, early releases from quite a few years ago could not. Consequently, RS sends a maximum of 50 commands even if dsi_cmd_batch_size would support more. The optimal setting is the largest command buffer that your DBMS can handle, letting the network layer break it up into smaller chunks. The dsi_cmd_batch_size should rarely (hesitating to say “never” only to avoid setting a precedent) be set to less than 8,192 no matter what db_packet_size is set to – and never less than 2,048, as a large data row of >1,000 bytes might easily exceed this by the time the column names, etc. are added to the command. Remember, even with the default packet size of 512 bytes (how many of us typically set the “-A” flag higher?), isql is faster executing batches of SQL than individual statements. So lowering dsi_cmd_batch_size to db_packet_size will typically degrade throughput.
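For reference, dsi_cmd_batch_size is a connection-level configuration parameter; a minimal isql sketch against the Replication Server follows, assuming a hypothetical replicate connection RDS.rdb (the connection is suspended before the change and resumed afterward):

suspend connection to RDS.rdb
go
alter connection to RDS.rdb set dsi_cmd_batch_size to '65536'
go
resume connection to RDS.rdb
go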

Key Concept #17: Along with transaction grouping, DSI command batching is critical to throughput to replicate systems that support it. The optimal size for DSI command batching would allow the entire transaction group to be sent as a single command batch.

However, just like transaction grouping – the command batching limits are upper bounds/goals. Command batches could be flushed from the DSI EXEC for any number of reasons – some of which are tracked by the monitor counters.

Command Batch Monitor Counters

Several DSIEXEC module counters exist to help optimize command batching:


Counter Explanation

Preparation

DSIEBatch The number of command batches started.

DSIEBatchSizeAve Average size, in bytes, of a command batch submitted by a DSI.

DSIEBatchSizeLast Size, in bytes, of the last command batch submitted by a DSI.

DSIEBatchSizeMax The maximum size, in bytes, of a command batch submitted by a DSI.

DSIEBatchTimeAve Average time taken, in 100ths of a second, to process a command batch submitted by a DSI.

DSIEBatchTimeLast Time, in 100ths of a second, to process the last command batch submitted by a DSI.

DSIEBatchTimeMax The maximum time taken, in 100ths of a second, to process a command batch submitted by a DSI.

DSIEICmdCountAve Average number of input commands in a batch submitted by a DSI.

DSIEICmdCountLast Number of input commands in the last command batch submitted by a DSI.

DSIEICmdCountMax The maximum number of input commands in a batch submitted by a DSI.

DSIEOCmdCountAve Average number of output commands in a batch submitted by a DSI.

DSIEOCmdCountLast Number of output commands in the last command batch submitted by a DSI.

DSIEOCmdCountMax The maximum number of output commands in a batch submitted by a DSI.

MemUsedAvgGroup Average memory consumed by a DSI/S thread for a single transaction group.

MemUsedLastGroup Memory consumed by a DSI/S thread for the most recent transaction group.

MemUsedMaxGroup Maximum memory consumed by a DSI/S thread for a single transaction group.

TransAvgGroup The average number of transactions dispatched as a single atomic transaction. If the value of this counter is close to the value of TransMaxGroup, you may want to consider bumping dsi_xact_group_size and/or dsi_max_xacts_in_group.

TransLastGroup If a DSIEXEC thread is capable of utilizing any degree of transaction grouping logic, this counter reports the number of transactions executed in the last grouped transaction.

TransMaxGroup The maximum number of transactions dispatched as a single atomic transaction.

Execution

DSIEBFBatchOff Number of batch flushes executed because command batching has been turned off.

DSIEBFBegin Number of batch flushes executed because the next command is a 'transaction begin' command and by configuration such commands must go in a separate batch.

DSIEBFCommitNext Number of batch flushes executed because the next command in the transaction will be a commit.

DSIEBFForced Number of batch flushes executed because the situation forced a flush. For example, an 'install java' command needs to be executed, or the next command is the first chunk of BLOB DDL.

DSIEBFGetTextDesc Number of batch flushes executed because the next command is a get text descriptor command.

DSIEBFMaxBytes Number of batch flushes executed because the next command would exceed the batch byte limit.


DSIEBFMaxCmds Number of batch flushes executed because we have a new command and the maximum number of commands per batch has been reached. This limit currently is 50 commands as measured from the input command buffer.

DSIEBFResultsProc Number of batch flushes executed because the next command is to have its results processed in a context different from the current batch.

DSIEBFRowRslts Number of batch flushes executed because we expect to have row results to process.

DSIEBFRPCNext Number of batch flushes executed because the next command is an RPC.

DSIEBFSysTran Number of batch flushes executed because the next command is part of a system transaction.

Sequencing

DSIESCBTimeAve Average time taken, in 100ths of a second, to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'.

DSIESCBTimeMax The maximum time taken, in 100ths of a second, to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'.

In RS 15.0, these counters are similar, but the separate average/max/last variants are consolidated into single counters (the totals, averages and maximums are instead available as columns in rs_statdetail):

Counter Explanation

Preparation

DSIEBatchTime Time, in 100ths of a second, to process command batches submitted by a DSI.

DSIEBatchSize Size, in bytes, of command batches submitted by a DSI.

DSIEOCmdCount Number of output commands in command batches submitted by a DSI.

DSIEICmdCount Number of input commands in command batches submitted by a DSI.

Execution

DSIEBFResultsProc Number of batch flushes executed because the next command is to have its results processed in a context different from the current batch.

DSIEBFCommitNext Number of batch flushes executed because the next command in the transaction will be a commit.

DSIEBFMaxCmds Number of batch flushes executed because we have a new command and the maximum number of commands per batch has been reached.

DSIEBFRowRslts Number of batch flushes executed because we expect to have row results to process.

DSIEBFRPCNext Number of batch flushes executed because the next command is an RPC.

DSIEBFGetTextDesc Number of batch flushes executed because the next command is a get text descriptor command.

DSIEBFBatchOff Number of batch flushes executed because command batching has been turned off.

DSIEBFMaxBytes Number of batch flushes executed because the next command would exceed the batch byte limit.

DSIEBFBegin Number of batch flushes executed because the next command is a 'transaction begin' command and by configuration such commands must go in a separate batch.

DSIEBFSysTran Number of batch flushes executed because the next command is part of a system transaction.


DSIEBFForced Number of batch flushes executed because the situation forced a flush. For example, an 'install java' command needs to be executed, or the next command is the first chunk of BLOB DDL.

Sequencing

DSIESCBTime Time, in 100ths of a second, to check the sequencing on command batches which required some kind of synchronization such as 'wait_for_commit'.

Note that the equivalent of DSIEBatch in RS 15.0 is to get the counter_obs column value for the DSIEBatchSize counter.

Command batching can be compared to how many SQL statements you put before each ‘go’ in a file to be executed by isql. If you’ve ever done this test, you will quickly find that with smaller numbers (i.e. 2 or 3 inserts, then a ‘go’), it is much slower than with 100 or so. Consequently, you will want to watch the following counters (RS 12.6 listed – the equivalent RS 15.0 counters can be easily determined):

DSIEBatch
DSIEBatchSizeMax, DSIEBatchSizeAve
DSIEOCmdCountMax, DSIEOCmdCountAve
DSIEBFCommitNext, DSIEBFBegin
DSIEBFMaxCmds, DSIEBFMaxBytes
DSIEBFRPCNext, DSIEBFGetTextDesc, DSIEBFSysTran

The first one is fairly simple – the number of command batches used. The next set reports the size, in bytes, of the command batches. The default dsi_cmd_batch_size of 8192 typically is too small and most often results in only 4-6 SQL commands per batch. Increasing this to 256K is likely advisable as well.

The set after that, DSIEOCmdCountMax/Ave, reports the number of commands actually sent per batch vs. the bytes. Along with the above, these are some of the more important counters. The “O” vs. the “I” in the similar batch of counters (e.g. DSIEICmdCountAve) refers to Output vs. Input. In other words, the DSI submits a transaction grouping of commands – but they are commands for which the SQL generation has not yet happened. After SQL generation and variable substitution, the number of bytes per batch or other factors may reduce the actual number of commands sent in the batch to the replicate DBMS. The Output commands are of the most interest to us. The best way to think of this is that after every DSIEOCmdCountAve commands (on average) a “go” is sent, a la isql. Obviously, the smaller the batches, the slower the throughput. The real goal, then, is to try to submit the entire transaction group in one command batch.

Much like with DSI transaction grouping, with command batching there can be many reasons why a command batch is terminated. All of the counters beginning with DSIEBF (DSIEXEC Batch Flush) track these reasons. Some of the more common ones are described in the following bullets.

DSIEBFCommitNext - This counter signals that the end of the transaction group has been reached. As mentioned above, if the goal is to submit the entire transaction group as a single batch, you want this counter to be the primary reason for command batch flushes to the replicated database.

DSIEBFBegin - This counter is typically incremented when batch_begin is off. If this is deliberate, it can be ignored.

DSIEBFMaxBytes - This clearly suggests that dsi_cmd_batch_size is too small, as described in the above paragraph. The batch is sent because the next command would have exceeded dsi_cmd_batch_size.

DSIEBFMaxCmds - This counter tells when the batch hits the internal limit of 50 commands, as measured before function string mapping. One reason for limiting the number of commands per batch is that some servers (including earlier versions of Sybase SQL Server) would suffer a stack overflow if the command batch exceeded 64KB.

DSIEBFRPCNext - This counter signals how often a batch was flushed because the next command had an output style of RPC instead of language. RPCs cannot be batched; consequently, the language commands accumulated in the batch before it have to be flushed, and then the RPC is sent.

DSIEBFGetTextDesc - This counter tells how often a batch was flushed because the next command would be a writetext command. Since the writetext requires a text pointer, we first have to get the textpointer value from the replicate server.


DSIEBFSysTran - This counter tells us how often a batch was flushed due to the next command being a DDL command. In order to replicate DDL statements, they are submitted outside the scope of a transaction - so in this case, not only is the batch flushed, but the transaction grouping stopped as well.

Let’s take a look at our insert stress test. There are three sample periods below. The first two are from when dsi_cmd_batch_size was set to 16,384 (twice the packet size of 8,192), and the latter from when it was increased to 65,536. The difference between the first two has to do with the average number of transactions per group the DSI was submitting.

Sample Time  DSIEBatch  DSIEBatchTimeMax  DSIEBatchTimeAve  DSIEBatchSizeAve  DSIEBatchSizeMax  DSIEOCmdCountMax  DSIEOCmdCountAve  DSIEBFCommitNext  DSIEBFMaxCmds  DSIEBFMaxBytes

dsi_cmd_batch_size at 16384, 5 inserts/transaction, 1 tran per group avg

16:13:20 136 100 1 3,550 12,004 16 5 271 0 0

16:13:30 119 100 1 3,770 12,004 16 5 244 0 0

dsi_cmd_batch_size at 16384, 5 inserts/transaction, ~4 tran per group avg

11:17:39 66 100 4 7,299 15,999 21 9 131 0 50

11:17:50 59 100 5 7,864 15,999 21 10 120 0 43

dsi_cmd_batch_size at 65536, 5 inserts/transaction, ~4 tran per group avg

11:38:08 63 100 4 9,141 39,170 50 12 126 9 0

11:38:19 68 100 4 8,952 39,170 50 12 137 8 0

It helps, of course, to have the intrusive counters for timing purposes turned on. From the above, we can see that RS is taking about 10ms (the counter is in 100ths of a second, so a value of 1 equals 10ms) per ungrouped transaction to process the batch. We will look at the timing aspect next, but for now, let’s look at the commands. Notice that the average number of commands per batch increased from 5 to ~10 to 12. It is interesting to note that the number of CmdsPerSec jumped from ~150 to ~200 (the earlier execution statistics for the first set are not shown here) simply by increasing the number of commands per command batch. Also, between #2 and #3, increasing dsi_cmd_batch_size shifted the ~25% of batch flushes caused by hitting the configured byte limit to fewer than 10% caused by hitting the maximum number of commands.

One interesting statistic to keep in mind is that each command batch will have 2 batch flushes at a minimum. This is because the actual commit is sent in a separate batch. Consequently, the ideal situation would be to have DSIEBFCommitNext ≈ 2 x DSIEBatch – which is what we have. However, this puts the DSIEBFMaxBytes in more perspective, as it suggests that nearly every batch in the middle sample exceeded dsi_cmd_batch_size – requiring three command batches instead of two. Since each batch that begins will have at least one separate batch flush for the commit record, you can subtract DSIEBatch from DSIEBFCommitNext to arrive at a truer DSIEBFCommitNext.

Since we are looking at the execution stage, it would help to take a look at the times for the various stages – RS 1) preparing the batch; 2) sending the batch to the replicate database and then 3) processing the results. Let’s use the same time samples from the insert stress test above and see if it can explain why we have latency (or at least which component is to blame):

Sample Time  DSIEBatch  DSIEBatchTimeMax  DSIEBatchTimeAve  DSIEOCmdCountMax  DSIEOCmdCountAve  SendTimeMax  SendTimeAvg  DSIEResultTimeMax  DSIEResultTimeAve  DSIEResultTimePerCmd

dsi_cmd_batch_size at 16384, 5 inserts/transaction, 1 tran per group avg

16:13:20 136 100 1 16 5 0 0 100 13 2.6

16:13:30 119 100 1 16 5 0 0 100 17 3.4


dsi_cmd_batch_size at 16384, 5 inserts/transaction, ~4 tran per group avg

11:17:39 66 100 4 21 9 0 0 100 14 1.5

11:17:50 59 100 5 21 10 0 0 100 22 2.2

dsi_cmd_batch_size at 65536, 5 inserts/transaction, ~4 tran per group avg

11:38:08 63 100 4 50 12 0 0 100 29 2.4

11:38:19 68 100 4 50 12 0 0 100 26 2.1

And we notice a very key clue – it is taking ~20ms (the counter is in 100ths of a second, so a value of 2 equals 20ms) per command to process the results. In fact, it is taking RS about 4 times longer to process the results than it does to process the batch internally – and with no grouping, roughly 15x longer. Given this lag, no matter how much tuning we do to RS, it will be extremely difficult to go much faster – we need to speed up the replicate SQL execution first. At 20ms per command, we could only hit ~50 commands/sec – of course, using parallel DSI’s helps some, but in this case they would barely let us quadruple the throughput of this system. Ideally, we need to figure out a way to speed up the individual command processing – increasing the number of DSI’s may not help if the system is already CPU bound.

Now, let’s take a look at the customer system. Unfortunately, the customer system did not have the timing counters enabled, so we will only be able to look at the batching efficiency over the two days:

Sample Time  DSIEBatch  DSIEBatchTimeMax  DSIEBatchTimeAve  DSIEBatchSizeAve  DSIEBatchSizeMax  DSIEOCmdCountMax  DSIEOCmdCountAve  DSIEBFCommitNext  DSIEBFMaxCmds  DSIEBFMaxBytes

~1 tran per group avg

19:07:08 1,574 0 1 884 2,342 3 2 3,148 0 0

19:12:10 1,920 0 1 1,271 2,325 3 2 3,840 0 0

19:17:12 1,030 0 1 1,274 2,323 3 2 2,060 0 0

19:22:13 1,746 0 1 1,279 2,340 3 2 3,491 0 0

~16-20 tran per group avg

19:23:32 148 0 11 4,852 8,185 61 11 294 0 287

19:28:34 115 0 16 5,849 8,189 16 10 230 0 538

19:33:36 69 0 14 5,768 8,189 17 9 138 0 304

19:38:38 101 0 13 5,708 8,189 16 9 201 0 402

You can see the changes in some of the counters as described below:

DSIEBatch – Normally, a 10-20x drop in the number of batches sent would suggest a drop in throughput – either because fewer transactions are being replicated or because of slower delivery. In this case, however, DSIEXEC.TransAvgGroup (not shown here for space reasons) jumped by roughly the same factor.

DSIEBatchTimeAve – Again, this shows the same jump, but it works out to slightly less than 1/100th of a second (10ms) per transaction group – so we are not too concerned here – although we wish we would have had some of the other time based counters such as DSIEResultTimeAve for comparison.


DSIEBatchSizeAve/Max – Obviously, this system is still using the default dsi_cmd_batch_size = 8192. While this may not look like a problem since the average is only 70% of the max, an average that close to the max tells us that the max is being hit fairly frequently (as DSIEBFMaxBytes does show). Remember also that only complete commands can be sent – therefore, if the average command size is 1,500 bytes, the most we will be able to send is 5 commands or 7,500 bytes. As a result, it may be difficult for the max itself to be hit that often.

DSIEOCmdCountAve/Max – In the first day’s metrics, not only is transaction grouping an issue, but command batching is all but ineffective as well. Day two is a lot better, running ~10 commands per batch and peaks of ~60 commands per batch.

DSIEBFCommitNext/DSIEBFMaxBytes – Unlike the insert stress test system, this one never trips DSIEBFMaxCmds; instead, it shows that the most common reason for batch flushes is hitting the dsi_cmd_batch_size limit. If we sum the four values above, we get totals of DSIEBFCommitNext=863 and DSIEBFMaxBytes=1,531, or nearly a 2:1 ratio in favor of DSIEBFMaxBytes. Curiously, DSIEBatch – which reports the number of batches begun – totals only 433. While this may seem odd, remember that DSIEBatch is measured at the beginning – and likely some of the command batches exceeded dsi_cmd_batch_size several times within the same batch, resulting in multiple batch flushes per command batch in addition to the separate commit flush. If we subtract the 433 batches from DSIEBFCommitNext, we end up with 430 instead of 863, which is roughly a 3.5:1 ratio for DSIEBFMaxBytes – and a truer picture of the problem.

So, part of the issue with this system is that dsi_cmd_batch_size is undertuned. While this may be a big bottleneck, it is not the largest, and tuning it will help some – but not likely as much as some may be looking for. Much like the multiple bottlenecks in a pipe, removing other bottlenecks may have greater impact – for example, 50% of the latency can be eliminated for this system simply by eliminating the delete/insert pairs and replacing them with an update statement. Increasing dsi_cmd_batch_size is still a good idea.

Some of you may have noticed that during the 19:23:32 period (the first sample in the second group in the table), the value of DSIEOCmdCountMax was 61 – definitely higher than the limit we stated as 50. The command limit is based on replicated commands from the input, whereas during SQL generation, additional commands may be necessary. For example, if we replicate a table containing identity columns, the actual replicated command is the rs_insert – a single command. However, the output command language would require:

set identity_insert tablename on
insert into tablename
set identity_insert tablename off

Thus a single input command becomes three output commands. Consequently, while you may see DSIEOCmdCountAve/Max/Last higher than 50, the input counters DSIEICmdCountAve/Max/Last should never exceed 50. In the case above, when DSIEOCmdCountMax was 61, DSIEICmdCountMax during the same period was 41.

DSIEXEC Execution

Replication Server is simply another client to ASE or any other DBMS – it has no special prioritization nor special command processing. Consequently, RS execution of SQL statements is effectively very similar to the basic ct_results() looping in sample CT-Lib programs. The basic template might look similar to:

ct_command()  – called to create command batch
ct_send()     – send commands to the server
while ct_results returns CS_SUCCEED
    (optional) ct_res_info to get current command number
    switch on result_type
        /*
        ** Values of result_type that indicate fetchable results:
        */
        case CS_COMPUTE_RESULT...
        case CS_CURSOR_RESULT...
        case CS_PARAM_RESULT...
        case CS_ROW_RESULT...
        case CS_STATUS_RESULT...
        /*
        ** Values of result_type that indicate non-fetchable results:
        */
        case CS_COMPUTEFMT_RESULT...
        case CS_MSG_RESULT...
        case CS_ROWFMT_RESULT...
        case CS_DESCRIBE_RESULT...
        /*
        ** Other values of result_type:
        */
        case CS_CMD_DONE...
            (optional) ct_res_info to get the number of rows affected by the current command
        case CS_CMD_FAIL...
        case CS_CMD_SUCCEED...
    end switch
end while
switch on ct_results’ final return code
    case CS_END_RESULTS...
    case CS_CANCELED...
    case CS_FAIL...
end switch

The only real difference would be if an RPC call was made or text/image processing. To some, the many variations of result type processing may seem to be a bit overkill as RS really doesn’t need or care about the results – let alone compute-by clause results. However, remember that with stored procedure replication, just about any SQL statement could be contained within the replicated procedure, consequently RS needs to know how to handle the results type. Those familiar with CT-Lib programming also know that within this ct_results() loop often is a ct_fetch() loop – which RS has to implement as well. Ideally, there will only be a single result for each DML command, but again, in the case of stored procedure replication, there might be any number of rows to be fetched and/or messages from print statements.

So why are we discussing all of this? For two main reasons: first, to help you understand how RS works; second, and most relevant to this section, to introduce the counters that are associated with execution statistics.

DSIEXEC Execution Monitor Counters

The following monitor counters deal specifically with sending the commands to the replicate DBMS, processing the results (and error handling) during processing. Normally, only a few of these are applicable as most replication environments are fairly basic (consequently values for other counters may be an indication of unexpected behavior that may be contributing to the issue at hand). Some of the counters are repeated from earlier sections, but since they are applicable here – particularly in light of some of the derived values – they are repeated here for ease of reference.

Counter Explanation

Batch sequencing (repeated from earlier)

DSIESCBTimeAve Average time taken, in 100ths of a second, to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'.

DSIESCBTimeMax The maximum time taken, in 100ths of a second, to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'.

ct_send() phase

SendTimeAvg Average time, in 100ths of a second, spent in sending command buffers to the RDS.

SendTimeMax Maximum time, in 100ths of a second, spent in sending command buffers to the RDS.

SendRPCTimeAvg Average time, in 100ths of a second, spent in sending RPCs to the RDS.

SendRPCTimeMax Maximum time, in 100ths of a second, spent in sending RPCs to the RDS.

SendDTTimeAvg Average time, in 100ths of a second, spent in sending chunks of text or image data to the RDS.

SendDTTimeMax Maximum time, in 100ths of a second, spent in sending chunks of text or image data to the RDS.

ct_results() processing

DSIEResultTimeAve Average time taken, in 100ths of a second, to process the results of a command batch submitted by a DSI.

DSIEResultTimeMax The maximum time taken, in 100ths of a second, to process the results of a command batch submitted by a DSI.

Exception Processing


ErrsDeadlock Total times that a DSI thread failed to apply a transaction due to deadlocks in the target database (ASE Error 1205). Note that this does not track the times when deadlocks occur with parallel DSI’s, but only when RS deadlocks with another non-RS process.

ErrsLogFull Total times that a DSI thread failed to apply a transaction due to no available log space in the target database (ASE Error 1105).

ErrsLogSuspend Total times that a DSI thread failed to apply a transaction due to the target database being in log suspend mode (ASE Error 7415).

ErrsNoConn Total times that a DSI thread failed to apply a transaction due to no connections to the target database (ASE Error 1601).

ErrsOutofLock Total times that a DSI thread failed to apply a transaction due to no locks available in the target database (ASE Error 1204).

Commit Sequencing

DSIESCCTimeAve Average time taken, in 100ths of a second, to check the sequencing on a commit.

DSIESCCTimeMax The maximum time taken, in 100ths of a second, to check the sequencing on a commit.

MsgChecks Total checks for Open Server messages by a DSIEXEC thread. Message checks are for group and batch sequencing operations as discussed earlier in association with the dsi_serialization_method

MsgChecksFailed Number of MsgChecks_Fail returned when a DSIEXEC thread calls dsie__CheckForMsg(). If a timer is specified, MsgChecks_Fail returns if timer expired before an event is returned.

DSIETranTimeAve Average time taken, in 100ths of a second, to process a transaction by a DSI/E thread. This includes function string mapping, sending and processing results. A transaction may span command batches.

DSIETranTimeMax The maximum time taken, in 100ths of a second, to process a transaction by a DSI/E thread. This includes function string mapping, sending and processing results. A transaction may span command batches.

In RS 15.0, the counters are similar:

Counter Explanation

Preparation & Batch Sequencing

DSIESCBTime Time, in 100ths of a second, to check the sequencing on command batches which required some kind of synchronization such as 'wait_for_commit'.

DSIEPrepareTime The amount of time taken by a DSI/E to prepare commands for execution.

Ct_send() phase

SendTime Time, in 100ths of a second, spent in sending command buffers to the RDS.

SendRPCTime Time, in 100ths of a second, spent in sending RPCs to the RDS.

SendDTTime Time, in 100ths of a second, spent in sending chunks of text or image data to the RDS.

DSIEExecCmdTime The amount of time taken by a DSI/E to execute commands. This process includes creating command batches, flushing them, handling errors, etc.

DSIEExecWrtxtCmdTime The amount of time taken by a DSI/E to execute commands related to text/image data. This process includes initializing and retrieving text pointers, flushing commands, handling errors, etc.


ct_results() processing

DSIEResSucceed The number of times a data server reported successful executions of a command batch.

DSIEResFail The number of times a data server reported failed executions of a command batch.

DSIEResDone The number of times a data server reported the results processing of a command batch execution as complete.

DSIEResStatus The number of times a data server reported a status in the results of a command batch execution.

DSIEResParm The number of times a data server reported a parameter, cursor or compute value in the results of a command batch execution.

DSIEResRow The number of times a data server reported a row as being returned in the results of a command batch execution.

DSIEResMsg The number of times a data server reported a message or format information as being returned in the results of a command batch execution.

DSIEResultTime Time, in 100ths of a second, to process the results of command batches submitted by a DSI.

Exception Processing

ErrsDeadlock Total times that a DSI thread failed to apply a transaction due to deadlocks in the target database (ASE Error 1205). Note that this does not track the times when deadlocks occur with parallel DSI’s, but only when RS deadlocks with another non-RS process.

ErrsLogFull Total times that a DSI thread failed to apply a transaction due to no available log space in the target database (ASE Error 1105).

ErrsLogSuspend Total times that a DSI thread failed to apply a transaction due to the target database being in log suspend mode (ASE Error 7415).

ErrsNoConn Total times that a DSI thread failed to apply a transaction due to no connections to the target database (ASE Error 1601).

ErrsOutofLock Total times that a DSI thread failed to apply a transaction due to no locks available in the target database (ASE Error 1204).

Commit Sequencing

DSIESCCTime Time, in 100ths of a second, to check the sequencing on commits.

DSIETranTime Time, in 100ths of a second, to process transactions by a DSI/E thread. This includes function string mapping, sending and processing results. A transaction may span command batches.

DSIEFinishTranTime The amount of time taken by a DSI/E to finish cleaning up from committing the latest tran. These clean up activities include awaking the next DSI/E (if using parallel DSI) and notifying the DSI/S.

However, the most useful DSIEXEC counters are the ‘time’ counters. In RS 12.6, the only counters were averages - which meant that the most useful way of looking at them was from a total perspective, requiring ‘re-calculating’ the original total that was used in the average:

FSMapTime = (DSIEFSMapTimeAve * CmdsApplied) / 100.0
BatchTime = (DSIEBatchTimeAve * DSIEBatch) / 100.0
SendTime = (SendTimeAvg * DSIEBatch) / 100.0
ResultTime = (DSIEResultTimeAve * DSIEBatch) / 100.0
CommitSeqTime = (DSIESCCTimeAve * TransApplied) / 100.0
BatchSeqTime = (DSIESCBTimeAve * TransApplied) / 100.0
TotalTranTime = (DSIETranTimeAve * TransApplied) / 100.0

RS 15.0 simplifies this thanks to the counter_total column in the rs_statdetail table. The key to all of these is to remember that we are executing command batches with the transaction group currently being dispatched by the DSIEXEC and that multiple groups may be executed by the DSIEXEC within the sample interval. Consequently, to get the time spent for each sample interval, we have to multiply the individual timing counters by the number of commands, batches or transactions processed by that DSIEXEC during that interval to get the total time spent on that aspect (note that this changes substantially in RS 15 as it tracks totals already). All the times reported by these counters are in 100ths of a second, consequently we need to normalize to seconds to make them more readable. From these we can most often find quite clearly where RS is spending the time. Let’s take the above ‘times’ in order of the execution and describe the likely causes:

FSMapTime - As noted earlier, this is the amount of time translating the replicated row functions into SQL commands. If there is a lot of time spent in this area, it could point to fairly big customized function strings - which you may not be able to do much about. However, you may wish to ensure that STS cache is sized appropriately.

BatchTime - As noted earlier as well, this is the amount of time creating the batches. Although it seems odd, when this value is high it almost always goes hand in hand with dsi_cmd_batch_size being too small. One possibility is that the overhead of batch creation - beyond the mechanics of appending the SQL clauses - is high enough that when the number of batches is large due to a low batch size setting, it adds up considerably.

BatchSeqTime - This, as described earlier, is the time spent trying to coordinate sending of the first batch in parallel DSI’s. A lengthy time could indicate that the dsi_serialization_method is wait_for_commit and a previous transaction is running a long time – or that the DSI thread is simply too busy to respond to the Batch Sequencing message.

SendTime - This represents the amount of time spent sending the command batch to the replicate data server. A high time here may indicate inefficient batching or slow response to client applications from the replicate server.

ResultTime - This calculated value can be used to determine the amount of time spent processing results from the replicate server. In actuality, this includes the execution time, as RS does very little result processing. Frequently, these metrics will be among the highest and point to a need to speed up the replicate DBMS as the key to improving RS throughput.

CommitSeqTime - This is the amount of time spent waiting to commit. Again, a high value may indicate a near-serial dsi_serialization_method such as wait_for_commit - or it also could point to contention within the replicate server - possibly within the rs_threads group.

TotalTranTime - Most of the time for 12.6 systems will be reported as TotalTranTime – which when you subtract the other components (FSMapTime, SendTime), leaves execution time by the replicate database as the result. And if this is the largest chunk of time, tuning RS isn’t going to help – you have to either tune the replicate database, use parallel DSI’s (and the key here is to achieve the greatest degree of parallelism without introducing prohibitive contention) or use minimal columns/repdefs to reduce the SQL execution time.
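Putting the RS 15.0 note above into practice, the following is a minimal RSSD query sketch for the same time breakdown. It assumes the statistics catalog table rs_statcounters (with counter_name and counter_id columns – an assumption, so verify against your RSSD) joined to rs_statdetail and its counter_total column mentioned earlier; the counter names are the RS 15.0 names listed in the tables above.

-- total seconds spent in each DSIEXEC 'time' bucket across the collected samples
select c.counter_name,
       total_seconds = sum(d.counter_total) / 100.0   -- counters are in 100ths of a second
from   rs_statcounters c, rs_statdetail d
where  c.counter_id = d.counter_id
  and  c.counter_name in ('DSIEBatchTime', 'SendTime', 'DSIEResultTime',
                          'DSIESCBTime', 'DSIESCCTime', 'DSIETranTime')
group by c.counter_name
order by 2 desc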

Above, we have also highlighted the two message check counters (MsgChecks, MsgChecksFailed). To understand how these counters can be useful, think back to the earlier diagram of the DSI to DSIEXEC intercommunications concerning batch and commit sequencing. As discussed at the beginning of this paper, inter-thread communications are conducted internally using OpenServer message structures – allowing asynchronous processing between the threads. Consequently, when a DSIEXEC puts a message such as ‘Batch Ready’ on the DSI message queue, it then checks its own message queue for the response. If the response is there, only the MsgChecks counter is incremented. If the expected message is not there, MsgChecksFailed is incremented along with MsgChecks. While the number of failures could be an obvious indication of a lengthy batch/commit sequencing issue, we don’t really need to look at that value too closely, as the RS monitor counters explicitly tell us how long the batch sequencing and commit sequencing times were. However, the number of message checks is handy from a different perspective: a very high number in comparison to the number of transaction groups or command batches processed gives us an indication of whether transaction grouping is effective (along with other explicit counters for this). Unfortunately, these counters were removed in RS 15.0.

DSI Post-Execution Processing

After the DSIEXEC finishes executing the SQL, it checks to see if it can commit. For parallel DSI’s this is done by first sending an rs_get_threadseq or using DSI Commit Control. If it can commit, it notifies the DSI – which in turn coordinates the commits among the DSIEXEC threads. If the thread is next to commit, the DSI sends a message to the DSIEXEC telling it to commit. Once the DSIEXEC has committed, it notifies the DSI that it successfully committed, and the DSI in turn notifies the SQM to truncate the queue of the delivered transaction groups. Additionally, the DSI handles regrouping transactions after a failure.

End-to-End Summary

The two most common questions that are asked are “Where do you begin?” followed closely by “How do you find where the latency is?” The answer to the first is actually the second question. When you think about it, with 3 near-synchronous pipelines for normal replication (2 for WS), any latency will manifest itself in one of three locations:

1. Primary Transaction Log
2. Inbound Queue
3. Outbound Queue

So, the first place to begin is to identify which of those three are lagging. The fastest way to isolate the problem is to do the following:

sp_help_rep_agent: Check the RepAgent state. If sleeping, then the RepAgent is caught up. If not sleeping, get sp_sysmon output to aid in further diagnostics (a short isql sketch of these checks follows this list).

admin who, sqm: Compare Next.Read with Last Seg.Block – although this is not totally accurate, if the dsi_sqt_max_cache size is <4MB, it is likely that if Next.Read is greater than Last Seg.Block, any latency is minor.

admin sqm_readers, queue#, 1: For WS applications, admin who, sqm is particularly ineffective. Similar to admin who, sqm though, this will show the Next.Read and Last Seg.Block relative positions.
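A minimal isql sketch of the first two checks (pubs2 is a placeholder primary database name).

In isql connected to the primary ASE:

sp_help_rep_agent pubs2
go

In isql connected to the Replication Server:

admin who, sqm
go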

The outcome of this will identify which of the three disk locations mentioned above contains the latency. Problem determination begins from that point forward according to the main near-synchronous pipelines:

TranLog -> RepAgent -> RS RepAgent User -> SQM (W) -> Inbound Queue
Inbound Queue -> SQT -> DIST -> Outbound SQM -> Outbound Queue
Outbound Queue -> DSI -> DSIEXEC -> RDB
Inbound Queue -> WS DSI -> WS DSIEXEC -> WS RDB (Warm Standby only)
Outbound Queue -> RSI -> RRS DIST -> Outbound SQM -> Outbound Queue (Route only)

For example, if the latency is in the inbound queue, for normal replication, you start by analyzing the SQT, DIST and outbound queue SQM threads, while for WS implementations, you focus on the WS DSI, DSIEXEC and RDB.

Once you know where you are beginning, the next step is to verify the latency by using the M&C and comparing the “commands”. Focusing on the RS, this will typically mean beginning with the SQM commands written. Alternatively, you can skip the admin who,sqm at the beginning and start simply by looking at the various “command” metrics across the full path through the RS. For example:

Sample Time  RepAgent CmdsTotal  Src SQM CmdsWritten  Src SQMR CmdsRead  SQT CmdsTotal  SQT TransRemoved  DIST CmdsTotal  Dest SQM CmdsWritten  Dest SQMR CmdsRead  DSI CmdsRead  DSIEXEC CmdsApplied

21:40:46 5,524 5,524 5,524 5,524 0 5,510 5,510 7,776 7,866 n/a

21:42:47 7,868 7,868 7,868 7,867 0 7,866 7,866 8,225 8,180 n/a

21:44:48 5,797 5,797 5,797 5,795 0 5,795 5,795 14,008 13,999 n/a

21:46:49 324 324 324 324 0 342 342 18,962 18,794 n/a

21:48:50 1 1 1 0 0 0 0 18,615 18,205 n/a

21:50:50 2 2 2 0 0 0 0 27,125 26,564 n/a

21:52:51 2 2 2 0 0 0 0 8,684 18,078 n/a

21:54:52 2 3 3 0 0 0 0 0 0 n/a


21:56:53 0 0 0 0 0 0 0 0 0 n/a

22:02:21 6 6 6 3 0 3 3 3 3 n/a

22:04:22 0 0 0 0 0 0 0 0 0 n/a

22:06:22 844 844 844 842 0 747 747 747 741 n/a

22:08:23 3,192 3,192 3,192 3,191 0 3,187 3,187 3,187 2,873 n/a

22:10:24 8,688 8,688 8,688 8,683 0 8,744 8,744 8,744 5,359 n/a

22:12:25 9,411 9,411 9,411 9,407 0 9,357 9,357 6,873 4,298 n/a

22:14:26 1,366 1,366 1,366 1,364 0 1,442 1,442 3,837 4,326 n/a

22:16:26 2,869 3,075 3,075 2,869 0 2,999 2,999 2,999 3,516 n/a

The DSIEXEC Cmds were not available as the customer who gathered the above did not collect all the statistics. However, enough is there to quickly determine the following:

• There definitely is latency in the DSI/DSIEXEC pipeline
• There may be latency at the source RepAgent, but we can not tell from the RS statistics.

Regardless of the example above, remember that latency in one thread may be the result of build up in threads further in the pipeline – classically SQT type problems.

After identifying where the problem is, the second step is to look for the obvious/common bottlenecks for each thread:

Thread/Module Common Issues

RepAgent User • RSSD interaction (rs_locater, etc.) • STS Cache • RepAgent Low packet size/scan batch size • SQM Write Waits

SQM (Write) • RSSD interaction • Slow Disks • Read Activity

SQM (Read) • Large Transactions • Write Activity • Physical Reads vs. Cached

SQT • Cache Size (too large or too small) • Large Transactions • DIST/Outbound Queue slow

DIST • No RepDefs • Large Transactions • RSSD Interaction • STS Cache • SQM Write Waits

DSI • Cache Size (too large or too small) • Large Transactions • Transaction Grouping configuration


DSIEXEC • Replicate DBMS response time • Command Batching configuration • Lack of Parallel DSI’s • Text/Image replication

RSI • RRS DIST/SQM slow • Network issues

These can readily be spotted by looking at the monitor counters detail in the previous sections.

One aspect to consider is that each of the pipelines mentioned above begins and ends with disk space. Even the replicate DBMS is, in effect, disk space, as DML statement execution depends on changing disk rows and logging those changes. The most frequent source of bottlenecks will be the components that talk to these disks – the RS SQM threads and the replicate DBMS. In any case, the RS M&C includes timers for these actions that allow you to confirm whether these endpoints are the problem.

From the previous sections we have tried to illustrate problems and provide general configuration guidance. A summary of this guidance is repeated here:

Thread/Module Common Issues

RepAgent Thread • Use large packet sizes • Use larger scan batch sizes • Watch RS response time

RepAgent User • Sts_full_cache rs_objects, rs_columns • Max sqm_write_request_limit • Tune RepAgent

SQM (Write) • Increase sqm_recover_seg

SQM (Read) • Max sqm_write_request_limit • Right-size SQT & DSI SQT Cache

SQT • Right size cache • Break up large transactions (app change)

DIST • Use table RepDefs • Sts_full_cache rs_objects, rs_columns, rs_functions. • Set sts_cache_size to 1,000 or higher • Max md_sqm_write_request_limit

DSI • Right-size DSI SQT Cache – cache should be able to hold 1.5-2 times (max) the number of grouped transactions that you execute on average • Target dsi_max_xacts_in_group • Max dsi_xact_group_size and dsi_large_xact_size to eliminate their effects

DSIEXEC • Target dsi_cmd_batch_size to full tran group (40KB+ as starting point) or 50 commands • Watch RDB DBMS response times • Use Parallel DSI’s

At this point, we are done looking in detail at the RS aspects to the problem and can focus on the replicate database & replicate DBMS. This is appropriate as probably 90% of all latency problems stem from the SQL execution speed at the replicate database.


Replicate Dataserver/Database

You gotta tune this too!! Often when people are quick to blame the Replication Server for performance issues, it turns out the real cause of the problem is the replicate database. As with any client application, the lack of a tuned replicate database system really impedes transaction delivery rates. Two things contribute to the Replication Server being blamed quickly for this:

1. As a strictly write-intensive process, poor design is quickly evident.
2. Administrators monitor replication delivery rates more readily than DBMS performance.

In fact, it is an extremely rare database shop these days that regularly monitors their system performance beyond the basic CPU loading and disk I/O metrics.

Key Concept #18: Not only is a well tuned replicate dataserver crucial to Replication Server performance, but a well instrumented primary and replicate dataserver is critical to determining the root cause of performance problems when they do occur.

The purpose of this section is not to discuss how to tune a replicate dataserver, as that can be extremely situation dependent. However, several points to consider and common problems associated with replication will be discussed.

Maintenance User Performance Monitoring

For ASE based systems, it is critical to have the Monitoring Diagnostic API (MDA) Tables set up for performance monitoring of the primary and replicate dataservers (possibly the RSSD as well if located on an ASE with production users). Because the MDA tables can be accessed directly via SQL and provide process level metrics, you can get a clear picture of replication maintenance user specific activity. However, there are a couple of nuances when using MDA based monitoring of the replicate database:

• The maintenance user may disconnect/reconnect during the following circumstances:
    o Errors mapped to stop replication
    o Parallel transaction failure due to deadlock or commit control intervention
    o DSI fadeout due to inactivity

• As with any MDA based monitoring, a series of samples using a short sample interval will be necessary to determine rates of change between samples

• Most MDA tables are not stateful - but only show the cumulative values for the current sample period
• When querying the MDA tables, using the known parameters can reduce the query time significantly.

While it might be tempting to simply look for the maintenance user by SPID, the first point above should illustrate that the SPID is likely not too reliable, as any disconnect/reconnect can change the SPID. Even if it reconnects with the same SPID, the KPID will differ, meaning that counter values for the previous SPID will be lost for all but the stateful tables.
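None of the queries that follow return data unless MDA monitoring is actually enabled; the following is a minimal sp_configure sketch of the options most relevant to this section (for example, monProcessWaits requires both 'process wait events' and 'wait event timing'):

sp_configure 'enable monitoring', 1
go
sp_configure 'wait event timing', 1
go
sp_configure 'process wait events', 1
go
sp_configure 'per object statistics active', 1
go
sp_configure 'statement statistics active', 1
go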

Additionally, when monitoring the maintenance user, we primarily are interested in determining the following conditions:

• How quickly statements are executed - since all we are executing are DML operations based on primary keys or atomic inserts (ignoring procedure replication), most statements should execute extremely quickly

• What the maintenance user process within ASE is waiting on
• Possibly, how even the distribution of the workload is across parallel DSI configurations - although this can be skewed by large transactions and other conditions.

While the goal of this section is not to teach how to monitor ASE using the MDA tables - existing white papers already cover this topic - the tables and queries contained in this section will focus primarily on the tables most applicable to monitoring maintenance user performance. Consider the following:


Figure 39 - Useful ASE 15.0.1 MDA Tables for Monitoring Maintenance User Performance

The first trick is to identify which of the SPID/KPID combinations we are interested in. Logically, it might be tempting to retrieve the SPID/KPID pairs either from master..sysprocesses or from the monProcessLookup table (at the top in the diagram). However, from the above diagram, you can see that ServerUserID is also in the monProcessActivity table - which will need to be queried anyhow. As a result, prior to each sample, run a query such as:

declare @SampleTime datetime
select @SampleTime = getdate()

select SampleTime = @SampleTime, *
into #monProcessActivity
from master..monProcessActivity
where ServerUserID = suser_id('<maint_user_name>')

The other tables can then be queried using a join with this table to narrow the results to only the SPID/KPID pairs used by the maintenance user in question. In the next paragraphs, we will use this diagram to highlight special points of interest for monitoring the performance of the maintenance user.

Maintenance User Wait Events

The best starting point for detecting maintenance user performance issues is to begin by looking at the “Wait Events” from monProcessWaits (bottom center of diagram). This table is key to determining how long the maintenance user task spent waiting for disk I/O, network I/O, CPU access, etc. Assuming we had used the above query to determine which SPID/KPID’s we are interested in, the query to retrieve the wait events would be:


select SampleTime = @SampleTime, w.*
into #monProcessWaits
from master..monProcessWaits w, #monProcessActivity a
where a.SPID = w.SPID
  and a.KPID = w.KPID

select w.*, e.EventDescription
into #WaitEvents
from #monProcessWaits w, master..monWaitEventInfo e
where w.WaitEventID = e.WaitEventID

Once we have the “wait events”, we need to find the ones of key interest. Looking at the schema for the monProcessWaits table, we see that there are two columns for the metrics - Waits and WaitTime. A logical assumption might be to focus on the WaitTime, however, there is a slight consideration that may make this not as important. ASE measures time based on a timeslice or “ticks”, which by default is 100 milliseconds. In measuring wait events, the server simply subtracts the timeslice a process was put to sleep in from the timeslice value when it was woken up. If it is the same timeslice, a wait event is recorded with a WaitTime of 0. Consequently a handy query for weighting the Waits and the WaitTime equitably might be a query similar to:

select SampleTime, WaitEventID, Waits, WaitTime,
       MaxWaitTime = (case when Waits * 100 > WaitTime
                           then Waits * 100
                           else WaitTime end),
       EventDescription
from #WaitEvents
where Waits * 100 > 0
order by 5    -- order by MaxWaitTime

The following table lists some common wait events that you might see for a maintenance user:

WaitEventID Event Description

CPU Related

214 waiting on run queue after yield

215 waiting on run queue after sleep

Disk Read Related

29 waiting for regular buffer read to complete

Memory/Cache Related

33 waiting for buffer read to complete

34 waiting for buffer write to complete

36 waiting for MASS to finish writing before changing

37 wait for MASS to finish changing before changing

Disk Write Related

51 waiting for last i/o on MASS to complete

52 waiting for i/o on MASS initiated by another task

Transaction Log/Write Related

54 waiting for write of the last log page to complete

55 wait for i/o to finish after writing last log page

Network Receive

250 waiting for incoming network data

Network Send

171 waiting for CTLIB event to complete

251 waiting for network send to complete


Contention/Blocking Related

150 waiting for a lock

41 wait to acquire latch

Internals/Spinlocks

272 waiting for lock on ULC

Some of the more common issues are discussed below.

CPU Contention

If there is a high degree of CPU contention (wait events 214 & 215), you will need to consider the priority of the maintenance user as well as the number of parallel DSI threads being used. In the case of the former, if the replicate database is also being used by production users for reporting purposes or in a peer-to-peer fashion, the maintenance users are competing for CPU time with the production users. If the replication latency is greater than desired, you have a couple of options available:

• Increase the maintenance user priority to EC1 (see the sketch below)
• Use engine grouping to restrict reporting users to a subset of engines while focusing the maintenance user at the remaining engines
• Increase the number of engines
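As an illustration of the first two options, the following is a minimal sketch using ASE's execution class procedures; the login names, engine numbers and the engine group name are hypothetical and would need to be adjusted to your environment:

-- bind the maintenance user login to the predefined high-priority class EC1
exec sp_bindexeclass 'pubs2_maint', 'LG', NULL, 'EC1'
go
-- create an engine group for reporting users (engines 0 and 1 only)
exec sp_addengine 0, 'report_group'
exec sp_addengine 1, 'report_group'
go
-- create a lower-priority execution class on that group and bind the reporting login to it
exec sp_addexeclass 'report_ec', 'LOW', 0, 'report_group'
exec sp_bindexeclass 'report_login', 'LG', NULL, 'report_ec'
go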

If CPU contention is high and parallel DSI threads are being used, consider reducing the number of threads to see if any improvement in throughput occurs. A good starting rule of thumb is 5-10 threads per engine as a maximum.

Disk Read Delays

While delays due to disk reads certainly could be due to slow disk drives or disk contention, a much more likely cause for the maintenance user is excessive I/O due to a bad query plan. This can happen particularly for updates and deletes when the table is missing indexes on the primary key columns and during inserts when the clustered index is not unique and is non-selective (based on low-cardinality columns). This can be confirmed by looking at the statement and object statistics as will be described in “Query Related Causes” section later.

Memory/Cache Contention

Normally, individual logical I/O’s as represented by wait events 33 & 34 will not be a problem. If they are, one possible cause - particularly when the machine is used by production users - is too few cache partitions. The most common memory contention issue for maintenance users, however, will be focused on the Memory Address Space Segment (MASS) spinlocks. A MASS is a way of controlling concurrent access to a group of contiguous pages in memory - typically 8 pages. For example, if a query results in an APF pre-fetch of an entire extent, all 8 pages are read from disk and placed into cache. While those pages are being placed into cache, other users are prevented from trying to use those same pages by the MASS bit. Once in memory, user DML statements may cause several pages to be updated (marked dirty). When the housekeeper, checkpoint process or other write operation forces the pages to be flushed, ASE will, for I/O efficiency, do a multi-page write of the pages within the MASS - and again, to safely record the pages as having been flushed, concurrent user access during the write operation is blocked.

In the case of Replication Server maintenance users, the most common form of MASS contention occurs in a high insert environment, where the parallel DSI threads are all attempting to append rows to heap tables or to tables whose clustered index is ordered by a monotonically increasing key (including datetime values). As a result, if one parallel DSI has just filled one page, the next insert from a different parallel DSI may have to allocate a new page for the object and may try to append it to the same MASS area. Using cache partitions may alleviate this problem.
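As a hedged sketch (the cache name and partition count below are illustrative only), cache partitioning is configured via sp_cacheconfig; note that in many ASE versions changing the partition count is a static change and may require a restart:

-- partition the default data cache 8 ways to reduce MASS/spinlock contention
exec sp_cacheconfig 'default data cache', 'cache_partition=8'
go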

Disk Write Delays

As mentioned in the previous paragraph, ASE does all I/O write requests using as large of an I/O as possible. For example, if 2 or 3 contiguous pages in a cache MASS area are dirty, ASE will attempt a 2 or 3 page I/O sized write (4-6K for 2K page sized servers). Note that writes of data pages normally only happen when either the housekeeper flushes a page, when the wash marker is reached, or a checkpoint process flushes the pages based on the recovery interval. As a result, if you see a lot of write based delays, you may first want to look at the monDeviceIO/monIOQueue tables (not in the above diagram) along with OS utilities such as sar to see if slow disk response times, or ASE configuration values are causing the IO times to be longer than normal.


However, if the majority of the write delays are due to waiting for the MASS to complete from a different user, this suggests that in a high insert environment you need more cache partitions or the clustered index is forcing parallel DSI’s to insert into the same page - and the housekeeper/checkpoint is forcing a disk flush before the page is completely full.

Transaction Log Delays

In the MDA tables, transaction log based delays are collectively grouped with disk write activity - but due to the differences in causes, we separated them into different sections for this discussion.

In the above list, there were two transaction log delay wait events - 54 & 55. The first one (54) actually is referring to waiting to get access to the transaction log to flush the maintenance user’s ULC to the primary log cache. Commonly we might associate this with log semaphore contention. This can be verified by looking at the monOpenDatabases table, which has columns that track the AppendLogRequests and the AppendLogWaits. If the maintenance users appear to be waiting on the log semaphore and the replicate system is not being used by production users, it could point to a need to increase the ULC size at the replicate or speed up the physical log I/O of the process that currently has the log semaphore.
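A quick way to check for log semaphore contention is a query along the following lines against monOpenDatabases (the contention percentage calculation is purely illustrative):

select DBID, db_name(DBID) as DBName,
       AppendLogRequests, AppendLogWaits,
       AppendLogContention = case when AppendLogRequests > 0
            then convert(numeric(5,2), 100.0 * AppendLogWaits / AppendLogRequests)
            else 0 end
  from master..monOpenDatabases
 order by AppendLogWaits desc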

The second condition (55) suggests that either the log device is slow in responding or that the number of writes per log page is causing the last log page to be busy. As of ASE 15.0, one possible solution for this is to enable ‘delayed commit’ - either for the entire database - or just for the maintenance users. If modifying just for the maintenance users, you will need to modify one of the class scope function strings executed at the beginning of the DSI connection sequence - such as rs_usedb. The danger in this is that non-ASE 15.0 servers may not understand this command, so you will likely need to create a user defined function class that inherits from rs_sqlserver_function_class to minimize the impact and the work involved to implement this capability.
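For illustration, delayed commit can be enabled either database-wide or per session; the database name below is hypothetical, and the usual caveat applies that delayed commit trades a small durability window for log write throughput:

-- database-wide (ASE 15.0 and later)
exec sp_dboption rdb1, 'delayed commit', true
go
-- or per session, e.g. issued from a modified rs_usedb function string
set delayed_commit on
go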

Network Receive Delays

This is likely the largest single cause of latency and as a result, any real attempt at improving the throughput of a maintenance user will likely need to begin with this. As a whole, the problem can be caused by:

• RS slow in sending commands to the ASE due to spending time on other processes
• ASE slow in parsing, compiling, and optimizing language commands as typical DML statements are sent by RS

The first one can be double-checked by looking at the DSIEXEC time related counters. If no real appreciable time is being spent in batching or function string conversion, and nearly all the time is spent in the send/execute and results processing windows, then it most likely is the second cause.

The second cause is a bit nasty. While Replication Server could be viewed as sending very simplistic SQL statements (atomic inserts, updates and deletes based on primary keys), the issue is that every statement sent to the replicate DBMS needs to be parsed, compiled, optimized and then executed. In reality, execution (less any contention or other causes) is by far the least of these times. This has been proven in test scenarios involving high insert environments in which fully prepared SQL statements were 3-10 times faster than the equivalent language commands. The reason was that fully prepared SQL statements create a dynamic procedure that is executed repeatedly by simply sending the parameter values with each call vs. a language command. It was further proven that the most expensive part of the delay was due to compilation or optimization, as it was determined that language procedure calls did not exhibit the same delays as language DML statements.

Beginning with ASE 12.5.2, Sybase introduced statement caching. When enabled, as each SQL command is received, it is hashed with an MD5 hash for that login and environment settings (such as isolation level). If the hash matches an already executed query, that query’s optimization plan is used instead. However, the ASE 12.5.2 statement cache did not benefit Replication Server environments due to the following reasons:

• The literal values were included in the hash key - consequently updates or deletes - especially those caused by a single statement at the source - could not use the statement cache as the literal values for the primary keys differed.

• Statement caching was not used for atomic insert/values statements.

In ASE 15.0.1, the first restriction was removed by adding a configuration setting to control ‘literal parameterization’ as well as a session setting. RS environments are strongly encouraged to enable this if the environment sustains a lot of update or delete activity. In the future, ASE 15.0.2 is looking at providing (note this is a future release - normal caveats about future functionality apply) the same capability for atomic insert/values statements which should benefit RS environments greatly. In addition, on a parallel effort, Replication Server engineering is looking at an enhancement to RS 15.0 (again, caveats regarding future release functionality apply) that would enable RS to send dynamic SQL vs. language statements. Early tests with this have reported substantial improvements.
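If you want to experiment with this on ASE 15.0.1, the relevant knobs are sketched below (the cache size value is illustrative only):

-- server-wide: enable the statement cache and literal parameterization
exec sp_configure 'statement cache size', 10000
exec sp_configure 'enable literal autoparam', 1
go
-- or per session (e.g. for the maintenance user connection only)
set literal_autoparam on
go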


Until either ASE 15.0.2 or RS 15.0 are enhanced to resolve the ASE optimization issue, significant improvements in RS throughput can be achieved by using stored procedures and changing the function strings to call the stored procedures instead of the default language commands.
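As a sketch of that approach (all of the object names below - the derived class, replication definition, procedure, columns and connection - are hypothetical), a derived function-string class keeps non-ASE replicates unaffected while the rs_insert string is redirected to a procedure call; depending on whether the string already exists in the derived class, create function string may be needed instead of alter:

create function string class my_ase_class
set parent to rs_sqlserver_function_class
go
alter function string orders_repdef.rs_insert
for my_ase_class
output language
'exec ins_order_proc ?order_id!new?, ?cust_id!new?, ?amount!new?'
go
alter connection to RDS.rdb
set function string class to my_ase_class
go

Note that the DSI connection typically needs to be suspended and resumed around the alter connection command.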

Network Send Delays

Network send delays can be caused by several factors within a replicate database:

• The maintenance user task was running on one engine, but needs to perform network I/O on a different network engine that it is connected to.

• ASE CPU contention is preventing the task from being scheduled quickly enough to tell if the network send was acknowledged.

• The replicated procedure or trigger contains a number of print statements - particularly if the setting ‘set flushmessage on’ is enabled.

• RS is slow at processing the results.

The first is the most likely cause on larger systems. Unfortunately, while engine to CPU affinity can be set via dbcc tune(), task to engine affinity is not explicitly supported within Sybase ASE. If the replicate DBMS has a large number of SMP engines, the only real alternative is to use engine groups to try to constrain the maintenance users to a subset of CPU's - thereby reducing the task migration. However, this should be done with extreme caution and only after verifying that task migration is occurring. One way it can be verified is by reducing the sample interval significantly and then monitoring the monProcess.EngineNumber column for the same SPID/KPID pairs. If task migration is occurring frequently, an engine group may be desired.

On smaller systems or non-parallel DSI environments, the most likely cause will be the second one. Again, this may point to the need to either increase the process priority for the maintenance user or use engine grouping to deconflict with other production users.

The third cause can be alleviated by changing the proc/trigger code by bracketing print statements as well as the set flushmessage setting with a check for either the replication_role or the maintenance user by name - or by ensuring that triggers are disabled at the replicate if the print statements are within triggers. However, it is unlikely that this will be a significant cause.
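A minimal illustration of bracketing such statements (the variable and message text are hypothetical):

if proc_role('replication_role') = 0
begin
    -- only chat back to interactive users, never to the RS maintenance user
    set flushmessage on
    print 'processed order %1!', @order_msg
end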

Contention/Blocking Related Delays

With parallel DSI’s or other production users on the replicate system, you will need to monitor this closely. Of the two listed, the logical lock event (150) corresponds directly to a lock contention issue either at a page or row level. The specific table involved can be diagnosed via monOpenObjectActivity. While monLocks may seem the most apparent, because the lock hash table changes so rapidly, it would be difficult to spot transient blocks.
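For example, a query along these lines will surface the tables on which the maintenance user (or anyone else) is waiting on locks:

select DBID, ObjectID, object_name(ObjectID, DBID) as TableName,
       LockRequests, LockWaits
  from master..monOpenObjectActivity
 where LockWaits > 0
 order by LockWaits desc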

Latch contention is likely caused by inserts into the same index pages by parallel threads and typically are not a major concern as latch duration is extremely short.

Internal/Spinlock Delays

Another common wait event for maintenance users is waiting for a lock on their own ULC. This can be caused by two primary issues:

• A low/default setting for the server configuration “user log cache spinlock ratio”
• ULC flushing to the transaction log

The first one is a setting that is often not changed by DBA’s. By default, this means that a single spinlock is used for every 20 ASE processes. For most replicate/standby databases attempting to use parallel DSI threads, the result is that likely only a single spinlock is used for all the parallel threads. Since this is a dynamic parameter, you may wish to reduce this to a low single digit (1-3) to see if it alleviates any delays.
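For example (the value of 1 is simply the most aggressive setting - one spinlock per ULC):

exec sp_configure 'user log cache spinlock ratio', 1
go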

A second cause is that when a user’s ULC is flushed to the transaction log, the ULC is locked from the user to prevent overwriting of the log pages in the ULC. If the above doesn’t help, then this is the likely cause. Unless the ULC is full for the maintenance user, there likely is not a lot that can be done about alleviating this problem.
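Whether the ULC is actually filling can be checked from the same monProcessActivity snapshots taken earlier (the maintenance user name is a placeholder):

select SPID, ULCBytesWritten, ULCFlushes, ULCFlushFull
  from master..monProcessActivity
 where ServerUserID = suser_id('<maint_user_name>')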

Warm Standby, MSA and the Need for RepDefs

When Sybase implemented Warm Standby Replication - and later the Multi-Standby Architecture (MSA) - the need for individual replication definitions for each table was made optional. The goal was to greatly simplify replication installation and setup for simple systems. However, replication definitions are strongly recommended in high volume systems and in most other cases, due to the following reasons:


• As mentioned earlier, minimal column replication is allowed with replication definitions - although this is enabled for the standby database in a WS or MSA setup by default without a repdef, a common implementation today includes reporting/historical database feeds from the standby system. When minimal column replication is enabled, replicate database performance can be improved for updates as the number of unsafe indexes is reduced and a direct in-place update may be doable instead of a more expensive implementation.

• Primary keys are identified. Without a primary key, the RS has to assume all non-text/image/rawobject columns are part of the primary key. The result is not only that the generated where clause is a lot longer, but also that during execution, each part of the where clause has to be compared vs. strictly the primary key values. By having a repdef and defining the primary key, the time it takes to generate the SQL statement within RS is shorter and the execution at the replicate is also shorter.

• In some cases, not having a repdef can lead to database inconsistencies - especially when the table contains a float, real, or double datatype, ansinull is enforced or other similar conditions (such as data modifications due to a trigger if dsi_keep_triggers is “on”). Even with repdefs, if different character sets/sort orders are used, database inconsistencies could result.

While the first two have either been explained before or are self-evident, the last bullet may catch some by surprise. Let’s take a look at each of these, with the exception of the discussion on triggers which is covered in a later section. Before we do this, however, it is extremely important to note that unless the replication definition contains the ‘send standby’ clause, it will not be used by Warm Standby or MSA for primary key or other determination.
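For reference, a minimal replication definition of this form (the table, columns and server/database names are illustrative) would resemble the following; minimal column replication can also be requested on the same definition:

create replication definition titles_repdef
with primary at PDS.pubs2
with all tables named 'titles'
(title_id varchar(6), title varchar(80), price money, pubdate datetime)
primary key (title_id)
send standby replication definition columns
go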

Approximate Numerics & RepDefs

Without a replication definition, all non-BLOB columns are included in the primary key/where clause generation for updates and deletes. Most data movement systems encode data values as ASCII text values for transport between the systems. When applied to the destination system, the destination database language handler translates the string literal ASCII number to the binary numerical representation – typically by calling the C library routine atof(). If a different host platform is involved, different operating system versions or different cpu hardware within the same family, the translation on the destination machine may be slightly different than at the origin. For example, inserting a value of 12.0 on the primary may result in a translated value of 11.999999999999998 at the destination. Even worse, an insert of 12.0 at the primary may get stored as 12.000000000001 at the primary, replicated as 12.00000001 and stored at the replicate as 12.000000002. If basic scientific principles such as rounding to a specified number of significant digits were implemented in the application, this slight difference in the stored value may not be an issue for the application. However, Replication Server does not support significant digit rounding.

The problem becomes especially acute when the float column is a member of the primary key, or if the primary key is not specified and all columns are used to define the where clause for update or delete DML operations. Because of the approximate nature of the float datatype, the new value may not match the stored value resulting in not finding the row. Again, for example, assuming that the original system stored a “12.0” perfectly, however, when the row was sent to the destination, it ended up as 11.999999999998. Consider the impact of the following type of query for a subsequent update:

Update data_table Set column = new_value Where obj_id=12345 and float_column = 12.0

Note that the result is not an error. What happens is that the update simply affects 0 rows. Similarly a delete hits zero rows. This can result in either database inconsistencies or errors that stop replication. Consider what happens if an application deletes a row and then later the same row is reinserted. While this does not appear to be common, it can happen in work tables as well as older GUI’s that translated primary key updates into delete and insert statements. The result is that at the primary, possibly everything is fine. However, at the replicate, it is likely a duplicate key error will result on the insert. The reason is that the delete will likely miss the desired row due to the float datatype. The subsequent insert will then fail as any unique index or constraint will flag the duplicate and raise the error (unless ignore_dupe_key is set).

When database inconsistencies are reported to Sybase with a Warm Standby system, the presence of approximate numeric datatypes/lack of repdefs leads the causes by a wide margin when materialization errors are excluded. As a result, float or any approximate numeric should not be used as a primary key or a searchable column - and if a table contains a float datatype, a replication definition must be used.


ANSINULL enforcement

If ANSINULL is enabled, database comparisons using a syntax such as column=null are always treated to be false. By definition then, if a warm standby is created and ansinull is enforced, then without a primary key, it is likely that nearly every update and delete will fail to work correctly as any column containing a null value will result in 0 rows affected.

Those that are alert may point out that this requires the connection to issue the ‘set ansinull on’ statement whereas the default is ‘set ansinull off’ (or fipsflagger). However, in 12.5.4, both of these settings can now be exported from a login trigger - consequently care must be taken to ensure that the login trigger doesn’t set these automatically for the maintenance user.
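A sketch of guarding a login trigger is shown below; the procedure name, the maintenance login naming convention and the exported options are all assumptions for illustration:

create procedure login_env_proc
as
begin
    -- do not export ansinull/fipsflagger settings to the RS maintenance user's session
    if charindex('_maint', suser_name()) = 0
    begin
        set ansinull on
        set fipsflagger on
        set export_options on
    end
end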

Different Character Sets/Sort Orders

If replicating between different character sets and sort orders, a primary key may help reduce database inconsistencies caused by character conversion/sort comparison. The most common example of this is when the original system uses binary sort order and the standby uses case-insensitive sort order. Whether or not the table has a replication definition, if any part of the actual key includes character data, database inconsistencies can happen. Consider the case in which last name may be part of the primary key and two records are inserted with the only distinction in the key values being that in one case the name is “McDonald” and the other “Mcdonald” - while other non-key attributes may differ. Now, if the table has a repdef, the generated update or delete could resemble:

Delete data_table Where first_name = ‘Fred’ and last_name= ‘McDonald’

With a repdef and primary key, the replicated delete may affect more than one row at the replicate. Without a replication definition, the other attributes may differ and prevent the problem. Consequently, if the primary uses a case sensitive sort order and the replicate uses a case insensitive sort order, replication definitions may not be recommended, but even then, database consistency is not guaranteed.

In other cases, when using different character sets, not specifying a primary key - especially if a localized system only uses numeric keys vs. character data - could result in database inconsistencies. As a result, it is safe to say that any warm standby or MSA implementation between different character sets or sort-orders is risky and could result in data inconsistencies.

Query Related Causes

While the language command optimization issue (see Network Receive Delays above) is likely the biggest cause of throughput issues for high-insert intensive environments, a close second - especially for update/delete intensive transactions - is standard query related problems.

As an example, as of this writing, a common financial trading application includes a delete statement without a where clause. While it is likely that this was done prior to truncate table being a grantable option (ASE 12.5.2), forcing non-table owners to truncate the table in this fashion, the bigger problem was that the table did not have any defined primary key constraint nor any unique indices (although an identity column existed and had a nonunique index defined solely on that column). Equally problematic was that this table easily contains ~1 million rows or more. In a typical lazy standby implementation that does not have a repdef defined, the result is instantaneously disastrous as the RS latency stretches for hours. The problem is that while the delete is a single statement at the primary, as you can guess by now, each row becomes a single delete at the replicate - and lacking any index information based on the where clause - it promptly becomes a table scan for each delete. One million table scans, to be precise.

While this may be an extreme example, when triggers are enabled, procedure replication is being used - or if repdefs are not being used, you will need to carefully monitor the query performance at the replicate. The main tables that will help with this are illustrated here:


Figure 40 - MDA Tables Useful for Query Analysis

Note that the table monSysPlanText was excluded from the above. While the query plan could confirm what is happening, the need to configure an appreciable pipe size and the impact that configuration value has on execution speed led us to avoid it. However, for particularly perplexing issues, it still may be required.

To begin with, you will want to make sure that the monProcessActivity.TableAccesses, IndexAccesses and LogicalReads/PagesWritten have the correct relative ratios for the maintenance users. For example, if the number of TableAccesses is high, it could be an indication of a table scan - which should also be evident as the number of LogicalReads may be orders of magnitude higher than expected. The obvious question is ‘What are the expected orders of magnitude?’ The answer is that it depends on the operation, the minimal column replication setting and the volatility of the indexed columns. Consider the following table:

Operation   I/O pattern                                                              Typical Cost

Insert      1 index traversal to locate the insert point (reads), a write for
            the data row; index traversals to locate the index key insert points
            and writes for each index key                                            50-75

Update      PK index traversal to locate the row, a write for the data row,
            index traversals for each unsafe index plus index key overwrites         10-50

Delete      PK index traversal to locate the row, a write to delete the row,
            index traversals for all indexes plus index key deletion                 50-75

As a result, if the delta between two samples shows that the maintenance user did 100,000 logical I/O’s but only did 60 page writes, this points to a likely indexing issue.
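Since the monProcessActivity counters are cumulative, the comparison has to be made on the delta between two snapshots; a sketch (assuming the two snapshots were saved into #activity_t1 and #activity_t2) is:

select t2.SPID,
       TableAccesses = t2.TableAccesses - t1.TableAccesses,
       IndexAccesses = t2.IndexAccesses - t1.IndexAccesses,
       LogicalReads  = t2.LogicalReads  - t1.LogicalReads,
       PagesWritten  = t2.PagesWritten  - t1.PagesWritten
  from #activity_t1 t1, #activity_t2 t2
 where t1.SPID = t2.SPID
   and t1.KPID = t2.KPID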

To find the issue, the next step is to try to isolate which object it is occurring for. There are several possibilities for this. The first is monProcessObject, but it is unlikely to help as it only records the object statistics for the currently executing statement in the batch. Consequently, unless the server just happened to be still executing the bad statement, it is unlikely that this will provide any useful information. monProcessStatement has the same issue.

The second likely answer is to use monOpenObjectActivity. If no other production users are on the system, the task is a simple comparison of the LogicalReads/PagesWritten ratio - and in addition, you can look for a table in which the IndexID=0 and a non-null LastUsedDate (indicative of a table scan).
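For example (assuming no other significant activity on the replicate), a query such as the following highlights tables that are being scanned:

select DBID, ObjectID, object_name(ObjectID, DBID) as TableName,
       IndexID, LogicalReads, UsedCount, LastUsedDate
  from master..monOpenObjectActivity
 where IndexID = 0
   and LastUsedDate is not null
 order by LogicalReads desc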

Failing that, you can use monSysStatement and again compare the LogicalReads/PagesModified (and in ASE 15.0.1 the new RowsAffected column) for the maintenance user SPID/KPID pairs. While this can prove beyond a shadow of a doubt that an ineffective index was being used (or, if proc replication or triggers are enabled, bad logic within them), the actual table involved cannot be identified without monSysSQLText.

Regardless, if triggers are still enabled or procedure replication is occurring, you will need to watch monSysStatement closely for the maintenance user and attempt to keep the total IO cost of any triggers/procedures to the absolute minimum - which may mean that triggers may have to be rewritten to avoid joins with the insert/deleted tables and be optimized for single row DML statements.

Triggers & Stored Procedures

In this discussion, we are not focusing on stored procedure replication - but rather what can happen when triggers are enabled and in particular when the trigger calls stored procedures at the replicate database.

Triggers & Database Inconsistencies

Other than float/approximate datatype issues, the second (and a distant second) most common cause of inconsistencies as a result of not having replication definitions is when triggers are enabled. For a standard warm-standby, triggers are disabled by default via “dsi_keep_triggers”. However, if replicating stored procedures, DBAs may have changed this setting as they have been instructed to do so to ensure the integrity of actions with replicated procedures. Or, some DBAs have simply enabled triggers out of fear that without them database inconsistencies could result. Additionally, for MSA implementations, the default setting is that triggers are enabled.

Some of the most common fields modified by triggers include auditing data (such as last update time), aggregate values, derived values, etc. Typically, these columns are not part of the primary key. As a result, if no replication definition is found, the update or deletes may fail as the actual values for these columns may differ.

There is a common fallacy that triggers should be enabled for all replication except Warm Standby – and that this is the only way to guarantee database consistency. Actually this is only true for the following situations:

1. Not all the tables in the database are being replicated, and one of the replicated tables has a trigger that maintains another table (i.e. a history table) that is not replicated, but similar table maintenance is desired at the replicate

2. A stored procedure that is replicated has DML statements that affect tables with triggers that update other tables (replicated or not) in the same database.

The latter reason is likely the most common – however, leaving dsi_keep_triggers to ‘on’ just for this cause is grossly inefficient as a more optimal solution would be to have the proc check @@options and manually issue ‘set triggers on/off’ as necessary. To balance the above, there are cases where leaving the triggers enabled would result in database inconsistencies as well. Consider the following:

1. All tables in the database are replicated.
2. The trigger calls a stored procedure that does a rollback transaction or returns a negative return code between -1 and -99.

The first case is fairly obvious. Any trigger that causes an insert (i.e. maintains a history table) or does an update to an aggregate value will cause problems at the replicate – either throwing duplicate key errors – or the triggered DML statements from the primary will clobber the triggered changes at the replicate – and the values may be different.


The second case is really interesting and requires a bit of knowledge of ASE internals. Returning a negative number from a stored procedure return code is something that is fairly common among SQL developers. Now, we all know that just because something is documented as something developers shouldn’t do doesn’t mean that we all obey it. Case in point is that the ASE Reference Manual clearly states that:

One aspect for the customer to consider is that return values 0 through -99 are reserved by Sybase. For example:

0 Procedure executed without error

-1 Missing object

-2 Datatype error

-3 Process was chosen as deadlock victim

-4 Permission error

-5 Syntax error

-6 Miscellaneous user error

-7 Resource error, such as out of space

-8 Non-fatal internal problem

-9 System limit was reached

-10 Fatal internal inconsistency

-11 Fatal internal inconsistency

-12 Table or index is corrupt

-13 Database is corrupt

-14 Hardware error

Now then, consider the following schema:

use pubs2
go
create table trigger_test (
    rownum      int identity not null,
    some_chars  varchar(40) not null,
    primary key (rownum)
) lock datarows
go
create table hist_table_1 (
    rownum      int not null,
    ins_date    datetime not null,
    primary key (rownum, ins_date)
) lock datarows
go
create table hist_table_2 (
    rownum      int not null,
    ins_date    datetime not null,
    primary key (rownum, ins_date)
) lock datarows
go
create procedure bad_example
    @rownum int
as
begin
    declare @curdate datetime
    select @curdate = getdate()
    insert into hist_table_2 values (@rownum, @curdate)
    return -4
end
go
create trigger trigger_test_trg on trigger_test for insert
as
begin
    declare @currow int
    select @currow = rownum from inserted
    insert into hist_table_1 values (@currow, getdate())
    exec bad_example @currow
end
go

Note the return statement in bad_example - the proc returns -4 - no error raised…..just a negative return code. We would expect that by inserting a row into trigger_test that the trigger would fire, inserting a row in hist_table_1, then calling the proc which would insert a row in hist_table_2….let’s try it:


---------- isql ----------
1> use pubs2
1> truncate table trigger_test
1> begin tran
1> insert into trigger_test (some_chars) values ("Testing 1 2 3...")
2> select @@error
(1 row affected)
1> commit tran
1> select * from trigger_test
2> select * from hist_table_1
3> select * from hist_table_2
 rownum      some_chars
 ----------- ----------------------------------------
(0 rows affected)
 rownum      ins_date
 ----------- --------------------------
(0 rows affected)
 rownum      ins_date
 ----------- --------------------------
(0 rows affected)
Output completed (0 sec consumed) - Normal Termination

What happened???? It looks like the insert happened – we did get back the standard “(1 row affected)” message after all – and no error was raised….but curiously, neither did we get the results of @@error….hmmmmmm…and all the tables are empty. Let’s change the trigger slightly to:

create trigger trigger_test_trg on trigger_test for insert
as
begin
    declare @currow int
    select @currow = rownum from inserted
    insert into hist_table_1 values (@currow, getdate())
    exec bad_example @currow
    select @@error
    select * from hist_table_1
end
go

And add an extra insert to the execution:

---------- isql ----------
1> use pubs2
1> begin tran
1> insert into hist_table_1 values (0, getdate())
2> insert into trigger_test (some_chars) values ("Testing 1 2 3.....")
3> select @@error
(1 row affected)
 -----------
           0
(1 row affected)
 rownum      ins_date
 ----------- --------------------------
           0 Jan  4 2006  1:21AM
         401 Jan  4 2006  1:21AM
(2 rows affected)
1> commit tran
1> select * from trigger_test
2> select * from hist_table_1
3> select * from hist_table_2
 rownum      some_chars
 ----------- ----------------------------------------
(0 rows affected)
 rownum      ins_date
 ----------- --------------------------
(0 rows affected)
 rownum      ins_date
 ----------- --------------------------
(0 rows affected)
Output completed (0 sec consumed) - Normal Termination

Whoa! Still no error inside the trigger immediately after the proc call with -4 returned, and the rows were being inserted….but…no data. The reason is that if a nested procedure inside a trigger (or another procedure) returns a negative return code, ASE assumes that the system actually did raise the corresponding error (i.e. -4 is a permission problem) and that it is supposed to rollback the transaction.

All of course, without errors….which means if this happened at the replicate database, the replicate would get out of synch with the primary and no errors would get thrown. Ouch!!!

Trigger/Procedure Execution Time

Besides data inconsistency problems when triggers exist, the biggest problem with triggers is that the typical coding style for triggers is not optimized for single row executions. It is not uncommon to see multiple joins to the inserted/deleted tables throughout a trigger, or joins that could be eliminated using variables when a single row is all that was affected. This results in a lot of unnecessary extra I/O that lengthens the trigger execution time needlessly.
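As an illustration of the single-row optimization (the table and column names are hypothetical), a trigger can capture @@rowcount and take a variable-based path when only one row is affected:

create trigger orders_upd_trg on orders for update
as
begin
    declare @rows int, @order_id int
    select @rows = @@rowcount
    if @rows = 0 return
    if @rows = 1
    begin
        -- single-row path: read the key into a variable, no join against inserted
        select @order_id = order_id from inserted
        update order_summary
           set last_update = getdate()
         where order_id = @order_id
    end
    else
    begin
        -- multi-row path: fall back to the join with inserted
        update order_summary
           set last_update = getdate()
          from order_summary, inserted
         where order_summary.order_id = inserted.order_id
    end
end
go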

Trigger and procedure execution time are extremely, extremely critical. One metric of interest may be to know that trigger based referential integrity is 20 times slower than declarative integrity (via constraints). Remember, in order to maintain commit order, the Replication Server basically applies the transactions in sequence – even in parallel DSI scenarios, the threads block and wait for the commit order. As a result, while procedure execution is great for Replication Server performance from thread processing perspective, the net effect is that as soon as a long procedure begins execution, the following transactions in the queue effectively are delayed. Note, that this is not unique to stored procedures – long running transactions will have the same effect (i.e. replicating 50,000 row modifications in a single transaction vs. a procedure that modifies them have the same effect at the replicate system – however, the procedure is much less work for the Replication Server processing).

As a result, particular attention should be paid to stored procedure and trigger execution times (if you for some odd reason opt not to turn triggers off for that connection). Any stored procedure or trigger that employs cursors, logged I/O in tempdb, joins with inserted/deleted tables, etc. should be candidates for rewriting for performance. Ideally, triggers should be disabled for replication at the replicate via the DSI configuration ‘dsi_keep_triggers’.

Key Concept #19: Besides possibly causing database consistency issues, trigger execution overhead is so high and probable coding style so inefficient, that triggers may be the primary cause of replication throughput problems – and as a consequence triggers should be disabled via ‘dsi_keep_triggers’ until proven necessary and then enabled individually if possible.

To see how to individually enable triggers, refer back to the trick on replicating SQL statements via a procedure call and using @@options to detect the trigger status.

Concurrency Issues

In replicate only databases, concurrency is mainly an issue between the parallel DSI threads or when long running procedures execute and lock entire tables. However, in shared primary configurations – workflow systems or other systems in which the data in the replicate is updated frequently, concurrency could become a major issue. In this case, user transactions and Rep Server maintenance user transactions could block/deadlock each other. This may require decreasing the dsi_max_xacts_in_group parameter to reduce the lock holding times at the replicate as well as ensuring that long running procedures replicated to that replicate database are designed for concurrent environments.


Key Concept #20: In addition to concurrency issues between maintenance user transactions when using Parallel DSI’s, if the replicate database is also updated by normal users, considerable contention between maintenance user and application users may exist. Reducing transaction group sizes as well as designing long running procedures to not cause contention are crucial tasks to ensuring the contention does not degrade business performance at the replicate or Replication Server throughput.

Similar to any concurrency issue, depending on what resources are the source of contention, it may be necessary to use different locking schemes, etc. at the replicate than at the primary (or same if Warm Standby). Consider the following activities:

Strategy Comment

Additional Indexes Additional indexes, particularly if replicating to a denormalized schema or data warehouse could increase contention. While not necessarily avoidable, it may require a careful “pruning” of OLTP specific indexes.

DOL Locking Eliminate index contention and data row contention by implementing DOL locking at the replicate system.

Table Partitioning Provide parallel DSI’s multiple last pages to avoid contention without implementing DOL locking.

Triggers Off Have RS DSI disable triggers – especially data validation triggers

Obviously, the above list is not complete, but may provide ideas to resolve contention issues when the contention is not due to the holding of locks longer due to transaction grouping.


Procedure Replication

Is it true that I can’t replicate both procedures and affected tables??

Procedure vs. Table Replication

The above question reflects a common misconception that you cannot replicate both procedures and the tables modified by replicated procedures. This is partially based on the following paragraph:

“If you use function replication definitions, do not attempt to replicate affected data using table replication definitions and subscriptions. If the stored procedures are identical, they will make identical changes to each database. If the affected tables are also replicated, duplicate updates would result.”

- page 9-3 in Replication Administration/11.5

However, consider the following paragraphs:

In replicating stored procedures via applied functions, it may be advisable to create table replication definitions and subscriptions for the same tables that the replicated stored procedures will affect. By doing this you can ensure that any normal transactions that affect the tables will be replicated as well as the stored procedure executions.

However, DML inside stored procedures marked as replicated is not replicated. Thus, in this case, you must subscribe to the stored procedure even if you also subscribe to the table.

- page 3-145 in Replication Reference/11.5

Confused?? A lot of people are. What it really refers to is if you replicate a procedure, the DML changes within the procedure will not be replicated, no matter what. The way this is achieved is that normally, as a DML statement is logged, if the object’s OSTAT_REPLICATE flag is set, then the ASE logger sets the transaction log record’s LSTAT_REPLICATED flag. For a stored procedure, this means that the stored procedure receives the LSTAT_REPLICATED flag, and the ASE logger does not mark any DML records for replication until after that procedure execution has completed. This is illustrated with the following sample fragment of a transaction log:

XREC_BEGINXACT              (implicit transaction)
XREC_EXECBEGIN proc1        (proc execution begins)
XREC_INSERT Table1          (insert DML inside proc)
XREC_INSERT Table2          (insert DML inside proc)
XREC_DELETE Table3          (delete DML inside proc)
XREC_EXECEND                (end proc execution)
XREC_ENDXACT                (end implicit tran)

Only the procedure execution records will have the LSTAT_REPLICATED flag set, and consequently be forwarded by the Replication Agent to the Replication Server.

Attempting to force both to be replicated (i.e. executing a replicated procedure in one database with replicated DML modifications in another) could lead to database inconsistencies. The only way to force this replication is to a) replicate a procedure call in one database and b) that procedure modify data in a table that is also replicated in another database. This would allow both to be replicated as two independent log threads would be involved. The one that would be evaluating the DML for replication would not be aware that the DML was even inside a procedure that was also replicated.

Which brings us to the point the second reference was making. The second reference stated that it “may be advisable to create table replication definitions and subscriptions for the same tables…”. The reason for this is exactly the fact that DML within a procedure is NOT replicated – and needs reverse logic to understand the impact. Consider the scenario of New York, London, Tokyo, San Francisco and Chicago all sharing trade data. A procedure at New York is executed at the close of the day to update the value of mutual funds based on the closing market position of the funds stock contents. All the other sites subscribe to the mutual fund portfolio table. Now, consider what would happen if only San Francisco and Chicago subscribed to the procedure execution. Neither London nor Tokyo would ever receive the updated mutual fund values!!! Why?? Since the DML within the replicated procedure is not marked for replication, the Replication Agent would only forward the procedure execution log records and NOT the logged mutual fund table modifications. Since neither subscribed to the procedure, they would not receive anything. This is illustrated below:


[Diagram: the New York inbound queue holds the replicated exec proc1 (the DML inside it is not forwarded); only the Chicago and San Francisco outbound queues receive exec proc1, while the London and Tokyo outbound queues receive nothing.]

Figure 41 – Replicated Procedure & Subscriptions

Which brings us to the following concept:

Key Concept #21: If replicating a procedure as well as the tables modified by the procedure, any replicate that subscribes to one should also subscribe to the other to avoid data inconsistency.

A notable exception to that is that if replicating to a data warehouse, the data warehouse may not want to subscribe to a purge or archive procedure executed on the OLTP system.

However, there is a gotcha when replicating procedures and tables. If replicating procedures and the dsi_keep_triggers setting is ‘off’ database inconsistencies might develop. The reason is evident in the below scenario:

1. At the primary, a replicated procedure is executed. In the procedure, an insert occurs on Table A. Table A’s trigger modifies Table B.
2. The procedure is replicated as normal via the Rep Agent to the Replication Server.
3. When applied, the procedure is executed. Because triggers are off, only the insert to Table A occurs.

Preventing this can be done in one of two ways. First the obvious – set dsi_keep_triggers to ‘on’. However, this could significantly affect throughput. The other – and possibly better approach – is to consider how the triggers got disabled in the first place – via a function string executing the command “set triggers off”. This then can be included in the procedure logic via a sequence similar to:

create procedure proc_a
    @param1 datatype [, @paramn datatype]
as
begin
    if proc_role("replication_role") = 1
        set triggers on

    … dml statements …

    if proc_role("replication_role") = 1
        set triggers off
    return 0
end

By checking that the user has replication_role first, other users executing the same procedure would not get permission violations from the set triggers command. This brings up another key concept about procedure replication:

Key Concept #22: If replicating procedures, special care must be taken to ensure that DML triggered operations within the procedure are also handled or otherwise you risk an inconsistent database at the replicate.

Procedure Replication & Performance

Now that we have cleared that matter up and we understand that we can replicate procedures and the tables they affect simultaneously, the question is how does this affect performance. The answer - as in all performance questions - is: “It depends”. Replicating procedures can both improve replication performance as well as degrade replication performance. The former is often referenced in replication design documents, and consequently, will be discussed first.

Reduced Rep Agent & RS Workload

Consider a normal retail bank. At a certain part of the month, the bank updates all of the savings accounts with interest calculated on the average daily balance during that month. This literally can be tens of thousands to hundreds of thousands of records. If replicating the savings account table to regional offices, failover sites, or elsewhere, this would mean the following:

1. The Replication Agent would have to process and send to the Replication Server every individual account record.
2. The account records would have to be saved to the stable device.
3. Each and every account record would be compared to subscriptions for possible distribution.
4. The account records would have to be saved again to the stable device - once for each destination.
5. Each account record would have to be applied as an individual update at each of the replicates.

The impact would be enormous. First, beyond a doubt, the Replication Agent would lag significantly. Secondly, the space requirements and the disk I/O processing time would be nearly insurmountable. Third, the CPU resources required for tens to hundreds of thousands of comparisons are enormous. And lastly, the time it would take to process that many individual updates would probably exceed the required window.

How would replicating stored procedures help?? That’s easy to see. Rather than updating the records via a static SQL statement at the primary, a stored procedure containing the update would be executed instead. If this procedure were replicated, then the Replication Agent would only have to read/transfer a single log record to the Replication Server, which in turn would only have to save/process that single record. The difference could be hours of processing saved – and the difference between a successful replication implementation or one that fails due to the fact the replicate can never catch up due to latency caused by excessive replication processing requirements.

Key Concept #23: Any business transaction that impacts a large number of rows is a good candidate for procedure replication, along with very frequent transactions that affect a small set of rows.

Increased Latency & Contention at Replicate

So, if stored procedures can reduce the disk I/O and Replication Server processing, how can replicating a stored procedure negatively affect replication? The answer is two reasons: 1) the latency between the begin at the primary and the commit at the replicate; and 2) the extreme difficulty in achieving concurrency in delivering replicated transactions to the replicate once the replicated procedure begins to be applied.

Let’s discuss #1. Remember, Replication Server only replicates committed transactions. Now, using our earlier scenario of our savings account interest procedure, let’s assume that the procedure takes 4 hours to execute. We would see the following behavior:

1. The procedure begins execution at 8:00pm and implicitly begins a transaction.
2. The Replication Agent forwards the procedure execution record to RS nearly immediately.
3. The RS SQT thread caches the execution record until the procedure completes execution and the completion record is received via the implicit commit.
4. At midnight the procedure completes execution.
5. Within seconds, the Replication Agent has forwarded the commit record to RS and RS has moved the replicated procedure to the Data Server Interface (DSI).
6. The DSI begins executing the procedure at the replicate shortly after midnight.
7. Assuming all things being equal, the procedure will complete at the replicate at 4:00am.

Consequently, we have a total of 8 hours from when the process begins until it completes at the replicate, and 4 hours from when it completes at the primary until it completes at the replicate. This timeframe might be acceptable to some businesses. However, what if the procedure took 8 hours to execute? Basically, the replicate would not be caught up for several hours after the business day began - which may not be acceptable for some systems such as stock trading systems with more real time requirements. An example of this happening can be illustrated with the following scenario. Let’s assume that we have a bank that has a sustained 24x7 transaction rate of 20,000tph and that the interest calculation procedure takes 8 hours to run. For the sake of the example, let’s assume that we have Replication Server tuned to the point that it is delivering 500tpm or 30,000tph. This is illustrated in the following diagram (each of the lines represents one hour’s worth of transactions (20K=20,000tph)):

[Timeline diagram: 17 hourly transaction batches of 20K each (340,000 transactions in 17 hours = 20,000tph) executing at the primary alongside the interest calculation procedure.]

Figure 42 – Procedure & Transaction Execution At The Primary

Normally we would be happy as it would appear that we have a 50% surge capacity built into our system and we can go home and sleep through the night. Except that we would probably get woken up at about 4am by the operations staff due to the following problem:

[Timeline diagram: at the replicate, only 270,000 transactions are applied in 17 hours once the interest calculation procedure is included, leaving it 70,000 transactions behind.]

Figure 43 – Procedure & Transaction Execution At The Replicate

Even at 30,000tph, we are significantly behind. More than 7 hours in fact. Why? Remember, transactions must be delivered in commit order. Consequently, a full 240,000 transactions must be delivered by the RS before it can send the proc for execution. This delays the procedure from starting for 4 hours after it completes at the primary. Now that we are executing the procedure, it must complete before any other transactions can be sent/committed (discussed in the next paragraph). Whatever the cause, we are now 70,000 transactions behind - which sounds not that bad - a mere two hours or so at a 30,000tph rate (2h:20min to be exact). But…. during those 140 minutes, roughly another 47,000 transactions arrive at 20,000tph! Another way to look at it is that the RS has a net gain of only 10,000tph. Consequently, being 70,000 transactions behind represents 7 hours before we are caught up.

That explains the latency issue - what of the concurrency? Why can’t the normal transactions continue to execute at the replicate simultaneously with the procedure execution the same way they did at the primary? This requires a bit of thinking, but consider this: while the procedure is executing at the primary, concurrent transactions by customers (i.e. ATM withdrawals) may also be executing in parallel, as illustrated in the first timeline above. Since they would commit far ahead of the interest calculation procedure, they would show up at the replicate within a reasonable amount of time. Assuming this pattern continues even after the procedure completes (i.e. checks clearing from business retailers), as illustrated in the second timeline, the following would happen:

1. Procedure completes at primary. It is followed by a steady stream of other transactions – possibly even a batch job requiring 3 hours to run.

2. Since RS guarantees commit order at the replicate, RS processes the transactions in commit order and internally forwards them to the DSI thread for execution at the replicate.

3. If only using a single DSI, the follow-up transactions would not even begin until the interest procedure had committed – some 8 hours later. If multiple DSI’s and no contention, the DSI would have to ensure that the follow-up transactions did not commit first and would do so by not sending the commit record for the follow-up transactions until the procedure had finished.

4. Due to contention, the replicated batch process may not even begin execution via a parallel DSI until the replicated interest procedure committed.

In addition to the fact that transactions committed shortly after the interest procedure suddenly have an 8 hour latency attached, the question that should come up is "Can the Replication Server catch up?". The answer is: doubtfully, prior to the start of the business day. So, …

Key Concept #24: Replicated procedures with long execution times may increase latency by delaying transactions from being applied at the replicate. The CPU and disk I/O savings with RS need to be balanced against this before deciding to replicate any particular procedure.

As a result, it may be advisable to actually replicate the row modifications. This could be done by not replicating the procedure but instead having the procedure cursor through each account. This would be the same as atomic updates, each a separate transaction (after all, there is no reason why Annie Aunt's interest calculation needs to be part of the same transaction as Wally the Walrus's – but whether or not that is how it is done at the primary, at the replicate they would all be part of the same transaction because the entire procedure would be replicated and applied within the scope of a single transaction). While it may take the RS several hours to catch up – depending entirely on the replicate – it just might be less than the latency incurred by replicating the procedure.

Is there a way around this problem without replicating the individual row updates? Possibly. In this particular example, assuming the average daily balance is stored on a daily basis (or in some other form so that changes committed out of order do not affect the final result), a multiple DSI approach could be used to the replicate system, in which the replicated procedure would use its own dedicated connection to the replicates. Consequently, the Replication Server would be able to keep up with the ongoing stream of transactions while concurrently executing the procedure. However, this would only work where having a transaction that committed at the primary after the interest calculation but commits before it at the replicate does not cause a disparity in the balance. More will be discussed about this approach in a later section after the discussion about Parallel DSI's.

The following guidance is provided to determine whether or not to replicate the procedure or allow the affected tables to replicate as normal. You probably should consider replicating stored procedures when:

OLTP Procs - Frequently executed stored procedures with more than 5 DML operations with fast execution times.

Purge Procs - Purge procedures when one of the targets for replication is a reporting system which is used for historical trend analysis.

Large Update Procs - Procedures containing mass updates in a single statement, where the individual affected rows, if replicated, would exceed any reasonable setting for the number of locks.

You should consider not replicating the procedure and allowing the affected rows to replicate when:

Cursor Procs - Procedures that process a large set using cursor processing and applying the changes as atomic transactions.

Queue Procedures - Procedures that are processing sequential lists such as job queues (replicating these could result in inconsistent databases).

Long Running Procs - Procedures that perform a lot of I/O (selects or updates), causing them to have a long runtime (more than a few seconds).


System Functions in Proc - Procedures that contain calls to getdate(), suser_name(), user_name() or other system functions, which when executed at the replicate by the maintenance user will result in different data values than at the primary.

Triggers Executed by Proc - Procedures that contain DML operations that in turn invoke normal database triggers – particularly if the connection's dsi_keep_triggers is set to 'off', disabling trigger execution. (This can be corrected by using "set triggers on/off" within the procedure – see the sketch after this list – however, if it is a vendor package, you may not have the ability to change the source.)

Improper Transaction Management in Proc - Procedure does not implement proper transaction management (discussed earlier) unless it can be corrected to behave properly.
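For the trigger case above, the workaround referenced can be sketched as follows. This is a hedged example – the procedure, table, and column names are invented, and it assumes the maintenance user holds replication_role (required by "set triggers"):

create procedure upd_balance_repl @acct_id int, @amount money
as
begin
    set triggers on          -- re-enable trigger firing for this session only
    update accounts
        set balance = balance + @amount
        where acct_id = @acct_id
    set triggers off         -- restore the behavior implied by dsi_keep_triggers 'off'
    return 0
end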

As with all guidance, this is offered as a starting point; you should test your transactions to determine which is best for your environment.

Procedures & RPC’s vs. Language (DML)

From the very earliest times, we have heard that stored procedures are faster than language batches. This is not always true for reads - but it is certainly true for write operations - and for the same reason in both cases: query optimization. As we all know, stored procedures are optimized at the initial execution and then subsequent executions re-use this pre-optimized plan. While this can be a problem for reports and other complex queries that have a lot of flexibility in query search arguments, it can significantly help DML operations. If you think about it, each DML statement sent by RS to the replicate database goes through the same sequence:

1. Command parsing
2. Query compilation/object resolution
3. Query optimization
4. Query execution

It turns out that step 2 and especially step 3 take significantly more time than one would think. While the difference varies by platform and cpu speed, a stored procedure containing a simple insert executes anywhere from 2-3x faster for C code and up to 10x faster for JDBC applications than the individual insert/values statement.

The obvious question is how can this be exploited for a DML-centric process such as Replication Server? The answer is understanding what all constitutes a stored procedure in ASE:

• Traditional stored procedure database objects - executed either as language calls or RPC's
• Fully prepared SQL Statements/Dynamic SQL - these create a dynamic procedure on the server which is invoked via the RPC interface
• Queries using the ASE Statement Cache - a query contained in the statement cache is compiled as a dynamic procedure

The first is quite easily understood - we are referring to the usual database objects created using the "create procedure" T-SQL command. As mentioned, a stored procedure can either be executed as a language call or an RPC call. Most scripts that call stored procedures via isql are using a language call, while inter-server invocations such as 'SYB_BACKUP…sp_who' get executed using the RPC interface (particularly in that case, as it is running against the Sybase Backup Server, which doesn't support a language interface).

Fully prepared statements or dynamic SQL are used in very high speed systems with a large number of repeating transactions. For JDBC, this involves setting the connection property DYNAMIC_PREPARE=true, while for CT-Lib applications, the ct_dynamic() statement is used along with ct_param() and a slightly different form of ct_send(). In either case, what happens is that the ASE server creates a dynamic procedure that is executed repeatedly via the RPC interface. A pseudo-code representation of application logic for this might resemble:

stmtID = PrepareSQLStatement('insert into table values (?,?)')
while n <= num_rows
    stmtID.setParam(1, <value[n,1]>)
    stmtID.setParam(2, <value[n,2]>)
    stmtID.execute()
end loop
stmtID.DropSQLStatement()

Note the 'language' portion of the command is parameterized as it is sent to the server - and that it is only sent to the server one time. Subsequent calls simply set the parameter values and execute the procedure. Obviously, since Replication Server is a compiled application, we can't change the CT-Lib source code to invoke fully prepared statements, so this approach is not open to us (although a future enhancement to Replication Server 15.0 is considering an implementation using dynamic SQL). Note, however, that unlike language statements, RPC calls can not be batched - however, early tests suggest that a dynamic SQL implementation within RS, even without command batching, is much faster than language statements with command batching. Additionally, there likely will be restrictions on using custom function strings for obvious reasons.

The second method, using the ASE statement cache, likely will not help either. The ASE statement cache, introduced in ASE 12.5.2, is effectively a combination of the two previous approaches. Since ASE 12.5.2, as each SQL language command is parsed & compiled, an MD5 hash value is created. If the statement cache is enabled, this hash is compared to existing query hashes already in the statement cache. If no match is found, the query is optimized and converted into a dynamic procedure much like in the preceding paragraph. If found, the query literals are used much like parameters to a prepared statement and the pre-optimized version of the query is executed.

However, the reason it is stated that this may not help is that in early 12.5.x and 15.0 ASE’s, the statement MD5 hash value included the command literals. For example, the following queries would result in two different hashkeys being created:

-- query #1
select * from authors where au_lname = 'Ringer'
-- query #2
select * from authors where au_lname = 'Greene'

The problem with this approach was that statements executed identically, with only a change in the literal values, would still incur the expense of optimization. Additionally, the restrictions on the statement cache pretty much limited it to update, delete and select statements. Insert into table () values () statements were not cacheable, nor were SQL batches such as if/else constructs. As a result of these restrictions, the statement cache in ASE 12.5.x does not benefit Replication Server unless there is an extremely high incidence of the same row being updated.

In ASE 15.0.1, a new configuration option 'statement cache literal parameterization' was introduced along with the associated session-level setting 'set literal_autoparam [off | on]'. When enabled, the constant literals in query texts are replaced with variables prior to hashing. While this may result in a performance gain for some applications, there are a few considerations to keep in mind:

• Just like stored procedure parameters, queries using a range of values may get a bad query plan. For example, a query with a where clause specifying where date_col between <date 1> and <date 2>.

• Techniques for finding/resolving bad queries may not be able to join strictly on the hashkey, as the hashkey may represent multiple different query arguments.

However, the restriction on insert/values is still in effect; consequently, the improved statement caching will only help applications with significant update or delete operations. If your system experiences a large number of update or delete operations, this should be considered – if not for the server, certainly for the maintenance user, by altering the rs_usedb function to include the session setting syntax. ASE 15.0.2 is supposed to lift the restriction on caching insert statements; consequently, when it is released, RS users may see a considerable gain in throughput.
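As a hedged sketch of that suggestion (the function-string class name is an assumption – the system-provided classes cannot be modified directly, so a user-created class derived from rs_sqlserver_function_class is assumed – and the rs_usedb variable syntax should be verified against your RS version):

alter function string rs_usedb
for my_derived_fstring_class
output language
'use ?rs_destination_db!sys_raw? set literal_autoparam on'
go

This simply appends the session-level setting to the command the DSI issues when it first selects the destination database, so only the maintenance user's session is affected.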

Consequently for insert intensive applications, the only means to exploit this today (through RS 15.0 ESD #1) is to use custom function strings that call stored procedures and create stored procedures for each operation (insert, update, delete). You may wish to test using a function string output style of RPC as well as language to determine whether language based procedures with command batching give you performance gains over RPC style or vice versa. However, keep in mind that both ASE 15.0.2 and a future release of RS that will implement dynamic SQL/fully prepared statements may eliminate the need for this. As a result, it may be more practical to upgrade to those releases when available vs. converting to function strings and procedure calls.
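A hedged illustration of the function string approach for an insert – the replication definition (authors_rd), derived function-string class, wrapper procedure and columns are all hypothetical:

-- In the replicate database: a wrapper procedure for the insert
create procedure authors_ins
    @au_id char(11), @au_lname varchar(40), @au_fname varchar(20)
as
begin
    insert into authors (au_id, au_lname, au_fname)
        values (@au_id, @au_lname, @au_fname)
    return 0
end
go

-- In the Replication Server: replace the default rs_insert with an RPC call
alter function string authors_rd.rs_insert
for my_derived_fstring_class
output rpc
'execute authors_ins
    @au_id = ?au_id!new?,
    @au_lname = ?au_lname!new?,
    @au_fname = ?au_fname!new?'
go

As suggested above, benchmark the rpc output style against a language output style calling the same procedure (the latter can still be command batched) before settling on one.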

Procedure Transaction Control

One of the least understood aspects of stored procedure coding is proper transaction control. It is commonly thought that the following procedure template has proper transaction control and represents good coding style. However, in reality, it demonstrates improper transaction control.

create procedure proc_name <parameter list>
as
begin
    <variable declarations>
    begin transaction tran_name
    insert into table1
    if @@error > 0
        rollback transaction
    insert into table2
    if @@error > 0
        rollback transaction
    insert into table3
    if @@error > 0
        rollback transaction
    commit transaction
end

The problem arises when the procedure is called from within another transaction, as in:

begin transaction tran_1
exec proc_name <parameters>
commit transaction

The reason the problem occurs is the mistaken belief that if nested “commit transactions” only commit the current nested transaction, then a nested rollback only rolls back to the proper transaction nesting level. Consider the following code:

begin tran tran_1
    <statements>
    begin tran tran_2
        <statements>
        begin tran tran_3
            <statements>
            if @@error > 0
                rollback tran tran_3
        commit tran tran_3
        if @@error > 0
            rollback tran tran_2
    commit tran tran_2
    if @@error > 0
        rollback tran tran_1
commit tran tran_1

While nested commits do only commit the innermost transaction, application developers need to keep the following rules in mind, particularly regarding rollback transaction statements:

• Rollback transaction without a transaction_name or savepoint_name rolls back a user-defined transaction to the beginning of the outermost transaction.

• Rollback transaction transaction_name rolls back a user-defined transaction to the beginning of the named transaction. Though you can nest transactions, you can roll back only the outermost transaction.

• Rollback transaction savepoint_name rolls a user-defined transaction back to the matching save transaction savepoint_name.

The above bullets are word for word from the Adaptive Server Enterprise Reference Manual. The second bullet's last sentence sums it up quite simply – unless you use transaction savepoints (explicit use of "save transaction" commands), you can only roll back the outermost transaction. As a result, any rollback transaction encountered automatically rolls back to the outermost transaction unless a savepoint name is specified (it also points to the fact that only outer transactions and savepoints can have transaction names). Consequently, a procedure that attempts to implement transaction management can have undesired behavior during a rollback if it itself was called from within a transaction. This is crucial as Replication Server always delivers stored procedures within an outer transaction as part of the normal transactional delivery.
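A small worked example (table names are illustrative) shows the difference a savepoint makes – only the work after the savepoint is undone, and the outer transaction still commits:

begin tran outer_tran
    insert into table_a values (1)      -- survives
    save tran sp1
    insert into table_b values (2)      -- undone by the rollback below
    rollback tran sp1                   -- rolls back only to the savepoint
commit tran                             -- the insert into table_a is committed

Without the save tran, the rollback would have undone both inserts, regardless of how deeply the transactions were nested.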

The second common problem with procedures is the fact that if transaction management is not implemented at all, simply raising an error and returning a non-zero return code does not represent a failed execution. Consider the following common code template:

create procedure my_proc <parameter list>
as
begin
    insert into table_1
    if @@error > 0
    begin
        raiserror 30000 <error message>, <variables>
        return -1
    end
    return 0
end

It often surprises people that if the procedure is marked for replication and an error occurs, it still gets replicated and fails at the replicate resulting in the DSI thread suspending. The reason is simple. Even though an error was raised, the implicit transaction (started by any atomic statement) was not rolled back. Consequently, this leads to the following points:

• Stored procedures that are replicated should always be called from within a transaction, should check to see if in a transaction and rollback the transaction as appropriate during exception processing.


• Alternatively, stored procedures that are replicated should be implemented as sub-procedures that are called by a parent procedure after local changes have completed successfully AND then the sub-procedure should be called from within a transaction managed by the parent procedure.

• Stored procedures that implement transaction management should ensure a well-behaved model is implemented using appropriate save transaction commands (see below).

The first point is illustrated with the following template:

create procedure my_proc <parameter list>
as
begin
    if @@trancount < 1
    begin
        raiserror 30000 "This procedure can only be called from within an explicit transaction"
        return -1
    end
    insert into table_1
    if @@error > 0
    begin
        raiserror 30000 <error message>, <variables>
        rollback transaction
        return -1
    end
    return 0
end

Notice the additions to the previous code: the @@trancount check at the top and the rollback in the error handler. The second point is probably the best implementation for replicated procedures as it allows minimally logged functions for row determination (exact details how are beyond the scope of this discussion) and ensures the local changes are fully committed before the "call" to the replicated procedure is even attempted. A sample code fragment would be similar to:

create procedure my_proc <parameter list>
as
begin
    declare @retcode int
    insert into table_1
    if @@error > 0
    begin
        raiserror 30000 <error message>, <variables>
        return -1
    end
    begin tran my_tran
    exec @retcode = replicated_proc <parameters>
    if @retcode != 0
    begin
        raiserror 30000 "Call to procedure replicated_proc failed"
        rollback transaction
        return -1
    end
    else
        commit tran
    return 0
end

Note that this would roll back an outer transaction as well if called from within a transaction.

Finally, implementing proper transaction control for a stored procedure actually resembles something similar to the following:

create procedure my_proc <parameter list>
as
begin
    declare @began_tran int
    if @@trancount = 0
    begin
        select @began_tran = 1
        begin tran my_tran_or_savepoint
    end
    else
    begin
        select @began_tran = 0
        save tran my_tran_or_savepoint
    end
    <statements>
    if @@error > 0
    begin
        rollback tran my_tran_or_savepoint
        raiserror 30000 "something bad happened message"
        return -1
    end
    if @began_tran = 1
        commit tran
    return 0
end

Again, note the additions: the procedure detects whether it was called within a transaction and uses either a begin tran or a save tran accordingly. Since only the outermost transaction actually commits the changes, using nested transactions is a fruitless exercise. A more useful mechanism, as demonstrated, is to implement savepoints at strategic locations that can be rolled back as appropriate. Each procedure, when called, simply needs to determine if it has been called from within a transaction or not. If not, it begins a transaction. If it was called within a transaction, it simply implements savepoints to roll back the changes it initiated. However, it would still be the responsibility of the parent procedure to roll back the transaction (by checking the return or error code as appropriate).

Procedures & Grouped Transactions

To understand why improper transaction control can lead to inconsistencies at the replicate – and more to the point, "seemingly spurious duplicate key errors" – you need to consider the impact of transaction batching and error handling. Consider the following SQL batch as if sent from isql:

insert statement_1
insert statement_2
insert statement_3
insert statement_4
insert statement_5
go

If statement 3 fails with an error, statements 4 & 5 still execute as members of the batch. Now, put this in context of replication transaction grouping – which if issued via isql would resemble the following:

begin transaction
rs_update_threads 2, <value>
insert statement_1
insert statement_2
exec replicated_proc_1
insert statement_3
exec replicated_proc_2
insert statement_4
insert statement_5
rs_get_thread_seq 1
-- end of batch
-- if succeeded
rs_update_lastcommit
commit tran
-- if it didn't succeed, disconnect to force a rollback
-- rollback tran

Now, let's suppose that the second call to a replicated proc (exec replicated_proc_2) fails and a "normal" transaction management model was implemented as discussed earlier vs. a proper implementation. The effect would be that the entire transaction batch would get rolled back to where the transaction began; however, the subsequent inserts (#4 & #5) would still execute and succeed (remember, a rollback does not suspend execution, it merely undoes changes). Fortunately, in one sense, the error raised would cause RS to attempt to rollback and retry the entire transaction group individually. However, since inserts #4 & #5 were executed outside the scope of a transaction, they would not get rolled back by the RS. On retry (after the error was fixed for the replicated proc), upon reaching inserts #4 & #5, both would raise "duplicate key errors". Checking the database would reveal the rows already existing, and simply resuming the DSI connection and skipping the transaction would keep the database consistent, but leave a very confused DBA wondering what happened.
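In that situation, once the DBA has confirmed that the rows already exist at the replicate, one possible (hedged) recovery is simply to skip the offending transaction – the server and database names here are illustrative, and the transaction at the head of the DSI queue should be verified before skipping:

resume connection to RDS.rdb skip transaction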

Procedures with “Select/Into”

The latter example probably raised a quick “but..but..” from developers who are quick to state that replicating procedures with “select/into..” is not possible due to “DDL in transaction” errors at the replicate system. Very true if procedure replication is only at the basic level – which typically is not the optimal strategy for procedure replication. While this may seem to be more appropriately discussed in the primary database section earlier, the transaction “wrapping” effect of Replication Server has often caused application developers to change the procedure logic at the primary. Case in point, procedures with select/into execute fine at the primary, however, fail at the replicate due to DDL in tran errors. Many developers then are quick to re-write both to eliminate the select/into – not only affecting the performance at the replicate, but also endangering performance at the primary. So, in a way, it does make sense to discuss it here.

The best way to decide what to do with procedures containing “select/into” is by assessing the number of physical changes actually made to the real tables the procedure modifies and the role of the worktable created in tempdb. Several scenarios are discussed in the following sections. A summary table is included first for ease of reference between the scenarios.


Solution                                    Applicability

Replicate tables vs. procedure              • complex (long run time) row identification
                                            • small number of real rows modified

Worktable & subprocedure                    • complex (long run time) row identification
                                            • small number of rows in the worktable
                                            • large number of rows in the real tables

Procedure rewrite without select/into       • row identification easy
                                            • worktables contain large row counts
                                            • large number of rows modified in the real table

Replicate Affected Tables vs. Procedures

This is a classic case of replicating the wrong object. In some cases, the stored procedure may use a large number of temporary tables to identify which rows to modify or add to the real database in a "list paring" concept, and the final number of rows affected in replicated tables is actually fairly small. Consider the following example:

Update all of the tax rates for minority owned business within the tax-free empowerment zone to include the new tax structures.

Since these empowerment zones typically encompass only an area several blocks in size, the number of final rows affected will probably be only a couple dozen. However, the logic to identify the rows may be fairly complicated (i.e. a certain linear distance from an epicenter) and may require "culling" down the list of prospects using successive temp tables until only the desired rows are left. For example, the first worktable may be a table simply to get a list of businesses and their range to the epicenter – possibly using the zip code to reduce the initial list evaluated. The second list would be constrained to only those within the desired range that are minority owned. The pseudo code would look something like:

select business_id, minority_owner_ship, (range formula) as distance
into #temptable_1
from businesses
where zip_code in (12345, 12346)

select business_id, minority_owner_ship, distance
into #temptable_2
from #temptable_1
where distance < 1
  and minority_owner_ship > 0.5

update businesses
set tax_rate = tax_rate - .10
from #temptable_2 t2, businesses b
where b.business_id = t2.business_id

Now, let's take a look at what would happen if this were in a procedure. The first temporary table creation might take several seconds simply due to the amount of data being processed, and the second may also take several seconds due to the table scan required to filter the data from the first temp table. The net effect would be a procedure that requires (just for the sake of discussion) possibly 20 seconds for execution – 19 of which are the two temp table creations. The decision to replicate the rows or the procedure then becomes one of determining whether the average number of rows modified by the procedure takes longer to replicate than the time to execute the procedure at the replicate. For instance, let's say that when executed, the average execution of the procedure is 20 seconds, modifying 72 rows. If it takes 10 seconds to move the 72 rows through Replication Server and another 13 seconds to apply the rows via the DSI, it still may be better to replicate the rows vs. changing the procedure to use logged I/O and permanent worktables, as that might slow down the procedure execution to 35 seconds.

Worktable & Subprocedure Replication

However, in many cases, it is simply too much to replicate the actual rows modified. Take the above example again, only this time, let's assume that the target area contains thousands of businesses. Replicating that many rows would take too long. However, think of the logic in the original procedure at the primary:

Step 1 – Identify the boundaries of the area
Step 2 – Develop list of businesses within the boundaries
Step 3 – Update the businesses tax rates


Now think about it. Step 1 really needs a bit more logic. In this example, identifying the boundaries as the outer cross streets does not help you identify whether an address is within the boundary unless you employ some form of grid system a la Spatial Query Server (SQS). The real logic would more likely be:

Step 1 – Identify the outer boundaries of the area
Step 2 – Identify the streets within the boundaries
Step 3 – Identify the address range within each street
Step 4 – Develop list of businesses with address between range on each street
Step 5 – Update the businesses tax rates

Up through step 3, the number of rows is fairly small. Consequently the logic for a stored procedure could be similar to:

(Outer procedure – outer boundaries as parameters)
    Insert list of streets and address range into temp table
(Inner procedure)
    Update business tax rate where address between range and on street

As a result, you simply need to replicate the worktable containing the street number ranges and the inner procedure. The procedure at the primary then might look like:

create procedure set_tax_rate
    @streetnum_n int, @street_n varchar(50),
    @streetnum_s int, @street_s varchar(50),
    @streetnum_e int, @street_e varchar(50),
    @streetnum_w int, @street_w varchar(50),
    @target_demographic varbinary(255),
    @new_tax_rate decimal(3,3)
as
begin
    -- logic to identify N-S streets in boundary using select/into
    -- logic to identify E-W streets in boundary using select/into
    begin tran
    insert into street_work_table
        select @@spid, streetnum_n, streetnum_s, streetname from #NS_streets
        union all
        select @@spid, streetnum_e, streetnum_w, streetname from #EW_streets
    exec set_tax_rate_sub @@spid, @target_demographic, @new_tax_rate
    commit tran
    return 0
end

create procedure set_tax_rate_sub
    @proc_id int,
    @target_demographic varbinary(255),
    @new_tax_rate decimal(3,3)
as
begin
    update businesses
    set tax_rate = @new_tax_rate
    from businesses b, street_work_table swt
    where swt.streetname = b.streetname
      and b.streetnum between swt.low_streetnum and swt.high_streetnum
      and swt.process_id = @proc_id
      and b.demographics & @target_demographic > 0
    delete street_work_table where process_id = @proc_id
    return 0
end

By replicating the worktable (street_work_table) and the inner procedure (set_tax_rate_sub) instead of the outer procedure, the difficult logic to identify the streets within the boundaries is not performed at the replicate, allowing the use of select/into at the primary database for this logic while reducing the number of rows actually replicated to the replicate system. Note the following considerations:

• Inner procedure performs cleanup on the worktable. This reduces the number of rows replicated as only the inserts into the worktable get replicated from the primary.

@@spid is a parameter to the inner procedure and a column in the worktable. The reason for this is that in multi-user situations, you may need to identify which rows in the worktable are for which user's transactions. Since the spid at the replicate will be the spid of the maintenance user and not the same as at the primary, it must be passed to the subprocedure so that the maintenance user knows which rows to use.

• The inner procedure call and inserts into the worktable are enclosed in a transaction at the primary. This is due to the simple fact that if the procedure hits an error and aborts, the procedure execution was still successful according to the primary ASE. As a result it would still be replicated and attempted at the replicate. By enclosing the inserts and proc call in a transaction, the whole unit could be rolled back at the primary, resulting in a mini-abort in the RS that would purge the rows from the inbound queue.

The last point is fairly important. Any procedure that is replicated should be enclosed in a transaction at the primary. This will allow user-defined exits (raiserror, return -1) to be handled correctly, provided that the error handling does a rollback of the transaction. Despite the fact that an error is raised and a negative return status is returned from the procedure, it is still a successful procedure execution according to ASE; consequently it is replicated to all subscribing databases, where the same raiserror would occur, resulting in a suspended DSI.

A crucial performance suggestion for the above is to have the clustered index on the worktable include the spid and one or more of the other main columns as indexed columns. For example, in the above case the clustered index might include spid and streetname. Then, if the real data table (businesses) has an index on streetname, the update via join can use the index even if no other SARG (true in the above case) is possible.
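A minimal sketch of that index (column names follow the earlier worktable example; everything else is an assumption):

create clustered index street_work_cidx
    on street_work_table (process_id, streetname)
go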

While this technique may appear to have limited applicability, in actuality, it probably resolves most of the cases in which a select/into is used at the primary database and not all the rows are modified in the target table (establishing the fact some criteria must exist – replicate the criteria vs. the rows). Situations it is notably applicable for include:

Area Bounded Criteria – DML involving area boundaries identified via zip codes, area code + phone exchange, countries, regions, etc. A classic example is the “mark all blood collections from regions with E-Coli outbreak as potentially hazardous” example often used in replication design examples as good procedure replication candidates. The list of blood donations would be huge, but the list of collection centers located in those regions is probably very small.

Specified List Criteria – In certain situations, rather than using a range, a specified list is necessary to prevent unnecessarily updating data inclusive in the range at the replicate (a consolidated system) but not in the primary. For example, a list of personnel names being replicated from a field office to the headquarters. This could include dates, account numbers, top 10 lists, manufacturers, stores, etc.

As well as any other situation in which a fairly small list of criteria exists compared to the rows actually modified.

Procedure Rewrite without Select/Into

This, unfortunately, is the most frequent fallback for developers suddenly faced with the select/into at replicate problem – and agreeably, sometimes it is necessary. However, this usually requires permanent working tables in which the procedure makes logged inserts/updates/deletes. This should only be used when the identifying criteria is the entire set of rows or a range criteria that is huge in itself. An example is if a procedure is given a range of N-Z as parameters. While it is possible to create a list of 13 characters and attempt the above, the end result is the same – thousands of rows will be changed. A classic case would be calculating the finance charges for a credit card system. In such a situation – even if the “load” was distributed across every day of the month by using different “closing dates” – tens of thousands to millions of rows would be updated each execution of the procedure. Since most credit cards operate on an average daily balance to calculate the finance charges, the first step would be to get the previous month’s balance (hopefully stored in the account table), subtract any payments (as these always apply to “old” balances first). This is a bit more difficult than simply taking the average and dividing by the number of days. Consider the following table:

Day      Charge      Balance
Begin    1,000.00    1,000.00
1                    1,000.00
2                    1,000.00
3                    1,000.00
4           50.00    1,050.00
5                    1,050.00
6                    1,050.00
7                    1,050.00
8                    1,050.00
9                    1,050.00
10          75.00    1,125.00
11                   1,125.00
12                   1,125.00
13         150.00    1,275.00
14                   1,275.00
15                   1,275.00
16         125.00    1,400.00
17                   1,400.00
18                   1,400.00
19                   1,400.00
20                   1,400.00
21                   1,400.00
22                   1,400.00
23                   1,400.00
24         500.00    1,900.00
25                   1,900.00
26                   1,900.00
27                   1,900.00
28                   1,900.00
29                   1,900.00
30                   1,900.00

Avg Bal              1,366.67

As you can see, there is no way to simply take the sum of the new charges ($900) and get the final answer. As a result, the system needs to first calculate the daily balance for each account and then insert the average daily balance multiplied by some exorbitant interest rate (i.e. 21% for department store cards) for the finance charge. For sake of argument, let’s assume this is done via a series of select/into’s (possible with about 3-4 – an exercise left for the reader). Obviously, no matter what time the procedure runs, it will run for several hours on a very large row count. Replicating the procedure is a must as replicating all the row changes at the end of every day (assuming every day is a “closing date” for 1/30th of the accounts), could be impractical. Consequently, instead of using select/into’s to generate the average daily balances, a series of real worktables would have to be used.
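A hedged sketch of what one of those steps might look like – the table and column names are invented for illustration – with the select/into replaced by a logged insert into a permanent worktable keyed by the process id:

-- Instead of: select acct_id, avg(daily_bal) as adb into #adb from ... group by acct_id
insert into adb_work (process_id, acct_id, avg_daily_balance)
select @@spid, acct_id, avg(daily_bal)
from daily_balance_work
where closing_day = @closing_day
group by acct_id

-- ... finance charge insert/update driven from adb_work ...

delete adb_work where process_id = @@spid     -- cleanup, as in the earlier worktable example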

Separate Execution Connection

This last example (finance charges on average daily balance) clearly illustrates a problem, though, in replicating stored procedures. At the primary system – assuming no contention at the primary – the finance charges procedure could happily run at the same time as user transactions (assuming the finance charge procedure used a cursor to avoid locking the entire table). However, as described before, in order to guarantee that the transactions are delivered in commit order, the Replication Server applies the transactions serially. Consequently, once the procedure started running at the replicate, it would be several hours before any other transactions could begin. Additionally, at the replicate, the entire update would be within a transaction – if it didn't fail due to exhausting the locks, the net result would be a slow lockdown of the table. This, of course, is extremely unsatisfactory.

One way around this is to employ a separate connection strictly for executing this and other business maintenance. In doing so, normal replicated transactions could continue to be applied while the maintenance procedure executed on its own. The method to achieve this is based on multiple (not parallel – multiple) DSI's, which is covered later in this section. Needless to say, there are many, many considerations to implementing this, which are covered later; consequently, this should only be used when other methods have failed and procedure replication is really necessary. One of those considerations is the impact on subsequent transactions that used/modified data modified by the maintenance procedure. Due to timing issues with a separate execution connection, it is fully possible that such an update makes it to the replicate first – only to be clobbered by the later execution of the maintenance procedure.

One of the other advantages to this approach is that statement and transaction batching could both be turned off. This would allow the procedure at the replicate to contain the select/into, provided that system administrators were willing to accept a manual recovery (similar to system transactions). With both statement and transaction batching off, the following procedure would work.

create procedure proc_w_select @parm1 int
as
begin
    declare @numtrans int
    select @numtrans = @@trancount
    while @@trancount > 0
        commit tran
    -- select into logic
    begin tran
    -- updates to table
    commit tran
    while @@trancount < @numtrans
        begin tran
    return 0
end

This is similar to the mechanism used for system transactions such as DDL or truncate table. In the case of system transactions, Replication Server submits the following:

rs_begin
rs_commit
-- DDL operation
rs_begin
rs_commit

The way this works is that the rs_commit statements update the OQID in the target database. During recovery, only three conditions could exist:

rs_lastcommit OQID < first rs_commit OQID – In this case, recovery is fairly simple as the empty transaction prior to the DDL has not yet been applied. Consequently, the RS can simply begin with the transaction prior to the DDL.

rs_lastcommit OQID >= second rs_commit OQID – Similar to the above, recovery is simple as this implies that the DDL was successful since the empty transaction that followed it was successful. As a result, Rep Server can begin with the transaction following the one for which the OQID was recorded.

rs_lastcommit OQID = first rs_commit OQID – Here all bets are off. Reason is that one of two possible situations exists. Either 1) the empty transaction succeeded but the DDL was not applied (replicate ASE crashed in middle); or 2) both were applied. Since the DDL operation is not within an rs_commit, the OQID is not updated when it finishes. Consequently the administrator has to check the replicate database and make a conscious decision whether or not to apply the system transaction. Hence the added “execute transaction” option to resume connection command. By specifying execute transaction, the administrator is telling RS to re-apply the system transaction as it never really was applied. If instead it had run but the second rs_commit had not, then simply leaving it off the resume connection is sufficient.

Accordingly, by committing and re-beginning the transactions at the procedure boundaries, you cannot be sure whether the proc finished if the OQID is equal to the OQID prior to the proc execution. If it was successful, resume connection DS.DB skip transaction provides similar functionality to leaving off "execute transaction" for system transactions.
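The corresponding commands (run against the Replication Server; the server and database names are illustrative) would be along these lines:

-- The system transaction (or, here, the procedure) was NOT applied: re-execute it
resume connection to RDS.rdb execute transaction

-- It WAS applied: for a system transaction, resume without the option;
-- for the replicated procedure case above, skip it instead
resume connection to RDS.rdb
resume connection to RDS.rdb skip transaction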

However, it is critical that the procedure be fully recoverable – possibly even to a point where it could recover from a previous incomplete run. If the actual data modifications were made outside a transaction, then when a failure occurs during the execution, reapplying the procedure after recovery would result in duplicate data. So, for example, the finance charge procedure would only develop the list of average monthly balances from accounts that did not already have a finance charge for that month.
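A hedged sketch of that guard (the schema is invented for illustration) – the balance list is built only for accounts that have not yet been charged for the month:

insert into adb_work (process_id, acct_id, avg_daily_balance)
select @@spid, a.acct_id, avg(d.daily_bal)
from accounts a, daily_balance_work d
where a.acct_id = d.acct_id
  and a.closing_day = @closing_day
  and not exists (select 1 from finance_charges fc
                  where fc.acct_id = a.acct_id
                    and fc.charge_month = @charge_month)   -- already charged: skip
group by a.acct_id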


Replication Routes

To Route or Not to Route, That is the Question…

One of the key differences between Sybase's Replication Server and competing products is the routing capabilities. In fact, it is the only replication product on the market that supports intermediate routes. Routing was developed for Sybase Replication Server from the onset to support long-haul network environments while providing performance advantages in that environment over non-routed solutions. The goal of this section is to provide the reader with a fundamental understanding of this feature, how it works, considerations and performance aspects.

Routing Architectures

Replication routing architecture is not a topic for the uninitiated, as it has significant similarities with messaging/EAI technologies. That's a topic for later. Understanding routing architectures requires an understanding of the basic route types and then the different topologies and the types of problems they were designed to solve.

Route Types

Anyone who has been around Sybase RS for more than a few months knows that there are two different types of routes that Rep Server provides: Direct and Indirect.

Direct Routes

A direct route implies that the Primary Replication Server (PRS) and Replicate Replication Server (RRS) have a direct logical path between them (logically adjacent). In fact, it is common to have two connections, since routes are unidirectional in Sybase Replication Server. This has more to do with how routes work from an internals perspective, however, and should not be viewed as a limitation. Sybase very easily could have used a single command to construct a bi-directional route; however, it would have posed a problem with indirect routes and the flexibility of having different intermediate sites between two endpoints. The below diagram illustrates two one-directional routes between the primary and replicate servers:

[Diagram: primary dataserver/database and RepAgent → PRS (with its RSSD) → RRS (with its RSSD) → replicate dataserver/database, with RSM monitoring; a second direct route runs in the reverse direction]

Figure 44 - Two One-Direction Direct Routes between Primary & Replicate

Indirect Routes

An indirect route implies that the Primary Replication Server (PRS) and Replicate Replication Server (RRS) are separated by one or more Intermediate Replication Servers. An intermediate route was illustrated at the beginning of this paper with the following diagram:


[Diagram: primary dataserver/RepAgent → PRS (with RSSD) → IRS (intermediate Replication Server with its own RSSD) → RRS (with RSSD) → replicate dataserver/database, with RSM monitoring]

Figure 45 - An Example of an Intermediate Route

Each of the Replication Servers above first has a direct route to its neighbor and then an indirect route to the replicate. At first glance, some may question the reason for even using intermediate routes, but many of the topologies (as we will see) essentially require them.
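In RCL terms (server names and login credentials are illustrative, and the exact clause syntax should be verified against your RS version), the topology above might be declared from the PRS along these lines – a direct route to the adjacent intermediate RS, and an indirect route to the replicate RS that names the intermediate as the next site:

-- Direct route to the neighboring (intermediate) Replication Server
create route to IRS_server
set username to prs_rsi_user
set password to prs_rsi_passwd
go

-- Indirect route to the replicate RS, carried through the intermediate
create route to RRS_server
set next site to IRS_server
go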

Route Topologies

Once routing gets implemented, it doesn’t take long before the term topology starts being discussed. Topology is nothing more than a description of the connections between the different sources & targets. With each topology, certain things are understood (i.e. a hierarchical topology implies a rollup factor) and certain aspects are also immediately known (i.e. bidirectional replication, etc.). There are only a limited number of base topologies, however, large implementations may find that they combine different topologies within their data distribution architecture. Each of the base topologies are discussed in the following sections.

Point-to-Point

A point-to-point topology is characterized by every RS having a direct connection to every other RS. Classic implementations include Remote Standby and Shared Primary (Peer-to-Peer).

Remote Standby

In a typical Warm Standby system, a single Replication Server is used. This restriction is mainly due to the fact that routing is implemented as a connection and hence an outbound connection. Since WS only uses the inbound queue, it has been restricted to a single RS. In some environments where the standby system is extremely remote (i.e. 100's of miles away), the connectivity between the RepAgent and the RS becomes a bit of a problem. The reason is that with longer WANs, not only is the bandwidth lower, but the line quality and other factors also become an issue. Consequently it may sometimes be advisable to set up a "replicated copy" in which all the tables are published and subscribed to using standard replication definitions and subscriptions, using two replication servers - one local and one remote.


[Diagram: New York (Primary) replicating to San Francisco (Standby)]

Figure 46 - Example of a Remote Standby

This has some distinct performance advantages:

• Empty begin/commit pairs and other types of non-replicated data get filtered out immediately at the primary
• The transaction log continues to drain as normal and is not impacted by WAN outages
• Other destination systems are not impeded by having transactions first go to the remote site as a normal WS would dictate. Instead, they can subscribe at the local node.
• Tends to be more resilient to network issues

It also has some very acute disadvantages:

• Doesn't support a logical connection internal to RS
• Doesn't support automated failover
• Has increased latency in respect to RS processing, especially with large transactions

The first point may appear to be fairly minor, but in reality, it can be a real bear to deal with. While it is true that if the system is isolated this is not a problem, it is equally true that if the system participates in replication to/from other sites, it gets real sticky. The reason is that some of the nuances of a logical connection are not well known. Consider the following scenarios:

[Diagram: New York (Primary), San Francisco (Standby), and Chicago (a different application) – with question marks over which connection Chicago should replicate to and from]

Figure 47 - The Standby as a Target Puzzler

Now, comes one of those times in this paper where you have to engage your thinking cap…

• How is the switch affected from Chicago's viewpoint (question marks above)? Remember, the two would be different connections in the same domain - duplicate subscriptions are not the answer. Having transactions applied to SF directly could cause database out-of-sync issues. The issue is that NY users can modify the source data, later updated by Chicago replicated transactions. But, due to latency and timing, the Chicago replicated updates get to SF first, then the replicated NY changes. The result is that Chicago's transactions would appear to have been lost.

• Using the NY RS as an intermediate route for the SF RS from Chicago (CH → NY → SF as a RS route) would not be the answer either. Again, consider the problem posed at the end of the last bullet. The Chicago transaction still has a distinct probability of getting to SF first if the transactions are executed close together.

• So, if we just replicate to NY from Chicago, what happens when NY fails? Some of the Chicago transactions will be stranded in the transaction log while others will be in the queue - the outbound queue in NY RS, which will not drain since NY ASE is dead. Potentially others are still stranded in the Chicago RS outbound queue for the route. Simply trying to switch Chicago to SF could result in missing transactions since the currently active segment in the queue is past those transactions and routing does not forward transactions intact (later discussion in internals).

By now you are beginning to see the real purpose behind the logical connection for a WS. While this is a different topic altogether (Warm Standby Replication), two of the important aspects of a WS connection are that the transactions sent to the logical pair are routed correctly in the event of a failover, and that transactions are applied to the primary, which in turn re-replicates them to the standby ('send warm standby xacts' effectively encapsulates 'send maint xacts to replicate'; however, an rs_marker is used to signal when to begin sending all transactions to avoid transactions applied by the other node). Additionally, rs_lastcommit is replicated; consequently, once replicate systems reconnect to the logical pair, they see the last transaction that made it to the pair (hence the 'strict' save interval as well). However, we are digressing deep into a topic that deserves its own discussion.

A simpler solution to the problem above is to have Chicago not use a route to NY & SF, but to use a multiple-DSI approach and a different maintenance user (and connection name, due to the domain). Regardless, the point of this entire discussion is that while it may be tempting to set up replicated standbys for more local systems, be absolutely 200% positive that it is the best approach. If performance is the issue, it probably is solvable via means other than this implementation, as it is doubtful that this implementation really will improve performance over a properly tuned WS implementation. The driver for this sort of implementation should be network resilience.

Shared Primary (Peer-to-Peer)

The other classic implementation for point-to-point topologies is a shared primary or peer-to-peer implementation. In a Peer-to-Peer implementation a distinct model of data ownership is defined - either on different sets of tables, column-wise within tables or row-wise within tables. This type of implementation is often illustrated as:

[Diagram: New York, Chicago, and San Francisco in a peer-to-peer topology – each site holds NY/CH/SF data and replicates the partition it owns to the other two sites]

Figure 48 - Typical Shared Primary/Peer-to-Peer Implementation

This technique is often referred to as "data ownership" from a replication standpoint, but implies another concept called "application partitioning". In a shared primary implementation, application partitioning is done implicitly at each site by restricting the users from modifying other sites' data. It is important to note that request functions have been used by some customers to modify another site's data by sending the change request to that site - or by having the change request implement an ownership change.


MP Implementations

Another successful implementation of the shared primary approach that really drives home this point is when the system is divided for load balancing purposes. In a typical environment, the reads (selects) grossly outnumber the writes (DML) and consequently are the driving force when a machine is at capacity. In such a case, a larger machine often is the answer. But what if no larger machines are available? Additionally, a single large machine is a single point of failure and leaves customers exposed. Some customers started using RS from the earliest days to maintain a loose model of a massively parallel system by using peer-to-peer replication. A typical implementation looked like:

[Diagram: transaction routers distributing work across multiple peer nodes, each node holding data partitions A-G, H-Q, and R-Z]

Figure 49 - MPP via Load Balancing with RS

This implementation is more or less a cross between an MPP shared-disk approach (Oracle, Microsoft) and an MPP shared-nothing approach (IBM, Sybase). As weird as the above may look, it has some advantages over both models. Interestingly enough, Oracle 9i Real Application Clusters (RAC) enforces application partitioning (forget the marketing hype - read the manuals) and implements a block ownership and block transfer. The problem of course is that the block transfers are on demand, which slows a cross node query (hence their own benchmarks do not allow users to read a block they didn't write). Microsoft quite explicitly uses a transaction router to enforce application partitioning. IBM and Sybase (old Navigation Server/Sybase MPP) split the data among different nodes and used result set merging. For ASE 15.0, Sybase is planning on implementing MPP via a federated database using unioned views. The above implementation has a couple of advantages over RAC/MS (shared-disk) as well as result set merging (shared-nothing).

1. First, RAC/MS (and each node of a shared-nothing system) has a single copy of the database - and consequently a single point of failure.
2. Queries involving remote data execute substantially quicker as the data is local.
3. Shared-nothing approaches essentially union data. In some cases an aggregate function across the datasets then becomes an application implementation (i.e. count(*) or sum(amount) across nodes involves summing the individual results vs. unioning the results).
4. Cross node writes can be handled as request functions or via function strings (i.e. aggregates) to prevent blocking on contentious columns (think balance for a bank account - now consider cross account transfers). Shared-disk architectures in particular have problems with this as Distributed Lock Managers are necessary for coordination and cache coherency resolution. Shared-nothing architectures have severe problems as well, as this often reverts to a 2PC.

The downside, of course, is that each node is looking at a point-in-time (historical) copy of the data from other nodes, which may not be current. A little-known fact is that the same is true of Oracle RAC - the blocks are copies from when the transaction began. Probably a little closer in time than with the above, but still a problem. An additional downside is that each node must be able to support the full write load while handling a fraction of the query load. If it cannot support the full write load under any query load, then a shared-primary implementation and pure application partitioning will be necessary, in which only data truly needed at the other nodes is replicated.

Incidentally, a fine example of a transaction router is OpenSwitch, although it would be easy to implement in an application server as well.


Hub & Spoke

Hub & Spoke implementations are commonly used where point-to-point implementations are no longer practical due to scalability and management. Consider the common point-to-point scenario described in the last section. It is fine as long as the number of sites is in the 3-4 range and possibly could be extended to 5. However, remember that the number of connections from each site is one less than the total number of sites. In fact, it would be twice that number due to the unidirectional nature of routes - so for M sites, the total number of connections is M*(M-1)*2. For 3 sites, a total of 12 would be needed; 5 would require 40. As you can tell, as the numbers grow beyond 5, the number of connections gets to be entertaining. Consequently a "hub & spoke" implementation could be used with a common "arbitrator/re-director" in the middle.

Figure 50 - Hub & Spoke Implementation

Note that the site in the center "lacks" a database. The reason for this is that its sole purpose is to facilitate the connections.

An astute observer may be quick to point out that logically you still need to create the individual routes as if it were point-to-point, with the only difference in the above being that the "hub" is specified as the intermediate node. A true statement - however, it does not take into consideration the processing and possibly the disk space that is saved at each site. Every replicated row goes to the same outbound queue, where it is passed to another Replication Server (the "hub") which determines the destination(s).
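To make this concrete, a minimal RCL sketch follows, assuming hypothetical Replication Server names NY_RS, LON_RS and TKY_RS for the spokes and HUB_RS for the hub (server names, user and password are illustrative only; verify the create route syntax against your RS version). Each spoke creates one direct route to the hub and indirect routes to the other spokes through it:

-- at NY_RS: a single direct route to the hub
create route to HUB_RS
set username NY_RS_rsi
set password NY_RS_rsi_ps
go

-- ...and indirect routes to the other spokes, naming the hub as the next site
create route to LON_RS
set next site HUB_RS
go
create route to TKY_RS
set next site HUB_RS
go

The hub, in turn, holds a direct route to every spoke. Adding a new site then means one new direct route at the hub plus one indirect route definition at each existing spoke, rather than a full mesh of direct routes.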

Circular Rings

A circular ring is a topology in which each Replication Server has direct routes only to those “adjacent” to it. This is largely due to the fact that most communications flow sequentially about the ring, typically in a single direction. A classic example was illustrated earlier in “follow-the-sun” technical support systems. Such systems typically use globally dispersed corporate centers to avoid having 24-hour shifts locally. For example, Sybase has support centers in Concord, Massachusetts; Chicago, Illinois; Dublin, California; Hong Kong, China; Sydney, Australia; and Maidenhead, England. Additional support staff are distributed to other locations as well (Brazil, Netherlands, etc.), but these represent the “main” support centers for English speaking customers. Globally, this can be represented by:


Figure 51 - Sybase’s Follow-The-Sun TS Implementation

Just by looking at it, you can discern the "ring" between the centers. While Sybase's actual implementation is different, you could picture that as a support case is opened, it is sent to the next site as a precaution. If a handoff is necessary, an ownership change for that case is effected. As soon as the support person at the next site makes any modification, it will replicate, and consequently the next site will have the information.

Geographic Distribution

This logically leads to the next and one of the more common topologies - "Geographic Distribution". The primary reason for this topology used to be the limited bandwidth between the continents. As that has largely been resolved in recent years, the biggest benefit now is Replication Server performance, as efficiencies are realized by implementing such a system. Consider the following topology:

Figure 52 - Possible Geographic Distribution Topology for a Global Corporation

This is where IBM, Oracle and Microsoft lose it. Because they lack indirect routes, they must create direct routes from/to every site. In the above illustration, there are ~35 sites, yet the most that any one site has a direct route to is 5. A change to a lookup table is easily distributed to all of the sites. A system that does not have indirect routing would have to create 35x34 or 1,190 connections in order to support replication to/from every site. The amount of processing saved is enormous.

Hierarchical Trees

The above topology is considered a basic one even though it combines elements of others. In it, sites that need to communicate with other local sites have direct routes to those sites. Look at it in a slightly different way and you get an illustration of cascading nodes. As a result, it is very similar to probably one of the most common routing implementations (along with remote standby) - hierarchical. A hierarchical topology is very similar to an index tree for databases in that there is a root node and several "levels" until the bottom is reached. It differs in that the intermediate levels also represent functional nodes.

One of the clearest examples of a hierarchical implementation can be witnessed in a large retail department store chain. We will use a mythical chain of Syb-Mart. Each Syb-Mart store sells the usual clothing, furniture, tools, automotive goods, etc. Some of these items bear the Syb-Mart label while others are national brands.


Each store reports its receipts to a regional office, which in turn feeds to an area office, which in turn feeds to a national headquarters, and finally to corporate headquarters. This hierarchy can be illustrated as follows:

[Figure content: hierarchy tiers - Field, Regional, Area, National, Corporate]

Figure 53 - Syb-Mart Mythical Hierarchical Topology

Both sales and HR information (such as timesheet data, hirings, firings, etc.) would move up the tiers (perhaps using function strings to only apply aggregates at each higher level), while pricing information (sale prices, price increases, etc.) could be replicated down the tiers.

One of the more difficult concepts to grasp is that each of the tiers need not be simply a "roll-up" of all the information below. It is often assumed that each tier is a consolidation of the tiers below, perhaps with the addition of some aggregate values. It is true that many of the "business objects" - products, product SKU's, prices, promotions, and perhaps on-site inventories - may be present in all the tiers, along with individual employee records (such as name, employee id, address, store, etc.). However, the field sites may have a record of each individual transaction (business "events"), while the higher level tiers would only retain daily/monthly/yearly aggregates. Some HR information, such as individual employee timesheets, might also only be recorded as aggregates at each level, but at the top level each record may be present in detail for payroll purposes. This last example is one that is sometimes missed - detail records "going" to the top, while intermediate locations only receive aggregates. In fact, it is arguable that all detail records should roll up to the top, if for no other reason than to feed the corporate data warehouse.

The biggest problem with hierarchical tiers is a re-organization in which field sites migrate from one regional center to another. The problem is not the routing, which is trivial to modify, but rather the subscription de-materialization/re-materialization and supporting data elements. For example, in the above illustration, each of the field sites would be similar and somewhat independent of the regional site. The store's current database status regarding past sales, current inventory, etc. would not change. In this case, simply dropping the subscriptions to the previous regional center and adding them to the new regional center (without materialization, of course) may be all that is necessary from the store's perspective. There may be minor additional rows needed at the regional center to handle the new field site (or some removed), but all-in-all fairly simple. However, HR information is a little different. In the case of HR data, employees would no longer be (possibly) accountable to the original region, and it more than likely would be a security risk to have employee data still resident in a system where no one has a need to know that information anymore. The new regional center would, of course, need the employee data. This is kind of an interesting paradox in that at some level in the tiers, the personnel would still roll up under the same "area" or "national" node. At whatever levels in between, either bulk or atomic de-materialization and re-materialization would be required.

Hierarchical implementations still remain one of the most common, but database administrators need to plan for the capability to re-organize quickly. As soon as a re-organization is announced, they need to review what the original and final physical topologies would resemble and then determine the actions necessary to carry it out.


Logical Network

For large systems, it may be best to borrow an analogy from the hardware domain and implement a logical network. A logical network essentially is a "back-bone" of Replication Servers whose sole purpose is to provide efficient routing and ease connection management - similar to the hub-and-spoke approach earlier. However, it typically mixes in elements of geographic distribution and most often resembles the geographic distribution topology - usually because corporate bandwidth is allocated from corporate to the main regional centers (more than likely larger metropolitan areas with the infrastructure in place). Let's consider our Syb-Mart hierarchical example above. Assuming a very wide distribution of stores (one in every friendly neighborhood), consider the following hypothetical map of high-bandwidth networks (maintained by that great monopoly phone system).

[Figure legend: Major Metropolitan City, Syb-Mart Regional HQ, High Bandwidth Network]

Figure 54 - Hypothetical High-Bandwidth Connections

It would make sense to put a Replication Server at each of the metropolitan areas above to implement the "back-bone". For example, stores in Charleston SC technically report to the Eastern Regional HQ in Boston, MA. In a pure hierarchical model, a direct connection would be created between them. Certainly, the network routers from the phone company would take care of physically routing the traffic most effectively; consequently, it may be possible to do so. However, in past years, train crashes in tunnels in Baltimore, brownouts in San Francisco, backhoes in Reston, VA, etc. have disrupted communications - some for days. By using a "back-bone" with multiple paths, company systems personnel could easily re-route replication along alternate routes. Additionally, each of the major metropolitan centers could function as a "collector" for all of the stores in its region, reducing network traffic for price changes while ensuring that data flows along the quickest route possible.

Routing Internals

Now that we understand logically how routing can be put to use, let’s discuss the internals of how it works.

RS Implementation

Support for routing within the Replication Server is fairly unique. From a source system’s perspective, the route is the same as any other destination. However, in moving the data through the system, routes exploit some neat features. Consider the following diagram.


Figure 55 - Replication Server Routing Internal Threading

The path for routing is as follows:

1. The Rep Agent sends the LTL stream to the Rep Agent User thread as normal.
2. The Rep Agent User thread performs normalization and then passes the information to the SQM for storage as usual.
3. The SQM writes the data to the inbound queue.
4. The SQT thread performs transaction sorting as usual.
5. The SQT thread passes the sorted transactions to the DIST thread.
6. The DIST thread passes each transaction to the subscribing site's SQM. If the subscriber is a local database, it sends the data to that database's SQM thread. However, if the subscriber is a remote database, it finds the next RS on the route and sends the data to the SQM for that RS.
7. The outbound SQM for the route writes the data to the outbound queue as normal.
8. The Replication Server Interface (RSI) thread reads the data from the outbound queue via the SQM.
9. The RSI forwards the rows to the remote RS via the RSI User thread in that RS.
10. The RSI User thread sends the data to the DIST thread, which only needs to call the MD module to read the bitmask of destinations and determine the appropriate outbound queues to use.
11. The DIST sends the rows to the SQM of the destination database.
12. The SQM writes the data to the outbound queue.
13. The DSI-S reads the data from the outbound queue (via SQM) and then sorts the transactions into commit order.
14. The DSI-S performs transaction grouping and submits each group to the DSI-Execs as usual.
15. The DSI-Execs generate the appropriate SQL and apply it to the replicate database.

Consider the following points about the above:

• There will be an SQM and RSI thread for each direct route created from any RS. Consequently, if an RS has 3 direct routes to 3 other RS's, there will be 3 RSI outbound threads and associated SQM's and outbound queues.

• A route does not have an inbound queue. The "inbound" processing (if you would call it that) simply determines which queues to place the data in - either an outbound queue for a local database or the outbound queue for the next Replication Server on the route. The RSI User thread (a type of EXEC thread similar to the RepAgent User thread) merely serves as a connection point.

• The MD is the only module of a Distributor thread necessary. All of the subscription resolution (SRE) and transactional organization (TD) have already been completed at the primary RS. If you remember, we stated that a bitmask was used to reflect the destinations. For local databases, this bitmask translates to an outbound queue. For remote databases, a single copy of the message with the bitmask is placed into the RS outbound queue. Hence only a single copy of the message is necessary for each direct route.


• Unlike the DSI interface, the RSI interface is non-transactional in nature. For example, it does not make SQT calls and does not base delivery on completed transactions. Instead, it operates on much the same principles as a Replication Agent – it simply passes the row modifications as individual messages to the replicate Replication Servers and tracks recovery on a message id basis (and consequently, it is the only mechanism in Replication Server in which orphan transactions can happen – mainly due to a data loss in the outbound queue).

A common misconception is that "admin quiesce_force_rsi" is used to quiesce all RS connections - DSI and RSI. However, it really only applies to RSI connections, as DSI threads are in a perpetual state of attempting to quiesce. The reason this command is used is that, similar to the RepAgent/RepAgent User thread communications, the RSI thread batches messages to send to remote RS's. In return, the message acknowledgements are sent only on a periodic or as-requested basis. "admin quiesce_force_rsi" checks to see if the RS is quiescent, the same as "admin quiesce_check". However, whereas "admin quiesce_check" merely checks to see if RSI acknowledgements have been received, "admin quiesce_force_rsi" forces all of the RSI threads to send any outstanding messages and then prompt for acknowledgements.
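For reference, both commands are issued directly against the Replication Server (for example from isql); a brief sketch:

-- passive check: reports whether the RS is quiescent, but does not force the
-- RSI threads to flush their batched messages
admin quiesce_check
go

-- active check: forces each RSI thread to send any outstanding messages and
-- request acknowledgements before reporting quiescent status
admin quiesce_force_rsi
go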

RSI Configuration Parameters

The following configuration parameters are available for tuning replication routing.

Parameter (Default) Description

disk_affinity Default: off

Specifies an allocation hint for assigning the next partition. Enter the logical name of the partition to which the next segment should be allocated when the current partition is full.

rsi_batch_size Default: 262,144 Recommended: 4MB if on RS 12.6 ESD #7 or RS 15.0 ESD #1.

The number of bytes sent to another Replication Server before a truncation point is requested. The range is 1024 to 262,144. This works similarly to the Replication Agent's scan_batch_size configuration setting. This normally should not be adjusted downwards unless you are in a fairly unstable network environment and want the RSI outbound queue to be kept trimmed. In RS 12.6 ESD #7 and RS 15.0 ESD #1, the maximum was increased to 128MB.

rsi_fadeout_time Default: -1

The number of seconds of idle time before Replication Server closes a connection with a destination Replication Server. The default (-1) specifies that Replication Server will not close the connection. In low volume routing configurations this may be set higher (i.e. 600 = 10 minutes) to reduce connection processing in the replicate Replication Server.

rsi_packet_size Default: 2048 Recommended: 8192

Packet size, in bytes, for communications with other Replication Servers. The range is 1024 to 8192. In high-speed networks, you may want to boost this to 8192. The RSI uses an 8K send buffer to hold pending messages to be sent. When the number of bytes in the buffer would exceed the packet size, the send buffer is flushed to the replicate RS.

rsi_sync_interval Default: 60

The number of seconds between RSI synchronization inquiry messages. The Replication Server uses these messages to synchronize the RSI outbound queue with destination Replication Servers. Values must be greater than 0. This is analogous to the scan_batch_size parameter of a Replication Agent, but is measured in seconds instead of rows.

rsi_xact_with_large_msg Default: shutdown

Specifies route behavior if a large message is encountered. This parameter is applicable only to direct routes where the site version at the replicate site is 12.1 or earlier. Values are “skip” and “shutdown.”

save_interval Default: 0 minutes

The number of minutes that the Replication Server saves messages after they have been successfully passed to the destination Replication Server. See the Replication Server Administration Guide Volume 2 for details.

As you can see, there are very few adjustments needed to the defaults for routing.
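Where a change is warranted, route parameters are set with alter route. A hedged sketch follows, using a hypothetical route name LON_RS and the recommendations from the table above (confirm the parameter names, value limits, and whether a given parameter requires the route or RS to be restarted to take effect in your RS version):

-- bump the packet size for a high-speed network (range 1024-8192)
alter route to LON_RS set rsi_packet_size to '8192'
go

-- larger batch between truncation-point requests (4MB; only valid on
-- RS 12.6 ESD #7 / RS 15.0 ESD #1 or later per the table above)
alter route to LON_RS set rsi_batch_size to '4194304'
go

-- allow an idle RSI connection to fade out after 10 minutes (low-volume routes only)
alter route to LON_RS set rsi_fadeout_time to '600'
go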

RSI Monitor Counters

Replication Server 12.6 extended the basic counters from 12.1 to the following counters to monitor RSI activity.


Counter Explanation

BytesSent Total bytes delivered by an RSI sender thread.

PacketsSent Total packets sent by an RSI sender thread.

MsgsSent Total RSI messages sent by an RSI thread. These messages contain the distribute command.

MsgsGetTrunc Total RSI get truncation messages sent by an RSI thread. This count is affected by the rsi_batch_size and rsi_sync_interval configuration parameters.

FadeOuts Number of times that an RSI thread has been faded out due to inactivity. This count is influenced by the configuration parameter rsi_fadeout_time.

BlockReads Total number of blocking (SQM_WAIT_C) reads performed by an RSI thread against the SQM thread that manages an RSI queue.

SendPTTimeLast Time, in 100ths of a second, spent sending the last packet of data to the RRS.

SendPTTimeMax Maximum time, in 100ths of a second, spent sending packets of data to the RRS.

SendPTTimeAvg Average time, in 100ths of a second, spent sending packets of data to the RRS.

Replication Server 15.0 changed these slightly to:

Counter Explanation

BytesSent Bytes delivered by an RSI sender thread.

PacketsSent Packets sent by an RSI sender thread.

MsgsSent RSI messages sent by an RSI thread. These messages contain the distribute command.

MsgsGetTrunc RSI get truncation messages sent by an RSI thread. This count is affected by the rsi_batch_size and rsi_sync_interval configuration parameters.

FadeOuts Number of times that an RSI thread has been faded out due to inactivity. This count is influenced by the configuration parameter rsi_fadeout_time.

BlockReads Number of blocking (SQM_WAIT_C) reads performed by an RSI thread against the SQM thread that manages an RSI queue.

SendPTTime Time, in 100ths of a second, spent sending packets of data to the RRS.

RSIReadSQMTime The time taken by an RSI thread to read messages from the SQM.

Essentially, other than adding the new counter RSIReadSQMTime, the only change is in line with the other threads in that the SendPTTimeLast/Max/Avg counters are collapsed into a single counter, SendPTTime.

Again, by comparing some of these counters with each other, different performance metrics can be established. For example, comparing PacketsSent and BytesSent gives an idea of the usefulness of changing the rsi_packet_size parameter. Additionally, by comparing with other threads, the ability of the RSI to keep up can be determined (i.e. SQM:CmdsWritten and RSI:MsgsSent). If using RS 15.0 and the route seems slow, the last two counters can be used to determine whether it is the network (or downstream RRS) or the outbound queue reading speed that is the largest source of time.
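As an illustration of the PacketsSent/BytesSent comparison, the short T-SQL sketch below computes the average bytes per packet from two hypothetical sampled counter values (substitute your own numbers taken over the same sample interval):

-- hypothetical counter values observed over one sample interval
declare @BytesSent numeric(18,0), @PacketsSent numeric(18,0), @rsi_packet_size int
select  @BytesSent = 16400000, @PacketsSent = 8200, @rsi_packet_size = 2048

select  avg_bytes_per_packet = @BytesSent / @PacketsSent,
        pct_of_packet_size   = (@BytesSent / @PacketsSent) * 100.0 / @rsi_packet_size
-- If the average is consistently near the current rsi_packet_size (here ~2000 of 2048
-- bytes), the packets are full and raising rsi_packet_size (e.g. to 8192) may help;
-- if the average is well below the packet size, a larger packet size will buy little.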

One thing to note is that the RSI does not have an SQT library function - messages are simply sent in the order they appear in the outbound queue. The problem with this is that the RSI lacks the SQT cache that can help buffer activity when the downstream system is lagging slightly - which may translate into more blocks being read physically than desired. As a consequence, since the RSI includes SQMR logic, the SQMR counters BlocksRead and BlocksReadCached may be helpful in determining why a route may be lagging.


Routing Performance Advantages

In certain circumstances, a routed connection will perform better than a non-routed connection. Some of these are described below. It is important to note that routes do not out-perform in all circumstances - in fact a common fallacy is that a route will outperform a normal Warm Standby setup even when the sites are located fairly close together.

SQL Delivery

In some cases, nearly all of the cpu is consumed with processing the inbound stream. As a result, little cpu is available for the DSI connection to generate and apply the SQL. However, since the RS threads are executed at the same priority, the DSI connection ends up getting the same amount of cpu time as the other threads. In this case, often the symptom is a fully caught up outbound queue, but a lagging inbound queue (due to DIST thread having to wait for access to the outbound queue SQM) or a lagging RepAgent. Prior to RS 12.5/SMP, in these cases, it made sense to split the replication processing in half by using a route. Consequently, one cpu could concentrate on the inbound connection, while another cpu (perhaps on the same box) would concentrate on SQL delivery.

This is frequently the excuse for why some sites set up their standby systems as remote standbys even when the systems are close together. As noted earlier, this poses some tremendous puzzlers to solve the minute the standby pair becomes a target of replication from another system. Additionally, the amount of cpu "gained" over a normal WS must exceed the cost of the additional cpu used for the DIST thread (typically suspended in WS-only configurations) as well as the extra I/O cost to write to the outbound queue. This is very difficult to substantiate, as some of the highest throughputs measured with Replication Server at customer sites have all been with traditional Warm-Standby configurations. Consequently, the most appropriate place for a "SQL Delivery" based performance improvement using routing is when the system is a normal replicate database and not a standby.

Distributed Processing

One of the more common implementations in routing environments is using multiple RS's to distribute the processing load when a single RS needs to communicate with a large number of connections. While a single Replication Server can handle dozens of connections, the amount of resources necessary on a single machine would be tremendous. Additionally, prior to RS 12.5/SMP, a single RS could easily be swamped trying to maintain a large number of high volume connections. Consequently, even from the earliest days of version 10.x, customers were implementing multiple Replication Servers using routing as a way of getting multi-processor performance. In such implementations, generally a single RS was implemented at each "source" with multiple Replication Servers serving the destinations as necessary.

In some cases, this was even implemented between only two nodes - a primary and a replicate. While obvious for remote nodes, it would not appear to be as necessary when both nodes are local. However, in some extremely high volume situations, the inbound processing could fully utilize a cpu. Under these circumstances, when not using the SMP version of RS, it may make sense to offload the DSI processing to another cpu via replication routing. This is particularly true in the case of corporate rollup scenarios in which the DSI’s SQT library may be exercised more fully since transactions from different sources may be intermingled.

With RS 12.5/SMP, this advantage is totally eliminated for local nodes. For remote nodes, a route still may be optimal to ensure network resilience.

Network Resilience

One of the biggest advantages of replication routes is the ability to provide network resilience. This capability is directly attributable to the concept of indirect routes. In recent years, there have been a number of incidents that have illustrated how easy it is to disrupt wide-area networks. Not too many years ago, a train crash and resulting fire in a tunnel in Baltimore, Maryland USA disrupted network communications for MCI for several days. Similarly, the World Trade Center disaster on 9/11 left many businesses in Manhattan electronically stranded - and left those that routed services through it equally disadvantaged. By using an indirect route, should a physical network outage occur, replication system administrators can simply re-direct the route over an alternate direct route.
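As a concrete (hypothetical) sketch: if an indirect route from LON_RS to TKY_RS normally passes through NY_RS and the New York leg is lost, the route can be redirected through another backbone node. Server names are illustrative; verify alter route behavior for in-flight data in your RS version.

-- at LON_RS: switch the indirect route to TKY_RS to flow through CHI_RS
-- instead of the unavailable NY_RS
alter route to TKY_RS
set next site CHI_RS
go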

Routing Performance Tuning

There really is not much to tune for a route. Out of the box, the configuration settings are fairly optimal for most environments, although some recommendations as above are appropriate. An intermediate node in the route really experiences minimal loading outside of the outbound queue for the outgoing route. However, you still shouldn't have an intermediate node attempting to service dozens of direct routes when a more conservative approach would be much more efficient. Consequently, route performance becomes more of a network tuning exercise. If the route is over a very low bandwidth network or is sharing the bandwidth with extremely high bandwidth applications such as video teleconferencing, you can expect very low performance from the route.


For most cases, however, a sudden drop in routing throughput will be due to an unexpected network issue such as an outage, DNS errors, or other network related problems.

There is one aspect to consider, however, if multiple databases are involved - there is only one RSI for each route. This can lead to IO saturation in some instances. Consider the differences between the following two scenarios:

Figure 56 - A Common Multi-DB Routing Implementation

Figure 57 - A More Optimal Multi-DB Routing Implementation

Why is this more optimal? In the first example, all 12 databases use the same route. This means that 12 DIST threads in one RS are all trying to write to the same outbound queue and a single RSI is trying to send the messages for 12 connections. This may be fine for low volume systems, but for high volume systems, the outbound queue for the RSI connection is likely going to be a source of contention and may become an IO bottleneck as well. In the bottom example, there are 4 routes - the load is split between the 4 routes using 4 outbound queues (one for each route) and 4 RSI's sending the messages. Additionally, each of the routes could have disk affinity enabled, reducing the chances for an IO bottleneck on a single device.

It might be tempting to think, then, that New York should have 4 RS's as well. While this may be true simply from a loading perspective, it may not help routing performance considering the direction London → New York. Remember, the route will have a unique DIST thread at the RRS that will be writing directly into the outbound queue for the destination connection. Consequently, as soon as we create 4 routes to London, there are 4 DIST threads - one for each route - in the NY_RS to handle the traffic in reverse.


As mentioned, though, the New York RS may be overloaded with the 12 connections. In fact, considering workload distribution and using multiple RS’s, the following depict the bad, better, better-yet, best architectures for a large multi-database source system:

Figure 58 - Bad - Not a Good Plan

Figure 59 - Not Much Better - But Unfortunately, All Too Common

Figure 60 - Ahhh….Feels Much Better


Figure 61 - The Best Yet!!!

The rationale for the above stems from multiple factors:

• Currently with RS 15.0, RS can best deal with about 2 high volume connections and a total of 10 connections before latency is impacted due to task switching. While more connections may be doable in low volume situations, this is the optimal range.

• As mentioned above, the division of routes allows load balancing of IO processing for the route messages.


Parallel DSI Performance

I turned on Parallel DSI's and didn't get much improvement – what happened? The answer is that if using the default settings, not a whole lot of parallelism is experienced. In order to understand parallel DSI's, a solid foundation in Replication Server internal processing is necessary. This goes beyond just understanding the functions of the internal threads – it also means understanding how the various tuning parameters, as well as the types of transactions, affect replication behavior, particularly the DSI. In the following sections, we will discuss the need for parallel DSI, internal threads, tuning parameters, serialization methods, special transaction processing, and considerations for replicate database tuning.

Need for Parallel DSI

There are five main bottlenecks in the Replication Server:

1. Replication Agent transaction scan/delivery rate
2. Inbound SQT transaction sorting
3. Distributor thread subscription resolution
4. DSI transaction delivery rate
5. Stable Queue/Device I/O rate

In early 10.x versions of Replication Server, it was noticed that the largest bottleneck in high volume systems was #4 – DSI transaction delivery rate. The reason was very simple. At the primary database, performance was achieved by concurrent processes running on multiple engines using a task efficient threading model. On the other hand, at the replicate database, Replication Server was limited to a single process. Consequently, if the aggregate processing at the primary exceeded the processing capability of a single process, the latency would increase dramatically. Much of this time was actually not spent on processing as most replication systems were typically handling simple insert/update/delete statements, but rather the “sleep” time waiting for the I/O to complete. Consider the following diagram.

[Figure content: five high-volume OLTP sources at 100 tpm each (500 tpm aggregate) with balanced work/load in the run/sleep queue, feeding a single DSI capable of roughly 200 tpm max - high sleep time, one cpu busy, RS inbound queue growing steadily, outbound queue steady]

Figure 62 – Aggregate Primary Transaction Rate vs. Single DSI Delivery Rate

It should be noted that in the above figure, the numbers are fictitious. However, it does illustrate how a single threaded delivery process can quickly become saturated. Early responses to this issue "talked" around it by attributing it to Replication Server's ability to "flatten" out peak processing to a more "manageable" steady-state transaction rate. While this may be appealing to some, organizations with 24x7 processing requirements or those with OLTP during the day and batch loading at night quickly realized that this "flattening" required a lull time of little or no activity during which replication could catch up. Due to their normal information flow, these organizations did not have that lull to provide.

The obvious solution was to somehow introduce concurrency into the replication delivery. The challenge was to do so without breaking the guarantee of transactional consistency. The result was that in version 11.0, Parallel DSI’s were introduced to improve the replication system delivery rates.


Key Concept #25 – Replication/DSI throughput is directly proportionate to the degree of concurrency within the parallel DSI threads.

Parallel DSI Internals

Earlier in one of the first sections of this paper, we discussed the internal processing of the Replication Server. From this aspect, very little is different for Parallel DSI’s, however, considerable skill and knowledge is necessary to understand how these little differences are best used to bring about peak throughput from Replication Server. While this section discusses the internals and configuration/tuning parameters, later sections will focus on the serialization methods as they are key to throughput, as well as tuning Parallel DSI’s.

Parallel DSI Threads

The earlier diagram discussing basic Replication Server internal processing included Parallel DSI's in the illustration (step 11 in the diagram below).

[Figure content: Replication Server internal threading - RepAgent → Rep Agent User → SQM → inbound queue → SQT → Distributor (SRE/TD/MD) → SQM → outbound queue → DSI → parallel DSI-Exec threads → replicate database, with the Stable Device, RSSD, STS memory pool and dAIO; processing steps numbered 1-12]

Figure 63 – Replication Server Internals with Parallel DSI’s

While the DSI thread is still responsible for transaction grouping, etc., it is the responsibility of the DSI Executor threads to perform the function string translation, apply the transactions and perform error recovery. Up to 255 Parallel DSI threads can be configured per connection. However, after a certain number of threads, adding more will not increase throughput.
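Parallel DSI's are enabled on the connection; a minimal hedged sketch follows, using a hypothetical connection name RDS.rdb (the parallel_dsi shortcut and dsi_num_threads parameter are standard connection parameters, but check your RS version's documentation for the exact defaults the shortcut sets):

suspend connection to RDS.rdb
go
-- shortcut that enables parallel DSI with a default set of related parameters
alter connection to RDS.rdb set parallel_dsi to 'on'
go
-- explicitly size the number of parallel DSI threads (up to 255 per the text above)
alter connection to RDS.rdb set dsi_num_threads to '5'
go
resume connection to RDS.rdb
go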

rs_threads processing

As mentioned earlier (and repeatedly), the Replication Server guarantees that transactions are applied in the same order at the replicate as at the primary. At first glance, this would seem an impossible task where Parallel DSI's are employed – a long running procedure on DSI 1 could easily let DSI 2 get ahead of it. To prevent this, Replication Server 12.5 and earlier implemented a synchronization point at the end of every transaction by way of the rs_threads table.

create table rs_threads (
    id   int,          -- thread id
    seq  int,          -- one up used for detecting rollbacks
    pad1 char(255),    -- padding for rowsize
    pad2 char(255),
    pad3 char(255),
    pad4 char(255)
)
go
create unique clustered index rs_threads_idx on rs_threads(id)
go


-- alternative implementation used on servers with >2KB page size
-- contained in the rs_install_rll.sql script
create table rs_threads (
    id   int,
    seq  int,
    pad1 char(1),
    pad2 char(1),
    pad3 char(1),
    pad4 char(1)
)
lock datarows
go
create unique clustered index rs_threads_idx on rs_threads(id)
go

While rs_threads is still used in later versions of RS (i.e. 12.6 and 15.0), an alternative implementation called "DSI Commit Control" is also available and is discussed in the next section. The rs_threads table is manipulated using the following functions, which are used only when Parallel DSI is implemented.

Function Explanation

rs_initialize_threads Used during initial connection to setup rs_threads table. Issued shortly after rs_usedb in the sequence.

rs_update_threads Used by a thread to block its row in the rs_threads table to ensure commit order and also to set the sequence number for rollback detection.

rs_get_thread_seq Used by a thread to determine when to commit by selecting the previous thread’s row in rs_threads.

rs_get_thread_seq_noholdlock Similar to above, but only used when isolation_level_3 is the serialization method.

To understand how this works, consider an example in which 5 Parallel DSI threads are used. During the initial connection processing during recovery, Replication Server will first issue the rs_initialize_threads function immediately after rs_usedb. This procedure simply performs a delete of all rows (a logged delete vs. truncate table due to heterogeneous support), and then inserts a blank row for each DSI, initializing the seq value to 0.

During processing, when Parallel DSI’s are in use, the first statement a DSI issues immediately following the begin transaction for the group is similar to the following:

create procedure rs_update_threads
    @rs_id  int,
    @rs_seq int
as
    update rs_threads set seq = @rs_seq where id = @rs_id
go

Each DSI simply calls the procedure with its thread id (i.e. 1-5 in our example) and the seq value plus one from the last transaction group (the initial call uses a value of 1). Since this update is within the transaction group, it has the effect of locking the thread's row for the transaction group's duration. Following this, the normal transaction statements within the transaction group are sent as usual.

After all the transaction statements have been executed, the DSI then attempts to select the previous thread's row from the rs_threads table using the rs_get_thread_seq function. If the previous thread has not yet committed, the current thread is blocked (due to lock contention) by the update lock held on that row by the previous thread. If the previous thread has committed, the lock is not held and the current thread can also commit. Ignoring the effects of serialization method on transaction timing, this can be illustrated as in the diagram below. Note that in each case, each subsequent thread is blocked, waiting on the previous thread's update on rs_threads.
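Conceptually, the stream of SQL submitted by one thread (say thread 3) looks something like the sketch below. This is illustrative only - the actual statements come from the rs_begin, rs_update_threads, replicated DML, rs_get_thread_seq and rs_commit function strings, and the table and values shown are hypothetical:

begin transaction
    exec rs_update_threads 3, 42        -- lock thread 3's row and bump its seq value
    -- ...the grouped replicated DML for this transaction group (hypothetical tables)...
    update trades set price = 101.25 where trade_id = 9001
    insert into trade_audit (trade_id, price) values (9001, 101.25)
    -- rs_get_thread_seq against the *previous* thread's row; this call blocks on
    -- thread 2's update lock until thread 2 commits
    exec rs_get_thread_seq 2
commit transaction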


[Figure content: timeline T01-T17 of four parallel threads; each thread issues BT n (rs_begin), UT n (rs_update_threads n), its replicated DML (TX ##), then blocks on GT n (rs_get_thread_seq against the previous thread) until that thread's CT (rs_commit) releases its rs_threads row]

Figure 64 – Parallel DSI Thread Sequencing Via rs_threads

Anyone who has monitored their system and checked object contention probably thought all of the blocking on rs_threads was a problem. As illustrated above, it is actually deliberate. The theory is that transactions can acquire locks and execute in parallel – but due to the rs_threads locking mechanism, the transactions are still committed in order (1-20 in the above). After each thread commits, it then requests the next transaction group from the DSI-S. Note that this happens in commit order; consequently, in an ideal situation, the transaction groups will proceed in sequence through the threads.

The first question that comes to mind for many is: "What happens if one of the threads hits an error and rolls back its transaction? Wouldn't the next thread simply commit?" The answer is no. This is where the seq column comes in, and the reason rs_get_thread_seq has "seq" in its name becomes clear. As each rs_get_thread_seq function call is made, it returns the seq column for the previous thread. This value is simply compared to the previous value. If it is equal to the previous value, then an error must have occurred and subsequent transactions need to roll back as well. However, if the seq value is higher than the previous seq value for that thread, then the current thread can commit its transaction.


[Figure content: flowchart - rs_begin → rs_update_threads n → replicated transactions → rs_get_thread_seq n-1 (blocked until the previous thread commits); if the returned seq is greater than the previous value, commit the transaction, otherwise roll back and suspend the connection]

Figure 65 – rs_get_thread_seq and seq value comparison logic

It should be emphatically stated that:

1. Blocking on rs_threads is NOT an issue – it is deliberate and precisely used to control the commit order. Threads will block until their turn to commit.

2. Deadlocks raised involving rs_threads do not imply that rs_threads is an issue. Instead, they are an indicator that the statement the deadlock surfaced with is encountering contention due to out-of-sequence execution.

To put it simply, rs_threads is NEVER the issue!!! To find the real cause of concern, you can monitor the true contention through monDeadlocks and monOpenObjectActivity as well as by watching monProcessWaits and monLocks - especially if the replicate database is also used by end-users for reporting or if maintenance activities are being performed. Techniques for finding the true causes of deadlocks/contention are discussed below in the section "Resolving Parallel DSI Contention".
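A quick starting point when such blocking is suspected is simply to see who is blocked and by whom. A minimal sketch follows, using only master..sysprocesses (the MDA tables mentioned above give finer detail but require monitoring to be enabled and mon_role):

-- list sessions currently waiting on a lock and the spid holding it; if the blocking
-- spid is not one of the parallel DSI maintenance-user connections, the contention is
-- external (reports, maintenance jobs) rather than intra-DSI
select  blocked_spid  = p.spid,
        blocking_spid = p.blocked,
        login         = suser_name(p.suid),
        cmd           = p.cmd
from    master..sysprocesses p
where   p.blocked != 0
order by p.blocked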

DSI Commit Control

So, then, if rs_threads is not the issue, why was DSI Commit Control implemented? The rationale stems from several reasons:

1. If there is intra-thread contention, it is handled by causing a deadlock. ASE chooses the deadlock victim according to its own algorithm, which favors longer running tasks – which in this case is probably the task that should have waited – consequently, often the wrong task is rolled back as the deadlock victim. This adds additional work to the re-submittal of the SQL batches involved.

2. Since RS knows the commit sequence, if contention does occur under DSI Commit Control, only the offending thread and subsequent threads need to be rolled back. The blocked thread and any others up to the blocking thread can continue.

3. The logic for rs_threads is heavily dependent on the ASE locking scheme, and consequently does not lend itself to heterogeneous situations.

4. For very short transactions with small or no transaction grouping, the rs_threads activity adds significantly to the IO processing of replication.

As a result, DSI Commit Control was implemented in RS 12.6 as a more internal means of controlling contention detection and resolution between Parallel DSI's. The implementation is as follows:

1. Each thread submits its batch of SQL as usual.

2. After the batch has completed execution, the thread checks to see if the previous thread has committed. If so, the current thread can simply go ahead and commit.

3. If the previous thread has not committed, the current thread issues the rs_dsi_check_thread_lock function to see if the thread's SPID is blocking another DSI thread.


4. If rs_dsi_check_thread_lock returns a non-zero number, the thread rolls back its transaction.

5. If rs_dsi_check_thread_lock returns 0, it waits dsi_commit_check_locks_intrvl milliseconds and then checks again to see if the previous thread has committed, re-issuing rs_dsi_check_thread_lock if not.

6. Step 5 is repeated dsi_commit_check_locks_max times, after which the batch is rolled back regardless.

This can best be illustrated by the following flow-chart:

[Figure content: flowchart - execute SQL; if the previous thread has committed, commit; otherwise issue rs_dsi_check_thread_lock - a result greater than 0 causes a rollback/abort, a result of 0 causes a wait of dsi_commit_check_locks_intrvl and a re-check, repeated up to dsi_commit_check_locks_max times]

Figure 66 - Commit Control Logic Flow

Note that of course if the thread is blocked, it does not get out of the first stage (executing SQL) until the contention is resolved. Additionally, note that if the threads commit quickly, there also is no delay at all.

The first question that might be asked is “How would a thread know the previous thread had committed?” Referring back to the earlier diagram, as each thread commits, it sends an acknowledgement to the DSI-S before doing post-transaction clean-up and sending a “thread ready” message.

Figure 67 – Logical View of DSI & DSIEXEC Intercommunications

From the above diagram, you can see how it would be fairly simple for the DSI-S to withhold the "Commit" message from a subsequent thread until it gets a "Committed" message from the previous thread. The only issue then is to determine when a later thread is blocking an earlier thread, resulting in an application deadlock - the earlier thread is blocked and the later thread is waiting for it to finish - hence rs_dsi_check_thread_lock.


On the plus side, rs_threads distinctly focuses in on the exact threads with contention, and execution continues as soon as the contention is lifted. The default function string provided for RS 12.6 is much less specific – and in fact may lead to excessive false rollbacks simply due to contention between the RS and other processes. The definition is:

alter function string rs_dsi_check_thread_lock
for sqlserver_function_class
output language
'
select count(*) "seq"
  from master..sysprocesses
 where blocked = @@spid
'

As noted, this would return a non-zero value whenever the DSI thread was blocking any other user - for example, someone running a report or trying to do table maintenance. Consequently, a slight alteration would achieve the desired effect of only reporting a block when the blocked session is another maintenance user transaction:

alter function string rs_dsi_check_thread_lock
for sqlserver_function_class
output language
'
select count(*) "seq"
  from master..sysprocesses
 where blocked = @@spid
   and suid = suser_id()   -- added to detect only maintenance user blocks
'

As this statement may get executed extremely frequently, the recommended approach is to actually use a stored procedure and a modified function string definition that calls it such as:

-- procedure modification
-- add to rs_install_primary.sql (rsinspri.sql on NT)
create procedure rs_dsi_check_thread_lock
as
begin
    select count(*) "seq"
      from master..sysprocesses
     where blocked = @@spid
       and suid = suser_id()
    return 0
end
go

-- install in RS
-- function string modification
alter function string rs_dsi_check_thread_lock
for rs_default_function_class
output language
'
exec rs_dsi_check_thread_lock
'
go

The rationale is that this avoids optimizing the above SQL statement every 100 milliseconds or whatever dsi_commit_check_locks_intrvl is set to.

One important note: in addition to the modification needed for rs_dsi_check_thread_lock, the default configuration values are likely too high to provide effective throughput as well. The biggest problem is that the default value for dsi_commit_check_locks_intrvl is set to 1000ms, or 1 second. This is likely too long to wait by a full order of magnitude, as any contention will result in the thread waiting 1 second as well as delaying subsequent threads from committing. To understand the magnitude of the problem, consider what would happen if 5 threads were being used and the first thread had a long running transaction. As a result, threads 2-5 would each execute the rs_dsi_check_thread_lock function and wait for 1 second. As soon as thread 1 commits, it still could be up to 1 second before thread 2 commits due to waiting on dsi_commit_check_locks_intrvl. Note that thread 3 is waiting on thread 2; consequently, depending on the timing of the rs_dsi_check_thread_lock calls, thread 3 could be delayed up to 1 second after thread 2, and so forth. The net result is that the maximum delay will be:

max_delay=(num_dsi_threads-1) * dsi_commit_check_locks_intrvl

So with 5 threads, the maximum delay at the default settings would be 4 seconds - in a high volume system, several thousand SQL statements could have been executed during this period. As a result, a better starting value for dsi_commit_check_locks_intrvl is likely 100ms or even less. The remaining problem is that this method depends on the speed of materializing the master..sysprocesses virtual table.


On replicate systems used for reporting, this could result in a considerable number of rows that then have to be table scanned for the values (virtual tables such as sysprocesses do not support indexing).
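If commit control is retained, the interval can be lowered on the connection. A hedged sketch with a hypothetical connection name RDS.rdb follows (connection parameter changes generally require suspending and resuming the connection to take effect):

suspend connection to RDS.rdb
go
-- drop the commit-check interval from the 1000ms default to 100ms as suggested above
alter connection to RDS.rdb set dsi_commit_check_locks_intrvl to '100'
go
resume connection to RDS.rdb
go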

There is another problem: “false blocking”. If an earlier thread acquires a lock and blocks a later thread, this should be expected and not an issue. However, the statement above would detect that a blocked user existed. Consider the following scenario:

1. Thread #1 starts processing and is executing a larger than average transaction, or one that executes longer than normal due to a replicated procedure or an invoked trigger.

2. Thread #2 completes its transaction; in the process, it acquires locks that block thread #3.

Thread #2 checks the commit status of thread #1 and sees that it isn't ready to commit, so it then issues rs_dsi_check_thread_lock - which returns a non-zero number since thread #3 is blocked. The result is predictable. One might think that this is easily rectified by returning the spid being blocked. However, it is likely that this could be a deadlock chain - such as #2 blocking #3, which is in turn blocking #1. Without knowing all the spids for previous threads and traversing the full chain, there is no way for a thread to know if the block is a real problem or not. The net result is a rollback when none is necessary.

Thread Sequencing

As mentioned, the parallel transactions are submitted to each of the threads in order. Now that we understand how they commit in order, it might help to understand how they start in order. The key to thread sequencing is to understand that, based on the dsi_serialization_method, parallel threads can start when the previous thread has reached one of three states:

Ready to Commit - In this scenario, subsequent threads can start only when the previous thread has submitted all its transaction batches successfully, received a successful rs_get_thread_seq result, and is ready to send the rs_commit function. NOTE: A common misconception is that this implies the previous thread has committed - in reality, it is merely ready to commit.

Started - In this scenario, subsequent threads can start only after the previous thread has already started.

When Ready - In this scenario, threads can start at any point as soon as they are ready. This doesn't change the commit order; it merely allows a thread to start when it is ready vs. waiting for another thread.

This coordination is done by the DSI-Scheduler. If you look back at the earlier detailed diagram of the DSI execution flow, each DSIEXEC sends messages back to the DSI-S informing it of the current status of its processing.

Figure 68 – Logical View of DSI & DSIEXEC Intercommunications

Based on the above diagram, you can see how commit control would work from an internals perspective - each subsequent thread to be committed simply would not be told to commit (step 11) until the previous thread had successfully committed (step 13). From the perspective of thread sequencing, the thread at the bottom (with no lines to it) could begin executing at the following points:

Ready to Commit - In this scenario, thread #2 would have to wait until the ‘Commit Ready’ (step 10) message was received by the DSI-S. When the DSI-S got the ‘Commit Ready’ message from thread #1, it would send ‘Begin Batch’ message to thread #2 - assuming it had received a ‘Batch Ready’ message from thread #2.


Started - In this scenario, thread #2 would only wait until the ‘Batch Began’ (step 7) message was received by the DSI-S. When the DSI-S got the ‘Batch Began’ message from thread #1, it would send ‘Begin Batch’ message to thread #2 - again, assuming that it had received a ‘Batch Ready’ message from thread #2.

When Ready - In this scenario, threads can start at any point as soon as they are ready. Consequently, when thread #2 sends its 'Batch Ready' message, the DSI-S immediately replies with 'Begin Batch'.

Note that the ‘batch’ we are discussing is only the first batch. Subsequent command batches are sent until the thread reaches the end and is ready to commit. The purpose for command batch sequencing is to try to control contention by proper execution. The basic premise is this. If the first transaction group is allowed to start in its proper order, it will acquire the locks it needs first. Subsequent threads will simply block vs. deadlocking. However, the problem with this theory is that it depends largely on the following factors:

Transaction Group Size - Essentially, how large the transaction group is in terms of the number of statements. If the transaction groups are submitted nearly in parallel, the first batch of SQL statements in each thread logically should follow the last batch from the previous thread. However, they are being executed first, resulting in an overlap that raises the vulnerability to a deadlock. The larger the transaction groups, the greater this vulnerability.

Long Running SQL - If a thread executes a long running statement - such as a stored procedure, or if an invoked trigger runs long - the likelihood is that subsequent threads will get ahead of the first thread and most likely be ready to commit (waiting on rs_threads or commit control) by the time the first thread completes the long running statement. As a result, any other statements left to be executed by the first thread increase the vulnerability to a rollback due to a deadlock.

ASE Execution Scheduling - As each statement is executed, it is likely that logical and/or physical IO’s will need to be performed. As a result, the SPID for the DSI thread is put to sleep pending the IO and execution moves to the next task on the ASE run queue. When the IO has completed, the thread is woken up and put on the runnable queue for processing. However, it is likely that multiple DSI threads will be waiting for IO concurrently. Note that ASE doesn’t know the ideal execution order based on the DSI pattern, so ASE can wake up any one of them in any order, resulting in out of order execution.

DSI Transaction Grouping - After each complete execution, the parallel DSI thread needs to get the next batch of transactions from the DSI Scheduler. If insufficient cache or time was spent grouping the transactions, a transaction group may not be available.

Problems in any one of these areas could lead to a “bursty” behavior in which blocking or commit sequencing results in apparent thread inactivity. The goal then is understanding how the configuration parameters - especially the serialization method - along with replicate DBMS tuning can minimize periods of inactivity enabling maximum parallelism for the transaction profile.

Configuration Parameters

There are several configuration parameters that control Parallel DSI’s.

Parameter (Default) Explanation

batch_begin Default: on; Recommended: (see text)

Indicates whether a begin transaction can be sent in the same batch as other commands (such as insert, delete, and so on). While it is unarguable that it should be 'on' for non-parallel DSI and for parallel DSI's using a wait_for_commit serialization method, there is currently disagreement about whether having this enabled for parallel DSI serialization methods such as wait_for_start delays the begin sequencing.

dsi_commit_check_locks_intrvl Default: 1000ms; Recommended: 50-100ms

The number of milliseconds (ms) the DSI executor thread waits between executions of the rs_dsi_check_thread_lock function string. Used with parallel DSI. Default: 1000ms (1 second); Minimum: 0; Maximum: 86,400,000 ms (24 hours)

dsi_commit_check_locks_logs Default: 200; Recommended: <100 (see text)

The number of times the DSI executor thread executes the rs_dsi_check_thread_lock function string before logging a warning message. Used with parallel DSI. This should be set to a value that produces a log warning after 3-5 seconds to provide an earlier indication of an issue. To arrive at this value, simply divide 3000 (3 seconds in milliseconds) by dsi_commit_check_locks_intrvl. Likely this will be a number <100. Default: 200; Minimum: 1; Maximum: 1,000,000

dsi_commit_check_locks_max Default: 400; Recommended: (see text)

The maximum number of times a DSI executor thread checks whether it is blocking other transactions in the replicate database before rolling back its transaction and retrying it. Used with parallel DSI. Note that at the default setting of 1000ms for dsi_commit_check_locks_intrvl, the default setting of 400 becomes 400 seconds, or 6.667 minutes - which is far, far too long. The maximum should be reached within 5-10 seconds or less - the shorter end especially for pure DML (insert, update, delete). Again, if we use 10 seconds as our maximum, to derive the value we simply divide 10,000ms by dsi_commit_check_locks_intrvl - at 100ms, the answer would be 100. Default: 400; Minimum: 1; Maximum: 1,000,000

dsi_commit_control Default: on; Recommended: (see text)

Specifies whether commit control processing is handled internally by Replication Server using internal tables (on) or externally using the rs_threads system table (off). Recommendation is based on your preference as both mechanisms have positives and negatives as discussed above. Default: on

dsi_isolation_level Default: DBMS dependent; Recommended: 1

Specifies the isolation level for transactions. The ANSI standard and Adaptive Server supported values are:
0 - ensures that data written by one transaction represents the actual data.
1 - prevents dirty reads and ensures that data written by one transaction represents the actual data.
2 - prevents nonrepeatable reads and dirty reads, and ensures that data written by one transaction represents the actual data.
3 - prevents phantom rows, nonrepeatable reads, and dirty reads, and ensures that data written by one transaction represents the actual data.
Data servers supporting other isolation levels are supported as well through the use of the rs_set_isolation_level function string. Replication Server supports all values for replicate data servers. The default value is the current transaction isolation level for the target data server.

dsi_keep_triggers Default: on (except standby databases); Recommended: off

Specifies whether triggers should fire for replicated transactions in the database. Set off to cause Replication Server to set triggers off in the Adaptive Server database, so that triggers do not fire when transactions are executed on the connection. While the documentation suggests setting this on for all databases except standby databases, the reality is that unless you are doing procedure replication - or are relying on triggers at the replicate to populate tables that are not themselves replicated - this can safely be set to 'off'.

dsi_large_xact_size Default: 100; Recommended: 10,000 or 2,147,483,647 (max)

The number of commands allowed in a transaction before the transaction is considered to be large for using a single parallel DSI thread. The minimum value is 4. The default is probably far too low for anything other than strictly OLTP systems. The initial recommendation is to raise this to 2 billion and thereby prevent this feature from kicking in at all, as it has little real effect; however, if the application does have some poorly designed large transactions, setting this to a number much higher than the default might still help reduce DSI latency, since the DSI would otherwise be waiting on a commit before it even starts.

dsi_max_xacts_in_group Default: 20; Recommended: (see text)

Specifies the maximum number of transactions in a group. Larger numbers may improve data latency at the replicate database. Range of values: 1 - 100. The reason this is mentioned here at all is because of the impact on parallel DSI's. In non-parallel DSI environments, setting this higher may help throughput. In parallel DSI environments - especially those involving a lot of updates or deletes - this may have to be set considerably lower (i.e. 5-10). A common mistake is setting this to 100 and using a single DSI instead of attempting parallel DSI's and a lower value. While 100 may work in some instances, all too often grouping rules make it difficult to achieve; hence, parallel DSI's are a better approach than increasing this value significantly.

dsi_num_large_xact_threads Default: 2 if parallel_dsi is set to true; Recommended: 0 or 1 (see text)

The number of parallel DSI threads to be reserved for use with large transactions. The maximum value is one less than the value of dsi_num_threads. More than 2 are probably not effective. If dsi_large_xact_size is set to 2 billion, this should be set to 0. If attempting some large transactions, likely 1 is the best setting. See sub-section on Large Transaction Processing later in this section for details.

dsi_num_threads Default: 1 if no parallel DSI’s; 5 if parallel_dsi is set to true; Recommended (see text)

The number of parallel DSI threads to be used. The maximum value is 255. See section on parallel DSI for appropriate setting - but it is likely that the default is too low for high performance situations.

dsi_partitioning_rule Default: none; Recommended (see text)

Specifies the partitioning rules (one or more) the DSI uses to partition transactions among available parallel DSI threads. Values are origin, ignore_origin, origin_sessid, time, user, name, and none. See the Replication Server Administration Guide Volume 2 for detailed information. The recommended setting is to leave this set to none unless using parallel DSI’s and experiencing more than 1 rollback every 10 seconds. Then try the combination origin_sessid, time.

dsi_serialization_method Default: wait_for_commit; Recommended: wait_for_start

The method used to maintain serial consistency between parallel DSI threads when applying transactions to a replicate data server. Values are:
isolation_level_3 - specifies that transaction isolation level 3 locking is to be used in the replicate data server.
single_transaction_per_origin - prevents conflicts by allowing only one active transaction from a primary data server.
wait_for_commit - maintains transaction serialization by instructing the DSI to wait until one transaction is ready to commit before initiating the next transaction.
none/wait_for_start - assumes that your application is designed to avoid conflicting updates, or that lock protection is built into your database system.
no_wait - threads begin as soon as they are ready vs. waiting for previous threads to at least start, as with the other settings.
See the sub-section on Serialization Methods.

dsi_sqt_max_cache_size Default: (0); Recommended: 4-8MB

Maximum SQT (Stable Queue Transaction interface) cache memory for the database connection, in bytes. The default, "0," means that the current setting of sqt_max_cache_size is used as the maximum cache size for the connection. This parameter controls the use of parallel DSI threads for applying transactions to a replicate data server. The more DSI threads you plan on using, the more dsi_sqt_max_cache_size you may need.

parallel_dsi Default: off; Recommended: (see text)

Provides a shorthand method for configuring parallel DSI threads. A setting of "on" configures these values:
dsi_num_threads = 5
dsi_num_large_xact_threads = 2
dsi_serialization_method = "wait_for_commit"
dsi_sqt_max_cache_size = 1 million bytes
A setting of "off" configures these parallel DSI values to their defaults. You can set this parameter to "on" and then set individual parallel DSI configuration parameters to fine-tune your configuration.

As illustrated by the single parameter "parallel_dsi", many of these work together. Note that parallel_dsi sets several configuration values to what appear to be fairly low numbers. However, given the serialization method it selects, these settings are typically optimal. More DSI threads will not necessarily improve performance.
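To make the interplay concrete, the following is a minimal RCL sketch (not a definitive recommendation) of how these parameters might be combined for a hypothetical replicate connection RDS.rdb, using the commit-check arithmetic described in the table above. The specific values (10 threads, 100ms interval) are illustrative assumptions; verify parameter names and suspend/resume requirements against your Replication Server version before applying.

suspend connection to RDS.rdb
go
-- start from the parallel_dsi shorthand, then override individual values
alter connection to RDS.rdb set parallel_dsi to 'on'
go
alter connection to RDS.rdb set dsi_serialization_method to 'wait_for_start'
go
alter connection to RDS.rdb set dsi_num_threads to '10'
go
alter connection to RDS.rdb set dsi_num_large_xact_threads to '0'
go
alter connection to RDS.rdb set dsi_isolation_level to '1'
go
alter connection to RDS.rdb set dsi_keep_triggers to 'off'
go
-- commit-check settings: check every 100ms; warn after ~3 seconds (3000/100 = 30);
-- roll back and retry after ~10 seconds (10000/100 = 100)
alter connection to RDS.rdb set dsi_commit_check_locks_intrvl to '100'
go
alter connection to RDS.rdb set dsi_commit_check_locks_logs to '30'
go
alter connection to RDS.rdb set dsi_commit_check_locks_max to '100'
go
resume connection to RDS.rdb
go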

Serialization Methods

Key Concept #26 – Serialization Method has nothing to do with transaction commit order. No matter which serialization method, transactions at the replicate are always applied in commit order. However, it does control the timing of transaction delivery with Parallel DSI’s in order to reduce contention caused by conflicts between the DSI’s.

One of the most difficult concepts to understand is the difference between the serialization methods. The best way to describe this is that the serialization method you choose depends on the amount of contention that you expect between the parallel threads. Some of this you can directly control via the dsi_max_xacts_in_group tuning parameter: the more transactions grouped together, the higher the probability of contention between the parallel threads as the degree of parallelism increases - and the higher the probability of contention with other users on the system. This will become more apparent as each of the serialization methods is described in more detail in the following sections.

wait_for_commit

The default setting for dsi_serialization_method is “wait_for_commit”. This serialization method uses the ‘Ready to Commit’ transaction sequencing as it assumes that there will be considerable contention between the parallel transactions. As a result, the next thread’s transaction group is not sent until the previous thread’s statements have all completed successfully and it is ready to commit. This results in the thread timing in which execution is more staggered than parallel as illustrated below.

Figure 69 – Thread timing with dsi_serialization_method = wait_for_commit

As discussed earlier, this timing sequence would have limited scalability beyond 3-5 parallel DSI threads. However, it assures that contention between the threads does not result in one rolling back - which would cause all those that follow to roll back as well.

none/wait_for_start

"None" does not imply that no serialization is used. What it really means is that no (none) contention is expected between the Parallel DSI threads. As a result, the thread transactions are submitted nearly in parallel based on transaction sequencing on the begin statement (timing then being more a factor of the number of transactions in the closed queue in the SQT cache), with each thread waiting until the previous thread has begun (hence the new name 'wait_for_start' vs. the legacy term 'none'). This looks similar to the illustration below.

Figure 70 – Thread timing with dsi_serialization_method = none

However, in certain situations, "none" could result in an inconsistent database - see the section on isolation_level_3 for details. As this is the first method discussed that has a high degree of parallelism, let's take a look at contention and how it can be reduced to limit the number of rollbacks.

isolation_level_3

Let's first dispel a common misconception. When purely replicating DML (i.e. inserts, updates, and deletes), most people would think that isolation_level_3 will be slower than 'none' as a serialization method and invoke considerably more blocking with non-replication processes at the replicate. This is an absolute falsehood for the following reasons:

1. Since RS delivers all changes inside of a transaction and DML statement locks are held during the duration of the transaction, there is no difference in lock hold times, etc.

2. Since DRI constraints hold their locks until the end of the transaction, the same is true for DRI constraint locks.

As a result, there is no difference between 'isolation_level_3' and 'none' from a performance perspective; however, isolation_level_3 is the safer - and consequently the Sybase-recommended - setting up through RS 12.5. There is one exception to this statement: when DOL locking is involved (and unfortunately datarows locking is likely needed to support parallel DSI's, so this is likely). In this case, isolation level 3 locking includes some additional locks - namely range and infinity (or next key) locks. Normally, again, in pure DML replication this likely will have minimal if any impact. However, if triggers are still enabled, or procedure replication is involved, the hold time for these locks can be extended - not only causing contention with reporting users - but also between the different parallel DSI threads.

It should also be noted that isolation_level_3 as a dsi_serialization_method is a bit of an anachronism in RS 15.0. While it is still available to support legacy configurations, the impact is the same as setting the serialization method to wait_for_start and setting dsi_isolation_level to 3.
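As a hedged illustration of that equivalence (the connection name RDS.rdb is hypothetical, and the connection would normally be suspended before and resumed after the change):

-- legacy style
alter connection to RDS.rdb set dsi_serialization_method to 'isolation_level_3'
go

-- RS 15.0 equivalent
alter connection to RDS.rdb set dsi_serialization_method to 'wait_for_start'
go
alter connection to RDS.rdb set dsi_isolation_level to '3'
go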

Serialization method “isolation_level_3” is identical to “none” with the addition that Replication Server first issues “set transaction isolation level 3” via rs_set_isolation_level_3 function. However, as one would expect, this could increase contention between threads dramatically due to select locks being held throughout the transaction – but ONLY if the replicated operations invoke or contain select statements (i.e. replicated procedures). While Replication Server is normally associated with write activity, a considerable amount of read activity could occur in the following:

• Declarative integrity (DRI always holds locks)
• Select statements inside replicated procedures
• Trigger code if not turned off for the connection
• Custom function strings
• Aggregate calculations, etc.

Consequently, care should be taken when replicating stored procedures, etc. to ensure that isolation_level_3 is actually necessary to provide repeatable reads from the perspective of the replicated transactions, and that the extra lock hold time for selects in the procedure will not increase contention. Consider for example the normal "phantom read" problem where a process scanning the table reads a row - the row is moved as a result of an update, and then the row is re-read. In a normal system, this is simply avoided by having the scanning process invoke isolation level 3 via the set command. However, if you think about it, no one ever mentions having the offending writer invoke isolation level 3. The reason is that it would be unnecessary: once the reader scans the row to be updated, it holds the lock until the read completes - thereby blocking the writer and preventing the problem. In this case, most of Replication Server's transactions will be as the writer, so it is probably in the same role as the offending writer in the phantom read - no isolation level three required.

Of course, the most obvious example of when isolation level 3 is normally thought of is when performing aggregation for data elements that are not aggregated at the primary and consequently the replicate may have to perform a repeatable read as part of the replication process. This could be a scenario similar to replicating to a DSS system or a denormalized design in which only aggregate rollups are maintained. Even in these cases however, isolation level 3 may not be necessary as alternatives exist. Consider the classic case of the aggregate. Let’s assume that a bank does not keep the “account balance” stored in the primary system (possibly because the primary is a local branch and may not have total account visibility??). When replicating to the central corporate system, the balance is needed to ensure timely ATM and debit card transactions. Of course, this could be implemented as a repeatable read triggered by the replicated insert, update, delete or whichever DML operation. However, it is totally unnecessary. Because Replication Server has access to the complete before and after images of the row, a function string similar to the following could be constructed:

alter function string <repdef_name>.rs_update
for rs_default_function_class
output language
'update bank_account
    set balance = balance - (?tran_amount!old? - ?tran_amount!new?)
    where <pkeycol> = ?tran_id!new?'

This maintains the aggregate without isolation level three required – and much more importantly – without the expensive scan of the table to derive the delta. By exploiting function strings – or by encapsulating the set isolation command within procedure or trigger logic, you may find that you can either avoid using isolation level three or restrict it only to those transactions from the primary that truly need it.

In summary, in addition to the contention increase simply from holding the locks on select statements, a possibly bigger performance issue when isolation level three is required is the extra i/o costs of performing the scans that the repeatable reads focus on – all within the scope of the DSI transaction group. Although isolation_level_3 is currently the safest parallel DSI serialization setting, if it is needed to ensure repeatable reads for aggregate or other select-based queries invoked in replicated procedures or triggers – the primary goal should be to see if a function string approach could eliminate the repeatable read condition. Once eliminated, isolation_level_3 can be set safely without any undue impact on performance.

single_transaction_per_origin

Similar to isolation_level_3, single_transaction_per_origin is outdated in RS 15.0. The same effect could be implemented by setting the dsi_serialization_method to “wait_for_start” and setting dsi_partitioning_rule to “origin”.
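Again as a hedged sketch against a hypothetical connection RDS.rdb, the legacy setting and its RS 15.0 equivalent (suspend/resume of the connection omitted for brevity):

-- legacy style
alter connection to RDS.rdb set dsi_serialization_method to 'single_transaction_per_origin'
go

-- RS 15.0 equivalent
alter connection to RDS.rdb set dsi_serialization_method to 'wait_for_start'
go
alter connection to RDS.rdb set dsi_partitioning_rule to 'origin'
go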

The single_transaction_per_origin serialization method is mainly used for corporate rollup scenarios. Although clearly applicable for corporate rollups, another implementation for which single_transaction_per_origin works well is the shared primary or any other model in which the target replicated database is receiving data from multiple sources.

Figure 71 – Corporate Rollup or Shared Primary scenario

In the above example, not all of the routes between the sites that would normally be present in a shared primary are illustrated, simply for image clarity.

In this serialization method, since the transactions are from different origin databases, there should not be any contention between transactions. For example, stock trades in Chicago, San Francisco, Toronto, Tokyo and London are completely independent of each other - consequently their DML operations would not interfere with each other except in cases of updates of aggregate balances. However, within each site - for example, transactions from Chicago - some significant amount of contention may exist. By only allowing a single transaction per origin, each DSI could simply be processing a different site's transactions - consequently, the transaction timing is similar to none or isolation_level_3 in that the Parallel DSI threads are not staggered waiting for the previous commit. From an internal threads perspective, it would resemble:

Figure 72 – Internal threads processing of single_transaction_per_origin

Note that the above diagram suggests that each DSI handles a separate origin regardless. This is not quite true and is just an illustration. The real impact of single_transaction_per_origin is that if an origin already has a transaction group in progress on one DSIEXEC thread and another transaction arrives from the same origin, that transaction is applied as if the serialization method were "wait_for_commit" instead. However, if the next transaction is from a different origin, it can be applied in parallel.

From a performance perspective, single_transaction_per_origin may not have as high of a throughput as other methods such as none. Consider the following:

Origin Transaction Balance - single_transaction_per_origin works best in situations where all the sites are applying transactions evenly. In global situations where the normal workday at one location is offset from the other sites', this is not true. Instead, all of the transactions for a period of time come from the same origin - and consequently are single threaded.

Single Origin Error - Consider what happens if one of the replicated transactions from one of the sites fails for any reason. All DSI threads are suspended and the queue fills until that one site's transaction is fixed and the connection is resumed. This could cause the outbound and inbound queues to rapidly fill - possibly ending up with a primary transaction log suspend.

Origin Transaction Rate - Again, each individual site effectively has only a single DSI out of all the parallel DSI's to use. If the source system has a very high transaction volume, the outbound queue will get behind quickly.

Any one of these situations is fairly common and could cause apparent performance throughput to appear much lower than normal. While the error handling is easily spotted from the Replication Server error log, the source transaction rate or the balance of transactions is extremely difficult to determine on the fly.

no_wait

The dsi_serialization_method of no_wait is similar to wait_for_start except that the threads do not wait for the other threads to start - instead they simply start as soon as they are ready. Remember, with wait_for_start or none, each thread waits to begin its batch until the previous thread begins. The result is the slightly staggered starting sequence illustrated a few pages ago, similar to the following:

Figure 73 – Thread timing with dsi_serialization_method = none

no_wait not only eliminates this slight stagger, it also means that since a thread can start when ready, it could even start before the previous thread if the previous thread is not ready for any reason (i.e. still converting to SQL). The result could be something like:

Figure 74 – Thread timing with dsi_serialization_method = no_wait

Note that the commit order is still maintained.

When would you use no_wait vs. wait_for_start?? In an insert intensive environment, no_wait may help. However, in an update intensive environment, because the probability of a conflicting update executing ahead of a previous one is even higher than it was under wait_for_start, no_wait could increase the number of parallel failures/rollbacks.

Dsi_serialization_method summary

So, now that it is better understood – which one to use??? Consider the following table:

Wait_for_commit
    When to use: high contention at primary; low to mid volume
    When not to use: high volume

None/wait_for_start
    When to use: high volume; insert intensive application; commit consistent transactions
    When not to use: short cycle update/DML (unless dsi_isolation_level is set to '3'); transactions that are not commit consistent

Isolation_level_three
    When to use: mid to high volume; ensure database consistency; low cardinality rollup with high volume from each; short cycle update/DML
    When not to use: high number of selects in procs or function strings; satisfiable via before/after image; commit consistent transactions

Single_transaction_per_origin
    When to use: high cardinality rollup with low volume from each
    When not to use: low cardinality rollup with high volume from each

As you can tell, it simply depends on the transaction profile from the source system.

Transaction Execution Sequence

However, there are transaction profiles that must use dsi_serialization_method=isolation_level_3 vs. dsi_serialization_method='none'. Alternatively, they can use none/wait_for_start, but they must have dsi_isolation_level set to '3', which has the same effect.

“Disappearing Update” Problem

Consider the following scenario of atomic transactions at the primary:

-- assume a table similar to:
create table tableX (
    col_1 int not null,
    col_2 int not null,
    col_3 varchar(255) null,
    constraint tableX_PK primary key (col_1)
)

insert into tableX (col_1, col_2, col_3) values (1, 2, "this is a test string")
update tableX set col_2 = 5, col_3 = "dummy row" where col_1 = 1

One would always expect the resulting row tuple to be {1, 5, "dummy row"} - however, it is possible that the result at the replicate could be {1, 2, "this is a test string"}, as if the update never occurred. The reason for this is the timing of the transactions. If parallel DSI's are used and the dsi_serialization_method of "none" is selected, the SQL statements are executed OUT OF ORDER - but committed in SERIALIZED ORDER. This is a big difference. If the insert is the last SQL statement in one group and the update is the first statement in the next group, the update will physically occur BEFORE the insert. Consider the following picture:

Figure 75 – Statement Execution Sequence vs. dsi_serialization_method=none

(Legend: BT n = rs_begin for the transaction on thread n; CT n = rs_commit for the transaction on thread n; TX ## = replicated DML transaction ##; UT n = rs_update_threads for thread n; GT n = rs_get_thread_seq for thread n. In the figure, the insert into tableX is the last statement of one thread's group while the update of tableX is the first statement of the next thread's group.)

Many of you may already see the problem. The update effectively sees 0 rows affected, consequently the insert physically occurs later and the values are never updated. But wait….shouldn’t the update block the insert???

Locking in ASE

No. As of SQL Server 10.0, Sybase stopped holding these locks unless isolation level 3 is enabled; consequently, the above could happen. Many people state that situations like this are not described in the books - but they are (as is nearly all the material in this paper). Consider the following description from the Replication Server Administration Guide, located in the section describing Parallel DSI Serialization Methods (in the Performance and Tuning chapter) - in particular, the description for "none". It reads:

This method assumes that your application is designed to avoid conflicting updates, or that lock protection is built into your database system. For example, SQL Server version 4.9.2 holds update locks for the duration of the transaction when an update references a nonexistent row. Thus, conflicting updates between transactions are detected by parallel DSI threads as deadlocks. However, SQL Server version 10.0 and later does not hold these locks unless transaction isolation level 3 has been set.

For replication to non-Sybase databases, transaction serialization cannot be guaranteed if you choose either the "none" (no coordination) or the "isolation_level_3" method, which may not be supported by non-Sybase databases.

The highlighted section probably makes a lot more sense now to those who read it in the past and wondered. So, in the above illustration, if the dsi_serialization_method was set to isolation_level_3, the update would hold the locks and consequently the insert would block, resulting in a deadlock as discussed in the last section. The result would be the typical rollback and serial application - and all will be fine.

The DOL/RLL Twist

An aspect that caught people by surprise was when this started happening even when using wait_for_commit and DOL tables. In implementing DOL, Sybase ASE engineering introduced several optimizations under isolation levels 1 & 2.

Uncommitted Insert By-Pass – Uncommitted inserts on DOL tables would be bypassed by selects and other DML operations such as update or delete.

Unconflicting Update Return – Select queries could return columns from uncommitted rows being updated if ASE could determine that the columns being selected were not being updated. For example, an update of a particular author’s phone number in a DOL table would not block a query returning the same author’s address.

At this time, there is no proof that the unconflicting update return is a cause of concern. However, in the case of the uncommitted insert by-pass, this broadened the vulnerability to the above problem substantially. Instead of the update or delete having to be executed prior to the insert, any subsequent update or delete would by-pass the uncommitted insert and as a result would return "0 rows affected". Additionally, although the window of opportunity was much narrower, because the vulnerability was exposed until the commit was actually executed, the vulnerability extended to the other serialization methods with the exception of dsi_serialization_method = isolation_level_3. It should be noted that in ASE 11.9-12.5, this optimization can be monitored with trace flag 694 and disabled with trace flag 693. As a result, customers are advised to do one of the following if they find themselves in this situation:

• Use dsi_isolation_level=3 or dsi_serialization_method=isolation_level_3
• Boot the server with -T693 to disable the locking optimizations. This may be preferred if isolation level 3 leads to increased contention with parallel DSI's. (Both options are sketched below.)
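A minimal sketch of both options, assuming a hypothetical connection RDS.rdb and a hypothetical ASE install path; check the exact RUN file location and trace flag behavior for your ASE version:

-- Option 1: force isolation level 3 on the replicate connection (RCL)
alter connection to RDS.rdb set dsi_isolation_level to '3'
go

-- Option 2: disable the DOL optimization at the replicate ASE by adding -T693
-- to the dataserver command line in the RUN_<servername> file, for example:
--   /opt/sybase/ASE/bin/dataserver -sRDS -d/opt/sybase/data/master.dat ... -T693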

Isolation Level 3

The reason isolation level 3 does not experience the problem is intrinsic to the ANSI specification to prevent phantom rows under isolation level 3. In order to prevent phantom rows, a “0 row affected” DML operation must hold a lock where the row should have been to prevent another user from inserting a row prior to commit (and hence a re-read of the table would yield different results violating isolation level 3). This is prevented in ASE via the following methods:

APL (All Page) Locking – The default locking scheme protects isolation level 3 on tables by retaining an “update” lock with an “Index Page” context. An update lock is a special type of read lock that indicates that the reader may modify the data soon. An update lock allows other shared locks on the page, but does not allow other update or exclusive locks. Since it is not always possible to determine the page location (i.e. the end of the table) for the lock, the lock is placed on the index page. This prevents inserts by blocking the insert from inserting the appropriate index keys.

DOL (Data Only) Locking – To ensure isolation levels 2 & 3, Sybase introduced two new types of locking contexts with ASE 11.9.2 – “Range” and “Infinity” locks. A “Range” lock is placed on the next row of a table beyond the rows that qualify for the query. This prevents a user from adding an additional row that would have qualified either at the beginning or end of a range. The “Infinity” lock is simply a special “Range” lock that occurs at the very beginning or end of a table.

Consequently, by retaining these locks, the premature execution of the update will cause a deadlock with rs_threads. As described in the Replication Server Performance & Tuning White Paper, this is deliberate, and results in Replication Server rolling back the transactions and re-issuing them in serial vs. parallel. As a result, if isolation level 3 is set, the above situation becomes:

Figure 76 – Deadlock instead of “disappearing update” with isolation level 3

(Same legend as Figure 75. With isolation level 3, the out-of-order update retains its locks, so the subsequent insert blocks and the conflict surfaces as a deadlock rather than an update that silently affects 0 rows.)

Spurious Duplicate Keys

Although the issue above has gained the most attention as "disappearing updates", and Sybase has been able to identify other situations (such as insert/delete) that could occur, one situation that does not apply is a delete followed by an insert in which the insert is executed first due to the same parallel execution that causes the disappearing update problem. Note that this situation differs significantly from the disappearing update problem; it is related only in terms of execution order and might be perceived as a related problem - but in fact it is not.

This situation could occur when an application might delete and re-insert rows vs. performing an update – somewhat analogous to Replication Server’s “autocorrection” feature. Note that a deferred update is NOT logged nor replicated as a separate delete/insert pair and consequently it must be an explicit delete/insert pair submitted by the application. In this case, referring to the previous drawings, if the insert ended up where the update was illustrated above, it would execute (or attempt to) prior to the delete. As the row is already present, this would result in a duplicate key error being raised by the unique index on the primary key columns. Normally, when any error occurs during parallel execution, the Replication Server logs the error along with a warning that a transaction failed in parallel and will be re-executed in serial. This situation differs from the disappearing update problem in several critical areas:

• While the “disappearing update” problem does not raise an error, the duplicate key insert does in fact raise an error – which causes a rollback of the SQL that is causing the problem.

• Subsequent execution in serial by Replication Server would correctly apply the SQL and the database would not be inconsistent.

Additionally, other than proper transaction management within the application, none of the current proposals for addressing the disappearing update problem would address this issue. Customers witnessing a frequent number of “duplicate key” errors that appear spurious as subsequent execution succeeds should attempt to resolve the problem by ensuring proper transaction management is in place or by other application controls outside the scope of this issue. One frequent fix is to determine if the system is a Warm Standby and if an approximate numeric (float) column exists - if so, the likely cause of the spurious keys is our old friend the missing repdef/primary key identification.
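If the replicate is a Warm Standby, one hedged way to check for approximate numeric columns in user tables is a query against the ASE system catalogs similar to the following (run in the primary database; note that double precision is stored as float, so checking float and real is sufficient):

select o.name as table_name, c.name as column_name, t.name as datatype
from sysobjects o, syscolumns c, systypes t
where o.id = c.id
  and c.usertype = t.usertype
  and o.type = 'U'              -- user tables only
  and t.name in ('float', 'real')
order by o.name, c.name
go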

Estimating Vulnerability

Before eliminating parallel DSI's, you should first assess the vulnerability of your systems. Basically, the above could happen when an update, delete or procedure (containing conditional logic or aggregation) closely follows a previous transaction such that it is within dsi_max_xacts_in_group * dsi_num_threads transactions - but in definitely separate transactions. Examples of applications that might be vulnerable include:

• Applications with poor transaction management, in which atomic SQL statements that are all part of the same logical unit of work are often executed outside the scope of explicit transactions.
• Applications in which explicit transactions were avoided due to contention.
• Common "wizard" based applications if each screen saves its information to the database individually prior to transitioning to the next screen (assuming the following screen may update the same row of information).
• Middle tier components that perform immediate saving of data as each method is called vs. waiting until the object is fully populated.
• A typical job queue application in which a job is retrieved from the queue very quickly after being created (as is normal, retrieving a job usually entails updating the job status).
• Work table data is replicated and then a procedure that uses the work table data is replicated.
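As a hedged illustration of the first bullet above, compare a vulnerable pattern (two separate implicit transactions) with a safer one (a single explicit transaction), reusing the tableX example from the disappearing update discussion:

-- vulnerable: two implicit transactions; the pair may land in different parallel
-- DSI transaction groups and the update may execute before the insert
insert into tableX (col_1, col_2, col_3) values (1, 2, "this is a test string")
go
update tableX set col_2 = 5, col_3 = "dummy row" where col_1 = 1
go

-- safer: one explicit transaction; the pair replicates as a single transaction
-- applied by a single DSI thread in statement order
begin tran
insert into tableX (col_1, col_2, col_3) values (1, 2, "this is a test string")
update tableX set col_2 = 5, col_3 = "dummy row" where col_1 = 1
commit tran
go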

Updates or deletes triggered by an insert would not be such a case, as any triggered DML is included in the implicit transaction with the triggering DML. Specifically, you can determine whether DOL has exposed your application by booting the replicate dataserver with trace flag 694.

In any case, assessing the window of vulnerability finds it extremely small. The conditions that could cause an issue fall into the following categories:

Non-Existent Row – Basically the scenario addressed in the book and illustrated above characterized by an insert followed closely by DML. The lack of a row doesn’t return an error and doesn’t hold the lock when the second statement is executed first. Consequently this scenario is always characterized by an insert followed by update or delete.

Repeatable Read – The typical isolation level three problem as discussed for isolation_level_3 dsi_serialization_method. Basically any DML operation followed closely by a read (either in a replicated proc or a trigger on a different table)

This leads up to:

Key Concept 27: Parallel DSI Serialization does NOT guarantee transactions are executed in the same order – which could lead to database inconsistencies – particularly with dsi_serialization_method=’wait_for_start’ or ‘none’ and dsi_isolation_level other than ‘3’.

If you think about it, we are deliberately executing the transactions somewhat out of order to achieve greater parallelism. If we didn't, then we would be executing them in serial fashion (ala wait_for_commit), which does not achieve any real parallelism. Later (in the Multiple DSI section), we will discuss the concept of "commit consistent". At this point, suffice it to say that if the transactions are not commit consistent, use isolation_level_3.

Large Transaction Processing

One of the most commonly known and frequently hit problems with Replication Server is processing large transactions. In earlier sections, the impact of large transactions on SQT cache and DIST/SRE processing were discussed. This section takes a close look at how large transactions affect the DSI thread. It should be noted that it is at the DSI that a transaction is defined as “large”. While a transaction may be “large” enough to be flushed from the SQT cache – it still can be too small to qualify as a large transaction.

Parallel DSI Tuning

Tuning parallel DSI’s for large transactions is a mix of understanding the behavior of large transactions, particularly in relationship to the dsi_large_xact_size and the SQT open queue processing.

DSI Tuning Parameters

There really only are two tuning parameters for large transactions. Both of these are only applicable to Parallel DSI implementations. The tuning parameters are:

Parameter Definition

dsi_large_xact_size Default: 100; Recommended: 10,000 or 2,147,483,647 (max)

The number of commands allowed in a transaction before the transaction is considered to be large for using a single parallel DSI thread. The minimum value is 4. The default is probably far too low for anything other than strictly OLTP systems. The initial recommendation is to raise this to 2 billion and thereby prevent this feature from kicking in at all, as it has little real effect; however, if the application does have some poorly designed large transactions, setting this to a number much higher than the default might still help reduce DSI latency, since the DSI would otherwise be waiting on a commit before it even starts.

dsi_num_large_xact_threads Default: 2 if parallel_dsi is set to true; Recommended: 0 or 1 (see text)

The number of parallel DSI threads to be reserved for use with large transactions. The maximum value is one less than the value of dsi_num_threads. More than 2 are probably not effective. If dsi_large_xact_size is set to 2 billion, this should be set to 0. If attempting some large transactions, likely 1 is the best setting. See the text in this section for details.

The key tuning parameter of both of these is dsi_large_xact_size. When a transaction exceeds this limit, the DSI processes it as a large transaction. In doing so, the DSI does the following:

1. Allow the transaction to be sent to the replicate without waiting for the commit record to be read.
2. Use a dedicated large transaction DSI thread.
3. Every dsi_large_xact_size rows, attempt early conflict detection.

An important note is that this is only applicable to Parallel DSI. If Parallel DSI is not used, large transactions are processed normally with no special handling.
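A hedged RCL sketch of the two approaches discussed above, for a hypothetical connection RDS.rdb (suspend/resume of the connection omitted):

-- effectively disable large transaction handling
alter connection to RDS.rdb set dsi_large_xact_size to '2147483647'
go
alter connection to RDS.rdb set dsi_num_large_xact_threads to '0'
go

-- or: keep it only for genuinely huge transactions, reserving a single thread
-- alter connection to RDS.rdb set dsi_large_xact_size to '10000'
-- alter connection to RDS.rdb set dsi_num_large_xact_threads to '1'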

Parallel DSI Processing

In addition to beginning to process large transactions before the commit record is seen by the DSI/SQT, if using Parallel DSI’s, the Replication Server also processes the large transaction slightly differently during execution. The main differences are:

• DSI/SQT open queue processing (the DSI doesn't wait for the commit to be seen)
• Early conflict detection
• Utilizes reserved DSI threads set aside for large transactions

SQT Open Queue Processing

The reference manual states that large transactions begin to be applied by the DSI thread before the DSI sees the commit record. While some people misinterpret this to mean that the transaction has yet to be committed at the primary, except in the case of Warm Standby, the transaction has not only been committed, but fully forwarded to the Replication Server. Remember, in order for the inbound queue SQT to pass the transaction to the outbound queue, the transaction had to be committed. However, the DSI could start delivering the commands before the DIST has processed all of the commands from the inbound queue, while a Warm Standby system could be delivering SQL commands prior to the command being committed at the primary. This is accomplished by the DSI processing large transactions from the “Open” queue vs. the more normal “Closed” queue in the SQT cache. Overall, this can significantly reduce latency as the DSI does not have to wait for the full command to be in the queue prior to sending it to the replicate.

However, this does have a possible negative effect in Warm Standby systems in that a large transaction may be rolled back at the primary - and then needs to be rolled back at the replicate. How can this happen??? Simple. Consider the case of a fairly normal bcp of 100,000 rows into a replicated table (slow bcp, so row changes are logged). As the row changes are logged, they are forwarded to the Replication Server by the Rep Agent long before the commit is even submitted to the primary system. At the default setting, after 100 rows have been forwarded to the Replication Server, the transaction would be labeled a large transaction. As a result, the DSI would start applying the transaction's row changes immediately without waiting for a commit (in fact, the commit may not even have been submitted to the primary yet). Now, should the bcp fail due to row formatting problems, it will need to be rolled back - not only at the primary, but also at the replicate, as the transaction has already been started.

With such a negative, why is this done?? The answer is simple – transaction rollbacks in production systems are extremely rare (or should be!!) – therefore this issue is much more of an exception and not the norm. In fact, for normal (non-Warm Standby) replication, the commit had to have been issued at the primary and processed in the inbound queue or it would not have even got to the outbound queue. In addition, the benefit of this approach far outweighs the very small amount of risk. Consider the latency impact of waiting until the commit is read in the outbound queue as illustrated below by the following timeline:

Figure 77 – Latency in processing large transactions

(Timeline rows: Large Xactn at PDS; Rep Agent Processing; Inbound SQT Sort; DIST/SRE; Outbound SQT Sort; DSI -> RDS (normal); DSI -> RDS (large xactn). The large-transaction DSI row begins and finishes well ahead of the normal DSI row.)

If the DSI did not start applying the transaction until the commit was read, several problems would occur. First, as illustrated above, the overall latency of the transaction is extended. In the bottom DSI execution of the transaction (labeled DSI -> RDS (large xactn)), it finishes well before it would if it waited until the transaction was moved to the SQT Closed queue. This is definitely an important benefit for batch processing, to ensure that the batch processing finishes at the replicate prior to the next business day beginning. Consider the above example. If each time unit equaled an hour (although 2 hours for DIST/SRE processing is rather ludicrous) and the transaction began at the primary at 7:00pm, it would finish at the replicate at 7:00am the next morning using large transaction thread processing. Without it, the transaction would not finish at the replicate until 10:00am - 2 hours into business processing.

The latency savings for this is really evident in Warm Standby. Remember, for Warm Standby, the Standby DSI is reading from the inbound queue’s SQT cache. Normal (small) transactions, of course, are not sent to the Standby database until they have committed. However, since a large transaction reads from the SQT “Open” queue, it is fully possible that the Standby system will start applying the transaction within seconds of it starting at the primary and would commit within nearly the same time. Compare the following timeline with the one above.

Figure 78 – Latency in processing large transactions for Warm Standby

(Timeline rows: Large Xactn at PDS; Rep Agent Processing; Inbound SQT Sort; DSI -> RDS (normal); DSI -> RDS (large xactn); dsi_large_xact_size rows scan time.)

However, the above will only happen if large transactions run in isolation. The problem is that if a large transaction begins to be applied and another smaller transaction commits prior to the large transaction, the large transaction is rolled back and the smaller concurrent transaction is committed in order. After the smaller transaction commits, the large transaction does not restart from the beginning automatically - rather, it waits until the commit is actually received before it is reapplied. This is probably due to the expense of large rollbacks and the fact that if the rollback occurs once, it is likely to occur again. This behavior is easily demonstrated by performing the following in a Warm Standby configuration (a SQL sketch of the test follows the list):

1. Configure the DSI connections for parallel DSI using the default parallel_dsi='on' setting.
2. Begin a large transaction at the primary (i.e. a 500 row insert into a table within an explicit transaction). At the end of the transaction place a waitfor delay "00:03:00" immediately prior to the commit.
3. Use a dirty read at the replicate to confirm the large transaction has started.
4. Perform an atomic insert into another table at the primary (allow it to implicitly commit).
5. Use a dirty read at the replicate to confirm the large transaction rolled back and does not restart until the delay expires and the transaction commits.
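A minimal SQL sketch of steps 2-5 above; the table names, row counts, and the source of the 500 rows are illustrative assumptions only:

-- step 2 (at the primary): a large transaction held open by a waitfor delay
begin tran big_test
    insert into tableX (col_1, col_2, col_3)
        select id, id, 'filler' from some_source_table    -- any statement(s) producing ~500 rows
    waitfor delay "00:03:00"                               -- keep the transaction open
commit tran big_test
go

-- steps 3 and 5 (at the standby): dirty read to watch the rows appear, disappear, then reappear
select count(*) from tableX at isolation read uncommitted
go

-- step 4 (at the primary): an atomic insert into another table that commits implicitly
insert into tableY (col_1) values (1)
go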

As a result, attempts to tune for and allocate large transaction threads will be negated if smaller/other transactions are allowed to run concurrently and commit prior to the large transaction(s). This behavior, coupled with the "early conflict detection" and other logic implemented in large transaction threads to avoid excessive rollbacks, is a very good reason to avoid the temptation - especially in Warm Standby - to reduce dsi_large_xact_size with hopes of improving throughput and reducing latency.

Key Concept #28: Large transaction DSI handling is intended to reduce the double "latency penalty" that waiting for a commit record in the outbound queue introduces in normal replication, as well as the latency and switch-active timing issues associated with Warm Standby. However, it is really only useful when large transactions run in isolation (such as serial batch jobs).

Having said that, large transactions that run concurrently (provided they start in order of commit), such as concurrent purge routines, may be able to execute without the rollback/wait-for-commit behavior. However, concurrent large transactions may not experience the desired behavior, as will be discussed in the next section.

Early Conflict Detection

Another factor of large transactions that the dsi_large_xact_size parameter controls is the timing of early conflict detection. This is stated in the Replication Server Administration manual as "After a certain number of rows (specified by the dsi_large_xact_size parameter), the user thread attempts to select the row for the next thread to commit in order to surface conflicting updates." What this really means is the following. During processing of large transactions, every dsi_large_xact_size rows, the DSI thread attempts to select the sequence number of the thread before it. So, for example, for a large transaction of 1,000 statements (i.e. a bcp of 1,000 rows), the Replication Server would insert an rs_get_thread_seq every 100 rows (assuming dsi_large_xact_size is still the default of 100). By doing this, if there is a situation in which the large transaction is blocking the smaller one, a deadlock is caused, thus "surfacing" the conflict. This is illustrated in the diagram below, in which thread #2 is being blocked by a conflicting insert by thread #3.

BT # = Begin transaction for transaction #
UT # = Update on rs_threads for thread id # (blocks own row)
ST # = Select on rs_threads for thread id # (check for previous thread commit)
CT # = Commit transaction for transaction #

Figure 79 – Early Conflict Detection with large transactions
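To make this concrete, the following is a rough sketch of the SQL stream a large transaction DSI executor might produce. The application table (trade_hist) and values are illustrative only, the rs_threads column names (id, seq) are assumed from the default system table layout, and the exact statements actually come from the rs_update_threads/rs_get_thread_seq function strings:

begin transaction
update rs_threads set seq = seq + 1 where id = 3   -- lock this thread's own rs_threads row (UT3)
insert into trade_hist values (1001, 'IBM', 100)   -- rows 1 .. dsi_large_xact_size of the large transaction
insert into trade_hist values (1002, 'IBM', 200)
select seq from rs_threads where id = 2            -- rs_get_thread_seq: has the previous thread committed? (ST2)
insert into trade_hist values (1003, 'IBM', 300)   -- next dsi_large_xact_size rows
insert into trade_hist values (1004, 'IBM', 400)
select seq from rs_threads where id = 2            -- repeated every dsi_large_xact_size rows
commit transaction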

The reason for this is the extreme expense of rollbacks and the size of large transactions. To put this in perspective, try a large transaction in any database within an explicit transaction and roll it back vs. allowing it to commit. Although performance varies from version to version of ASE as well as with the transaction itself, a normal transaction may take a full order of magnitude longer to roll back than it takes to fully execute (i.e. a transaction with an execute time of 6 minutes may require an hour to roll back). By surfacing the offending conflict earlier rather than later, the rollback time of the large transaction is reduced. This is crucial, as no other transaction activity is re-initiated until all the rollbacks have completed. Consequently, without the periodic check for contention by selecting rs_threads every dsi_large_xact_size rows, a large transaction could carry a significantly large "penalty" (i.e. 900 rows for the bcp example). This is illustrated in the diagram below – a slight modification of the above – with the intermediate rs_threads selects grayed out.


Figure 80 – Possible Rollback Penalty without Early Conflict Detection

Now then, getting back to the point earlier discussed in the previous section – the temptation to reduce dsi_large_xact_size until most transactions qualify – with the goal of reducing latency. To understand why this is a bad idea, consider the following points:

• Large transactions are never grouped. Consequently, this eliminates the benefits of transaction grouping and increases log I/O and rs_lastcommit contention.

• In order to ensure most transactions qualify, dsi_large_xact_size has to be set fairly low (i.e. 10). The problem with this is that every 10 rows, the large DSI threads would block waiting for the other threads to commit. If the average transaction was 20 statements and 5 large transaction threads were used, the first would have all 20 statements executing while the other 4 would execute up to the 10th and block. The lower dsi_large_xact_size is relative to the average transaction size, the greater the performance degradation. By contrast – a serialization method of "none" would let all 5 threads execute up to the 20th statement before blocking.

• The serialization between large transaction threads is essentially none up to the point of the first dsi_large_xact_size rows – since we are not waiting for the commits at all (let alone waiting until they are ready to be sent). If the transactions have considerable contention between them to the extent wait_for_commit would have been a better serialization method, the large transactions could experience considerable rollbacks and retries. After the first dsi_large_xact_size rows, the rs_threads blocking changes the remainder of the large transaction to more of a wait_for_commit serialization.

The last bullet takes a bit of thinking before it can be understood. Let's say we have a novice Replication System Administrator (named Barney) who has diligently read the manuals and taken the class – but didn't test his system with a full transaction load (nothing abnormal here – in fact, it is a rarity – and a shame – these days that few if any large IT organizations stress test their applications or even have such a capability). However, being a "daring" individual, Barney decides to capitalize on the large transaction advantage of reading from the SQT Open queue and sets dsi_num_threads to 5, dsi_num_large_xact_threads to 4 and finally dsi_large_xact_size to 5 (his average number of SQL statements sent from the application – a web order entry system). Now then, let's assume that due to triggered updates for shipping costs, inventory tracking, customer profile updates, etc., the 5 SQL statements expand to a total of 12 statements per transaction (not at all hard). What Barney assumes he is getting looks similar to the following:


Figure 81 – Wishful Concurrent Large Transaction DSI Threads

The expectation: everything is done at T05. What Barney actually gets is more like:

(Threads 3, 4, and 5 are each shown blocked by the preceding thread.)

Figure 82 – Real Life Concurrent Large Transaction DSI Threads

This illustrates how the first dsi_large_xact_size rows are similar to a serialization method of “none” while those statements after transition to more of a wait_for_commit. By the way, consider the impact if the last statement in thread 4 conflicts with one of the first rows in thread 5. A rollback at T12.

Now, the unbeliever would be quick to say that dsi_large_xact_size could be increased to exactly the number of rows in the transaction (i.e. 12), at which point we would really have the execution timings in the earlier figure. Possibly – but that would be really hard, as the number of statements in a transaction is not constant. Remember, however – we have now lost transaction grouping, introduced a high probability of contention/rollbacks, and increased the load on rs_lastcommit and the replicate transaction log – all for very little gain in latency for smaller transactions. While not denying that in some very rare instances of Warm Standby – with a perfectly static transaction size and no contention between threads – this type of implementation might help a small amount, the reality is that it is highly improbable, especially given the concurrent-transaction-induced rollback discussed earlier.

Thread Allocation

A little known and undocumented fact is that dsi_num_large_xact_threads are reserved out of dsi_num_threads exclusively for large transactions. That means only 3 threads are available for processing normal transactions if you set the default connection parameter of “parallel_dsi” to “on” without adjusting any of the other parameters (parallel_dsi “on” sets dsi_num_threads to 5 and dsi_num_large_xact_threads to 2 – leaving only 3 threads for normal transactions of <100 rows (at default)). This can surprise some administrators – who in checking their replicate dataserver –


discover that “only” a few of the configured threads are active. Combining this with the previous topic yields another key to understanding Parallel DSI’s:

Key Concept #29: For most systems, it is extremely doubtful that more than 2 large transaction threads will improve performance. In addition, since large transaction threads are “reserved”, increasing the number of large transaction threads may require increasing the total number of threads to avoid impacting (small) normal transaction delivery rates.

Maximizing Performance with Parallel DSI’s

By now, you have enough information to understand why the default settings for the parallel_dsi connection parameter are what they are in respect to threading – and why this may not be the most optimal. Consider the following review of points from above:

• In keeping with Replication Server’s driving philosophy of maximizing resilience, the default serialization method is “wait_for_commit” as this minimizes the risk of inter-thread contention causing significant rollbacks.

• When using the “wait_for_commit” serialization method, only 3 Parallel DSI’s will be effective. Using more than this number will not bring any additional benefit.

• For most large transactions – due to the early conflict detection algorithm – no more than 2 large transaction threads will be effective. After this point, no more benefit will be realized as the next large transaction could reuse the first thread.

However, this may not be even close to optimal, as the assumption is that there will be significant contention between the Parallel DSI's and that the large transactions are significantly larger than the dsi_large_xact_size setting. If this is not true for your application (typically the case), then the default "parallel_dsi" settings are inadequate. To determine the optimal settings, you need to understand the profile of the transactions you are replicating, eliminate any replication or system induced contention at the replicate, and develop Parallel DSI profiles of settings corresponding to the transaction profile during each part of the business day.

Parallel DSI Contention

Wait_for_start serialization method provides some of the greatest scalability – the more DSI’s involved, the higher the throughput. However, it also means a higher probability of contention causing rollback of a significant number of transactions (remember, if one rolls back, the rest do as well). Remember – the threads are already blocked on each other’s rows in rs_threads – deliberately – to ensure commit order is maintained. Any contention between threads, then, is more than likely going to cause a deadlock. Consider the following illustration.

BT # = Begin Tran for thread # - marks beginning of transaction group
CT # = Commit Tran for thread # - marks end of transaction group
(The figure shows each thread's inserts/updates against Tables A-C between its begin and commit, the normal thread sequencing blocks on rs_threads, and the inter-thread contention on the transactions within the groups that produces the deadlocks.)

Figure 83 – Deadlocking between Parallel DSI's with serialization method of "none"


In the example above, two deadlocks exist – threads 1 & 2 are deadlocked since thread 2 is waiting on thread 1 to commit as normal (via rs_threads), yet thread 2 started processing its update on table B prior to thread 1 (assuming the same row, hence the contention). As a result, #2 is waiting on #1 and #1 is waiting on #2 – a classic deadlock. Threads 3 & 4 are similarly deadlocked. Interestingly enough, one of the more frequent tables "blamed" for deadlocks in replicated environments is the rs_threads table. As you can see – this is rather deliberate. Consequently, deadlocks involving rs_threads should not be viewed as contention issues with rs_threads, but rather as an indication of contention between the transactions the DSI's were applying. An easy way to find the offenders is to turn on the "print deadlock information" configuration parameter in the dataserver using sp_configure and simply ignore the pages for the object id/table rs_threads.
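For example, a minimal sketch (deadlock details are then written to the ASE errorlog):

-- Enable deadlock reporting in the replicate ASE; details go to the errorlog
sp_configure 'print deadlock information', 1
go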

The biggest problem with this is that once one thread rolls back (the typical response to a deadlock), all the subsequent threads will roll back as well. In order to prevent the contention from continuing and causing the same problems all over again, the Replication Server will retry the remaining transactions serially (one batch at a time) before resuming parallel operations. Obviously, a rollback followed by serial transaction delivery will cause performance degradation if it happens frequently enough. However, a small number of occurrences is probably not a problem. During a benchmark at a customer site, using the default wait_for_commit resulted in the inbound queue rapidly getting one hour behind the primary bcp transaction. Switching to "none" drained the queue in 30 minutes as well as keeping up with new records. During these 30 minutes, the Replication Server encountered 3 rollbacks per minute – ordinarily excessive, but in this case the serialization method of "none" was outperforming the default choice. However, at another customer site, a parallel transaction failed every 3-4 seconds – and no performance gain was noted in using "none" over "wait_for_commit". As usual, this illustrates the point that no one-size-fits-all approach to performance tuning works and that each situation brings its own unique problem set.

While the book states that "This method assumes that your application is designed to avoid conflicting updates, or that lock protection is built into your database system.", this is not as difficult to achieve as you might think. Basically, if you do not have a lot of contention at the primary, then contention at the replicate may be a direct result of system tuning settings at the replicate DBMS and not of the transactions themselves. If the contention is system induced, you need to first determine the type of contention involved – last page, index, or row level. Consider the following matrix of contention and possible resolutions.

Contention Possible Resolution(s)

Last page contention Change clustered index, partition table or use datarow locking.

Index contention Use datapage locking or reduce dsi_max_xacts_in_group

Row contention Reduce dsi_max_xacts_in_group until contention reduced

Note that nowhere in the above did we suggest changing the serialization method to “wait_for_commit”. If the problem is system induced as compared to the primary – yes, wait_for_commit will resolve it – however, the impact on throughput can be severe. In almost any system, a serialization method of “none” should be the goal. Backing off from that goal too quickly when other options exist could have a large impact on the ability of Replication Server to achieve the desired throughput. Keep in mind that even 2 threads running completely in parallel with a serialization of “none” may be better than 5 or 6 using “wait_for_commit”.
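As a sketch of the first two resolutions in the matrix above (the table name trade_hist and the connection name RDS.rdb are illustrative only):

-- At the replicate ASE: move the hot table to datarows locking to relieve last page contention
alter table trade_hist lock datarows
go

-- At the Replication Server: reduce the transaction group size for the connection
suspend connection to RDS.rdb
go
alter connection to RDS.rdb set dsi_max_xacts_in_group to '5'
go
resume connection to RDS.rdb
go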

Understanding Replicated Transaction Profile

In order to determine if the contention at the replicate (if there is any) is due to replication or schema induced contention, you need to develop a sense of the transaction profile being executed at the primary during each part of the business day. Consider the following fictitious profile:

Transaction Type Time Range Volume (tpd) Leading Contention Cause

Execute Trade OLTP 0830 - 1630 500,000 get next trade id

Place Order OLTP 0700 - 1900 750,000 get next order id

Adjust Stock Price OLTP 0830 - 1630 625,000 place order read

401K Deposit Batch 0500 – 0700 125,000* mutual fund balance

Money Market Deposit OLTP 0900 - 1700 1,000 central fund update

Money Market Check Batch 1800 - 2200 750 central fund withdrawal

Close Market Batch 1700 - 1930 1 isolation level 3 aggregation


Purge Historical Batch 2300 - 2359 1 Index maintenance

* Normalized for surge occurring on regular periodic basis

Note that the first two OLTP transactions have a monotonic key contention issue. When replicating this transaction, the id value will be known, therefore, this will not cause contention at the replicate. Accordingly, we would be most interested in what the second leading cause of contention is, however, we may not be able to determine that as the first one may be masking it.

Also, in the above list of sample transactions, some of the OLTP transactions not only affect individual rows of data representing one type of business object (such as customer account) – but they also affect either an aggregate (central fund balance) or other business object data. The contention could be on the latter. For example, each individual 401K pay deposit affects the individual investor’s account. In addition, it also adjusts their particular fund’s pool of receipts with which the fund manager uses for reinvestment. It is the activity against the fund data that could be the source of contention and not the individual account data.

Resolving Parallel DSI Contention

Figure 84 – Parallel DSI Contention Monitoring via MDA Tables

In the above set of tables, if the database is only being used by the maintenance user, monOpenObjectActivity can provide a fairly clear indication of which tables are causing contention among the parallel DSI’s by monitoring the monOpenObjectActivity.LockWaits column. If the transaction profile is not well understood, the RowsInserted,


RowsDeleted, and RowsUpdated also can provide a sense of what is going on at a more table/index level perspective than the RS DSI monitor counters. A few general tips are

DSI Partitioning

Having understood where the contention occurs at the primary, you then have to look at where contention is at the replicate. It is unfortunate, but in almost every case in which a customer has called Sybase Support with Replication Server performance issues, few have bothered to investigate if and where contention is the cause. This is especially true in Warm Standby scenarios in which the replicate system is only updated by the Replication Server (and a serialization method of "none" is being attempted). Additionally, in the few cases where the administrators have been brave enough to attempt the "none" serialization method, as soon as the first error occurs stating that a parallel transaction failed and had to be retried in serial, the immediate response is to switch back to wait_for_commit rather than eliminating the contention – or even determining whether that level of contention is acceptable. In one example of the latter, during a bulk load test in a large database, the queue got 1GB behind after 1 hour using "wait_for_commit". After switching to "none", the queue was fully caught up in 30 minutes. However, during that period, approximately 3 parallel transactions failed per minute and were retried in serial. The trade-off was considered more than acceptable – 90 errors and an empty queue vs. no errors and a 1GB backlog. Just think though – if you were able to eliminate the contention that caused even 50% of the failures, the number of additional transactions per minute would be at least equivalent to the number of DSI's. For example, in this case, 10 DSI's were in use. This means an extra 15 transactions (3 * 0.50 * 10) could have been applied per minute – or 450 transactions during that time. And this is an extremely low estimate, as we have not included the time it took to reapply the transactions in serial – during which the system could still have been applying transactions in parallel.

Which brings us back to the point – how can we eliminate the contention at the replicate? The answer is (of course) it all depends on what the source of contention is – is it contention introduced as a result of replication or contention between replication and other users.

Replication Induced Contention

As discussed earlier, replication itself can induce contention – frequently resulting in the decision to use suboptimal Parallel DSI serialization methods. For normal transactions, a serialization method of "none" will achieve the highest throughput. The goal is to eliminate any replication induced contention that is preventing use of "none" and then to assess whether the level of parallel transaction retries is acceptable. As discussed earlier, the main cause of contention directly attributable to replication is transaction grouping. Transaction grouping is a good feature; however, at its default of 20 transactions per group, it can frequently lead to contention at the replicate that didn't exist at the primary. The easiest way to resolve this is to simply reduce the dsi_max_xacts_in_group parameter until most of the contention is resolved. A possible strategy is to simply halve dsi_max_xacts_in_group repeatedly until the replication induced contention is nearly eliminated. While it is theoretically possible to eliminate all replication-induced contention caused by transaction grouping in this manner, there is a definite tradeoff between eliminating transaction grouping (with the associated increase in log and I/O activity) and accepting a limited amount of contention. This means you will need to be willing to accept some degree of parallel transactions failing and being retried. If you remember, in an earlier section we mentioned that in one system, replication got 1GB behind using "wait_for_commit". By switching to "none", Replication Server not only was able to keep up, it was able to fully drain the 1GB backlog in less than 30 minutes. During that time, however, an average of 3 parallel transactions per minute failed and were retried. This was completely acceptable considering the relative gain in performance.

Concurrency Induced Contention

In a sense, the transaction grouping is a form of concurrency that is causing contention. In addition to transaction grouping, the mere fact that Parallel DSI’s are involved means that the individual Parallel DSI’s could experience contention between them as well as with other users on the system. Possible areas of contention include:

• Replication to aggregate rollups in which many source transactions are all attempting to update the same aggregate row (i.e. total sales) in the destination database.

• DML applied serially at source that is being applied in parallel at replicate in which contention exists. For example, a (slow) bcp at primary does not have any contention. However, if the bcp specified a batch size (using –b), then the Replication Server may send the individual batches using Parallel DSI’s. The result is last page contention or index contention at the replicate.

• Replicated transactions that had contention at the primary.

• Transactions that have contention at the replicate due to the timing of delivery, where at the primary no contention existed due to different timings. The timing difference could be the result of Replication Server component availability (i.e. the Replication Agent was down) or of long running transactions at the replicate delaying the first transaction until the conflicting transaction was also ready to go (i.e. a long running procedure at the replicate would delay further transactions).


How and if this contention could be eliminated depends on the type of contention. For example, where contention exists at index or page level for data tables, but not on the same rows, changing the replicate system to use datapage or datarow locking may bring relief.

Finding Contention using MDA Monitoring Tables

In ASE 12.5.0.3 and later, Sybase provides system monitoring tables via the Monitoring and Diagnostics API (MDA). As a result, these are often referred to as the MDA tables, but technically they are known as the "monitoring tables". These tables are actually proxy tables that interface to the MDA via standard Sybase RPC calls. In order to determine where intra-parallel DSI contention is originating, you mainly need to look at five of these tables:

Monitoring Table Information Recorded

monLocks Records the current process lock information

monProcess Records information about currently executing processes

monProcessSQLText Records the SQL for currently executing processes

monSysStatement Records previously executed statement statistics

monSysSQLText Records previously executed SQL statements

The relationship between these tables is depicted below:

(The monitoring tables join on SPID and KPID; monProcess.BlockingSPID identifies the blocking process, and the SQL text tables relate to monSysStatement via BatchID and LineNumber.)

Figure 85 – MDA-based Monitoring Tables Useful for Identifying Contention

The difficult aspect of using monitoring tables is remembering which of the tables contain currently executing information and which contain previously executed statements. This is important since once a SQL statement is done executing, the information about that statement will be only available in the monSys* tables vs. monProcess* -


however, the statement may still be holding locks. Consequently, if there is contention, the blocked statement will still be executing and in the monProcess* tables, while the statement(s) that caused the contention may be either in monProcess* (if still executing such as long running procedure or if blocked itself) or in monSys* if the statement has finished executing but the transaction has not yet committed.

Another aspect of the monitoring tables is that some of them are meant to be queried – by multiple users simultaneously – to build historical trending systems. Classic examples of this are monSysStatement and monSysSQLText. The first time you query these tables, they return all the rows that the pipe contains. Subsequent queries will only return rows that have not previously been returned to your connection. Consequently, if two different users are querying the monSysStatement table at different times, the proper rows will be returned to each.

Note that, as mentioned earlier, you should start with the monOpenObjectActivity table (specifically the LockWaits column). If RS is the only user in the system, that may be all that is necessary. If not, the next step is to enable statement monitoring. With statement monitoring, rather than looking at monLocks, you would actually track monProcessSQLText and monProcess. The technique is fairly simple - rapidly poll the tables (frequently enough to get an idea of the contention). Then, by using the monProcess.BlockingSPID column, you can identify both the blocked and blocking users along with their SQL statements at the time via monProcessSQLText. You can also review the past statements from monSysSQLText, as well as look for statements with WaitTime in monSysStatement, as indicators of where contention might exist.
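A minimal sketch of both checks is shown below; the replicate database name rdb is illustrative, and the queries assume the MDA proxy tables are installed and the relevant monitoring/SQL text options are enabled:

-- 1. Quick first check: which tables at the replicate are accumulating lock waits
select ObjectName, LockWaits, RowsInserted, RowsUpdated, RowsDeleted
from master..monOpenObjectActivity
where DBID = db_id('rdb')
  and LockWaits > 0
order by LockWaits desc
go

-- 2. Who is blocked, who is blocking them, and the SQL the blocked process is running
select p.SPID, p.BlockingSPID, t.LineNumber, t.SQLText
from master..monProcess p, master..monProcessSQLText t
where p.SPID = t.SPID
  and p.KPID = t.KPID
  and p.BlockingSPID > 0
order by p.SPID, t.LineNumber, t.SequenceInLine
go
-- Re-run monProcessSQLText for the BlockingSPID values to see the blocker's SQL; if that statement
-- has already finished (but its transaction has not committed), look in monSysSQLText instead.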

Parallel DSI Configuration vs. Actual

In a sense, the dsi_num_threads configuration parameter is a "limiter" – the maximum number of DSI threads that the Replication Server can use for a connection. Executing admin who, of course, will list all of the parallel DSI thread processes within the Replication Server. However, a check of sp_who <maint_user> may show as few as two connections, or may show the full number of connections while monitoring shows that only a few of them are actually active. Basically, after each batch is sent to a thread, the RS checks to see if thread seq #1 is available again. If so, it simply sends the next batch of SQL back to thread seq #1 instead of the next thread in sequence.

This phenomenon can be controlled loosely by adjusting dsi_max_xacts_in_group as well as dsi_xact_group_size. If the transactions are fairly fast (i.e. atomic inserts) and both are set fairly small (i.e. 3 and 2048 respectively), by the time the DSI Scheduler dispatches the second batch to the second DSI, the first will be available again – and will be re-used. By setting them to higher values – such as 20 and 65536 – it may take more time for the larger transaction groups to commit, and the RS may use the full complement of DSI threads configured.

Developing Parallel DSI Profiles

Similar to managing named data caches in Adaptive Server Enterprise, you may have to establish DSI profiles to manage replication performance during different periods of activity. Consider the following table of example settings:

Profile | dsi_serialization_method | dsi_num_threads | dsi_num_large_xact_threads | dsi_large_xact_size | dsi_max_xacts_in_group
normal daily activity | None | 10 | 1 | 1000 | 5
post-daily processing | wait_for_commit | 5 | 2 | 100 | 30
bcp data load | None | 5 | 0 | 1000 | -1
bcp of large text data | wait_for_commit | 3 | 2 | 100 | 5

Developing a similar profile for your replication environment will enable the Replication Server to avoid potentially inhibitive deadlocks and retries during long transactions such as large bcp's and the high-incidence SQL statements typical of post-daily processing routines. For small and large bcp loads, however, remember to use the -b option to break up potentially queue-filling bulk loads of data.
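As a sketch, switching a connection to the "normal daily activity" profile above might look like the following (the connection name RDS.rdb is illustrative; the parameters are standard Replication Server connection parameters applied via RCL):

suspend connection to RDS.rdb
go
alter connection to RDS.rdb set dsi_serialization_method to 'none'
go
alter connection to RDS.rdb set dsi_num_threads to '10'
go
alter connection to RDS.rdb set dsi_num_large_xact_threads to '1'
go
alter connection to RDS.rdb set dsi_large_xact_size to '1000'
go
alter connection to RDS.rdb set dsi_max_xacts_in_group to '5'
go
resume connection to RDS.rdb
go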

Key Concept #30: Maximum performance using Parallel DSI’s can only be achieved after replication and concurrency caused contention is eliminated and DSI profiles (based on the transaction profile) are developed to minimize contention between Parallel DSI’s.


Tuning Parallel DSI’s with Monitor Counters

Tuning parallel DSI's with monitor counters really boils down to maximizing parallelism while decreasing contention. First, let's start by taking a look at which counters are available that might be of use when tuning parallel DSI's.

Parallel DSI Monitor Counters

While some of the same counters are used for parallel DSI tuning as with regular DSI tuning, the object is to see whether the aggregate numbers for these counters are higher than with a single DSI. In addition, there are a few counters that relate specifically to Parallel DSI tuning. Overall, the counters to watch are listed here (note that for this section we will only be reporting RS 15.0 counters):

Monitor Counter Description

DSITranGroupsSent Transaction groups sent to the target by a DSI thread. A transaction group can contain at most dsi_max_xacts_in_group transactions. This counter is incremented each time a 'begin' for a grouped transaction is executed.

DSITransUngroupedSent Transactions contained in transaction groups sent by a DSI thread.

DSITranGroupsSucceeded Transaction groups applied successfully to a target database by a DSI thread. This includes transactions that were successfully committed or rolled back according to their final disposition.

DSITransFailed Grouped transactions failed by a DSI thread. Depending on error mapping, some transactions may be written into the exceptions log.

RollbacksInCmdGroup Transactions in groups sent by a DSI thread that rolled back successfully.

AllThreadsInUse This counter is incremented each time a Parallel Transaction must wait because there are no available parallel DSI threads.

AllLargeThreadsInUse This counter is incremented each time a Large Parallel Transaction must wait because there are no available parallel DSI threads.

ExecsCheckThrdLock Invocations of rs_dsi_check_thread_lock by a DSI thread. This function checks for locks held by a transaction that may cause a deadlock.

TrueCheckThrdLock Number of rs_dsi_check_thread_lock invocations returning true. The function determined the calling thread holds locks required by other threads. A rollback and retry occurred.

CommitChecksExceeded Number of times transactions exceeded the maximum allowed executions of rs_dsi_check_thread_lock specified by parameter dsi_commit_check_locks_max. A rollback occurred.

GroupsClosedTrans Transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_max_xacts_in_group.

DSIFindRGrpTime Time spent by the DSI/S finding a group to dispatch.

DSIPrcSpclTime Time spent by the DSI/S determining if a transaction is special, and executing it if it is.

DSIDisptchRegTime Time spent by the DSI/S dispatching a regular transaction group to a DSI/E.

DSIDisptchLrgTime Time spent by the DSI/S dispatching a large transaction group to a DSI/E. This includes time spent finding a large group to dispatch.

DSIPutToSleep Number of DSI/E threads put to sleep by the DSI/S prior to loading SQT cache. These DSI/E threads have just completed their transaction.

DSIPutToSleepTime Time spent by the DSI/S putting free DSI/E threads to sleep.

DSILoadCacheTime Time spent by the DSI/S loading SQT cache.

DSIThrdRdyMsg ''Thread Ready'' messages received by a DSI/S thread from its associated DSI/E threads.


DSIThrdCmmtMsgTime Time spent by the DSI/S handling a ''Thread Commit'' message from its associated DSI/E threads.

DSIThrdSRlbkMsgTime Time spent by the DSI/S handling a ''Thread Single Rollback'' message from its associated DSI/E threads.

DSIThrdRlbkMsgTime Time spent by the DSI/S handling a ''Thread Rollback'' message from its associated DSI/E threads.

Some of these have been discussed before - but in the following sections we will be taking a closer look to see how they work in parallel DSI environments.
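For reference, when counter statistics are being flushed to the RSSD, the observations can be pulled back out with a query along the following lines. This is a sketch only – it assumes the RS 15.0 RSSD tables rs_statcounters and rs_statdetail, and the exact column names may vary by version:

-- Pull the observations for a couple of the parallel DSI counters from the RSSD (sketch)
select c.counter_name, d.instance_id, d.counter_obs
from rs_statcounters c, rs_statdetail d
where c.counter_id = d.counter_id
  and c.counter_name in ('AllThreadsInUse', 'DSIThrdRlbkMsgTime')
order by c.counter_name, d.instance_id
go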

Maximizing Parallelism

Obviously, the first step in maximizing parallelism is to use a dsi_serialization_method of "wait_for_start" and to disable large transaction threads. Then it becomes a progression of finding the right balance of the number of threads, the group size and the partitioning rules to effectively use the parallel threads. The secret is to start with a reasonable number of threads based on the transaction profile and then either increase the number of threads and/or adjust the transaction group size to keep all of them busy. A couple of key points:

• You can have too many threads - in which case some are not being used.

• If the group size is too small, the number of threads you can effectively use will be reduced.

With respect to the first comment, if you remember, the DSI-S will re-use a thread that has just become free before dispatching to a new/idle thread. One key counter to look at for this is the DSI counter AllThreadsInUse. If DSI.AllThreadsInUse=0, then it is unlikely that adding threads will help. However, if this has any value, then looking closer at the load balancing of transaction groups and commands sent on each of the individual threads will give a good idea of whether adding threads will help. Let's take a look at a high-end trading stress test:

Sample Time | Cmd Groups | Trans In Groups | Trans Succeed | Trans Fail | RollBacks | AllThreadsBusy | Check Locks | Checks True | Checks Exceeded | RS Threads | PartnWaits

14:56:11 2 2 2 0 0 0 0 0 0 0 0

14:56:48 643 1241 637 0 0 216 0 0 0 7 0

14:57:29 1796 3447 1803 0 0 562 0 0 0 0 0

14:58:10 1817 3508 1814 0 0 596 0 0 0 0 0

14:58:57 3212 6391 3207 0 0 1907 0 0 0 0 0

14:59:38 2056 4066 2039 0 0 998 0 0 0 23 0

For now, we will only look at the AllThreadsBusy column (highlighted). Before the test began, this was 0 - and this actually may be the case under normal circumstances. The key is to look at the value during peak processing. As we can see, processing started around 14:56 and peaked about 14:58. If we look at the load distribution during this period we would see something similar to the following (dsi_num_threads=7; dsi_max_xacts_in_group=2 - due to contention discussed later):

Sample Time | DSI Thread | Trans Groups | NgTrans | Xact Per Grp | Cmds Applied | Cmds Per Sec | Inserts | Updates | Deletes | DML Stmts | DML Stmts Sec

14:56:48 1 85 165 1.9 666 18 1 332 0 333 9

14:56:48 2 93 183 1.9 739 20 0 370 0 370 10

14:56:48 3 104 189 1.8 763 21 0 382 0 382 10


14:56:48 4 86 168 1.9 679 18 0 340 0 340 9

14:56:48 5 91 179 1.9 721 20 0 361 0 361 10

14:56:48 6 88 167 1.8 675 18 0 338 0 338 9

14:56:48 7 90 178 1.9 718 19 0 359 0 359 9

14:57:29 1 253 485 1.9 1940 47 0 970 0 970 23

14:57:29 2 251 483 1.9 1932 47 0 966 0 966 23

14:57:29 3 258 486 1.8 1944 47 0 972 0 972 23

14:57:29 4 262 509 1.9 2036 49 0 1018 0 1018 24

14:57:29 5 255 486 1.9 1950 47 0 975 0 975 23

14:57:29 6 259 502 1.9 2008 48 0 1004 0 1004 24

14:57:29 7 258 496 1.9 1989 48 0 994 0 994 24

14:58:10 1 258 502 1.9 2005 48 1 1001 0 1002 24

14:58:10 2 265 512 1.9 2052 50 2 1023 0 1025 25

14:58:10 3 250 483 1.9 1926 46 2 960 0 962 23

14:58:10 4 258 498 1.9 1990 48 1 994 0 995 24

14:58:10 5 256 500 1.9 1994 48 1 993 0 994 24

14:58:10 6 262 502 1.9 2004 48 1 1001 0 1002 24

14:58:10 7 268 511 1.9 2042 49 0 1021 0 1021 24

14:58:57 1 458 911 1.9 3040 66 599 622 0 1221 26

14:58:57 2 463 919 1.9 3056 66 614 606 0 1220 26

14:58:57 3 458 915 1.9 3069 66 590 648 0 1238 26

14:58:57 4 457 911 1.9 3031 65 613 596 0 1209 26

14:58:57 5 462 921 1.9 3084 67 593 652 0 1245 27

14:58:57 6 458 910 1.9 3039 66 601 618 0 1219 26

14:58:57 7 458 910 1.9 3038 66 595 626 0 1221 26

14:59:38 1 287 574 2 1938 47 354 436 0 790 19

14:59:38 2 288 576 2 1954 47 343 462 0 805 19

14:59:38 3 298 584 1.9 1996 48 329 504 0 833 20

14:59:38 4 302 590 1.9 2026 49 327 522 0 849 20

14:59:38 5 289 578 2 1955 47 356 443 0 799 19

14:59:38 6 312 604 1.9 2060 50 347 509 0 856 20

14:59:38 7 286 574 2 1955 47 331 480 0 811 19

Looking at the workload distribution during the peak period, we see that the number of transaction groups/transactions is extremely balanced. This gives us an indication that adding additional threads during this time frame would increase throughput. It is extremely interesting to note that the transaction profile starts as almost exclusively updates and then


becomes an even balance of inserts/updates. Let's bump dsi_num_threads to 10 and also increase dsi_max_xacts_in_group to 3, since the load is so evenly balanced and the AllThreadsBusy count is so high. This increase is a bit cautious as we are dealing with updates which have been experiencing contention.

Sample Time | Cmd Groups | Trans In Groups | Trans Succeed | Trans Fail | RollBacks | AllThreadsBusy | DSI Yields | Check Locks | Checks True | Checks Exceeded | RS Threads Fail | PartnWaits

19:30:43 6 7 6 0 0 0 0 0 0 0 0 0

19:31:18 306 794 306 0 0 43 0 0 0 0 0 0

19:32:01 904 2362 900 0 0 79 0 0 0 0 0 0

19:32:45 2252 6334 2253 0 0 1248 0 0 0 0 9 0

19:33:32 2154 6306 2128 0 0 1918 0 0 0 0 26 0

19:34:17 16 16 16 0 0 0 0 0 0 0 0 0

We can see that the bulk of the processing was accomplished in ~1.5 minutes (from 19:32 to 19:33:32) and only ~2 minutes overall. This is a bit better than the first run which took about 3 minutes overall and processing was distributed over all three minutes. Looking at one reason why, we see immediately that at peak we were processing 6,300 original transactions each ~40 second interval whereas in the first run we were only accomplishing mostly 3,000 transactions with a peak of 6,300. Looking at the load distribution gives us a better idea why.

Sample Time | DSI Thread | Trans Groups | NgTrans | Xact Per Grp | Cmds Applied | Cmds Per Sec | Inserts | Updates | Deletes | DML Stmts | DML Stmts Sec

19:31:18 1 39 97 2.4 388 11 0 194 0 194 5

19:31:18 2 22 56 2.5 223 6 1 110 0 111 3

19:31:18 3 32 82 2.5 328 9 0 164 0 164 4

19:31:18 4 32 88 2.7 352 10 0 176 0 176 5

19:31:18 5 27 74 2.7 296 8 0 148 0 148 4

19:31:18 6 44 107 2.4 421 12 6 200 0 206 5

19:31:18 7 26 65 2.5 260 7 0 130 0 130 3

19:31:18 8 24 62 2.5 248 7 0 124 0 124 3

19:31:18 9 27 73 2.7 292 8 0 146 0 146 4

19:31:18 10 33 90 2.7 360 10 0 180 0 180 5

19:32:01 1 89 228 2.5 912 21 0 456 0 456 10

19:32:01 2 98 253 2.5 1011 23 0 506 0 506 11

19:32:01 3 76 186 2.4 744 17 0 372 0 372 8

19:32:01 4 90 241 2.6 962 22 2 478 0 480 11

19:32:01 5 101 255 2.5 1010 23 1 504 0 505 11

19:32:01 6 86 233 2.7 932 21 0 466 0 466 10

19:32:01 7 108 290 2.6 1160 26 0 580 0 580 13

19:32:01 8 76 206 2.7 824 19 0 412 0 412 9


19:32:01 9 97 253 2.6 1009 23 1 503 0 504 11

19:32:01 10 83 217 2.6 859 19 0 429 0 429 9

19:32:45 1 220 619 2.8 2472 56 2 1232 0 1234 28

19:32:45 2 224 630 2.8 2518 57 1 1257 0 1258 28

19:32:45 3 219 622 2.8 2484 56 3 1237 0 1240 28

19:32:45 4 248 640 2.5 2558 58 0 1277 0 1277 29

19:32:45 5 229 647 2.8 2577 58 1 1286 0 1287 29

19:32:45 6 222 643 2.8 2567 58 1 1278 0 1279 29

19:32:45 7 228 644 2.8 2564 58 1 1280 0 1281 29

19:32:45 8 216 620 2.8 2470 56 0 1234 0 1234 28

19:32:45 9 226 644 2.8 2575 58 0 1287 0 1287 29

19:32:45 10 230 658 2.8 2622 59 1 1309 0 1310 29

19:33:32 1 209 627 3 2486 54 18 1216 0 1234 26

19:33:32 2 206 621 3 2465 53 16 1208 0 1224 26

19:33:32 3 234 648 2.7 2570 55 27 1244 0 1271 27

19:33:32 4 206 618 3 2448 53 23 1188 0 1211 26

19:33:32 5 207 621 3 2471 53 19 1207 0 1226 26

19:33:32 6 235 651 2.7 2587 56 18 1267 0 1285 27

19:33:32 7 209 627 3 2493 54 23 1212 0 1235 26

19:33:32 8 209 627 3 2484 54 18 1213 0 1231 26

19:33:32 9 231 645 2.7 2561 55 21 1250 0 1271 27

19:33:32 10 206 621 3 2441 53 31 1175 0 1206 26

Again, the load is fairly balanced during peak processing. The question is whether or not this was indeed a better configuration. The easiest way to tell the difference (besides aggregating across sample periods) is that during the peak processing, this run is steadily in the high 20’s of DML Statements per Second while the first run was primarily in the mid-20’s. However, the transaction mix is a bit different as well as the number of inserts vs. updates in the latter part are significantly different. The problem was that during the last part of the processing (when a few inserts were occurring), the number of parallel DSI failures was nearly 1 every 2 seconds. For this application, it turns out the optimal mix was 9 threads and a group size of 3. However, the same application also executes transactions (mainly inserts) against another database. Since the transactions were nearly exclusively inserts, the DSI profile was 20 threads and a group size of 20.

Controlling Contention

In the above section we mentioned that parallel DSI contention was occurring nearly once every 2 seconds. The most common way to spot contention is to review the errorlog and look for the familiar message stating that a parallel transaction has failed and is being retried serially. However, if looking back over time using just the monitor counters, you may no longer have access to the historical errorlogs. Additionally, even when it is happening, keeping track of all the error messages to determine the relative frequency can be an inexact science.

This is spotted by looking at one of two possible sets of counters - depending on whether Commit Control is used or rs_threads. If Commit Control is used, the answer is fairly obvious - simply look for TrueCheckThrdLock and


CommitChecksExceeded - which are recorded as ChecksTrue and ChecksExceeded in the spreadsheet below. However, in this case we were not using Commit Control. In that case, remembering a bit about how parallel DSI's communicate (and with some experimentation), we determine that in RS 15.0 the DSI counter DSIThrdRlbkMsgTime (specifically the counter_obs column) will tell us how often the DSI had to roll back transactions due to parallel DSI contention. Repeating the last run's spreadsheet from above:

Sample Time | Cmd Groups | Trans In Groups | Trans Succeed | Trans Fail | RollBacks | AllThreadsBusy | DSI Yields | Check Locks | Checks True | Checks Exceeded | RS Threads Fail | PartnWaits

19:30:43 6 7 6 0 0 0 0 0 0 0 0 0

19:31:18 306 794 306 0 0 43 0 0 0 0 0 0

19:32:01 904 2362 900 0 0 79 0 0 0 0 0 0

19:32:45 2252 6334 2253 0 0 1248 0 0 0 0 9 0

19:33:32 2154 6306 2128 0 0 1918 0 0 0 0 26 0

19:34:17 16 16 16 0 0 0 0 0 0 0 0 0

As we can see, once the inserts start, contention immediately spikes. Possible causes include:

• The inserts may be firing a trigger which can be causing the contention.

• The inserts may be causing conflicting locks (either due to range/infinity locks or similar if isolation level 3 is in effect).

• The updates may have shifted to another table and may be the cause of the contention - possibly even updates to the same rows (such as updates to aggregate values).

Only further analysis using the MDA tables could tell us what tables are involved in the contention. Note that the key activity here is to try to reduce the contention - the suggested order to use is:

1. First, within the DBMS (i.e. change to datarows locking, optimize trigger code, etc.)

2. If this is not possible, decrease the grouping

3. Then try DSI partitioning

4. Finally, reduce the number of parallel DSI's (as a last resort)

DSI Partitioning

In RS 12.6, one of the new features that was added to help control contention was the concept of DSI partitioning. Currently, the way DSI partitioning works is that the DBA can specify the criteria for partitioning among such elements as time, origin, origin session id (aka spid), user, transaction name, etc. During the grouping process, the DSI scheduler compares each transaction’s partition key to the next. If they are the same, they are processed serially - if possible, grouped within the same transaction group. If they are different, the DSI scheduler assumes that there is no application conflict between the two and allows them to be submitted in parallel. If the transaction group needs to be closed due to group size and the next transaction has the same partition key value, then that thread is executed as if the dsi_serialization_method was wait_for_commit (and subsequent threads are also held until it starts).

Note that this feature was specifically aimed at cases in which RS introduces contention – either by executing transactions in parallel on different connections that were originally submitted serially on the same connection, or between transactions that simply didn't have any contention at the primary due to timing. As a result, the recommended starting point is "none" for dsi_partitioning_rule. However, if contention exists and it can't be eliminated, and reducing the group size doesn't help, a good starting point is to set dsi_partitioning_rule to origin_sessid or to the compound rule of 'origin_sessid, time'.
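A sketch of applying the compound rule (the connection name RDS.rdb is illustrative):

suspend connection to RDS.rdb
go
alter connection to RDS.rdb set dsi_partitioning_rule to 'origin_sessid, time'
go
resume connection to RDS.rdb
go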

Once implemented, you will need to carefully monitor the DSI counters PartitioningWaits and in particular the counters for the respective partitioning rule you are using. For example, if using origin_sessid, the counters OSessIDRuleMatchGroup and OSessIDRuleMatchDist will identify how often a transaction was forced to wait (submitted in serial - OSessIDRuleMatchGroup) vs. how often it proceeded in parallel (OSessIDRuleMatchDist). If the parallelism is too low, it might actually be better to reduce the number of parallel threads and try without DSI partitioning. Remember, however, the goal is to reduce the contention. So if by implementing dsi_partitioning_rule = ‘origin_sessid’, you see a drop of AllThreadsBusy from 1000 to 500 and PartitionWaits climbs to 250, but the failed


transactions drop from 1-2 per second to 1 every 10 seconds, this is likely a good thing. The final outcome (as always) is best judged by comparing the aggregate throughput rates for the same transaction mix.


Text/Image Replication

Okay, just exactly how is Replication Server able to replicate non-logged text/image updates???

The fact that Replication Server is able to do this surprises most people. However, if you think about it – the same way that ASE had to provide the capability to insert 2GB of text into a database with a 100MB log – Replication Server had to provide support for it – AND also be able to insert this same 2GB of text into the replicate without logging it for the same reason. The unfortunate problem is that text/image replication can severely hamper Replication Server performance – degrading throughput by 400% or more in some cases. Unfortunately, other than not replicating text, not a lot can be done to speed this process up.

Text/Image Datatype Support

To understand why not, you need to understand how ASE manages text. This is simply because the current biggest limiter on replicating text is the primary and replicate ASE’s themselves. While we are discussing mainly text/image data, remember, this applies to off row java objects as well as these are simply implemented as image storage. Throughout this section, any reference to “text” datatypes should be treated as any one of the three Large Object (LOB) types.

Text/Image Storage

From our earliest DBA days, we are taught that text/image data is stored in a series of page chains separate from the main table. This allows an arbitrary length of text to be stored without regard to the data page limitation of 2K (or ~1960 bytes). Each row that has a text value stores a 16-byte value – called the "text pointer" or textptr – that points to where the page chain physically resides on disk. While this is good knowledge, a bit more is necessary for understanding text replication.

Unlike normal data pages with >1900 bytes of storage, each text page can only store 1800 bytes of text. Consequently, a 500K chunk of text will require at least 285 pages in a linked page chain for storage. The reason for this is that each text page contains a 64-byte Text Image Page Statistics Area (TIPSA) and a 152-byte Sybase Text Node (st-node) structure located at the bottom of the page.

(Each text page holds a 32-byte page header, 1800 bytes of text/image data, the head of the st-node (152 bytes), and the TIPSA (64 bytes).)

Figure 86 – ASE Text Page Storage Format

Typically, a large text block (such as 500K) will be stored in several runs of sequential pages – with the run length depending on concurrent I/O activity to the same segment and available contiguous free space. For example, the 285 pages needed to store 500K of text may be arranged in 30 runs of roughly 10 pages each. Prior to ASE 12.0, updating the end of the text chain – or reading the chain starting at a particular byte offset (as is required in a sense) – meant beginning at the first page and scanning each page of text until the appropriate byte count was reached. As of ASE 12.0, the st-node structure functions similarly to the Unix File System's i-node structure in that it contains a list of the first page in each run and the cumulative byte length of the run. For simplicity's sake, consider the following table for a 64K text chunk spread across 4 runs of sequential pages on disk:


Page Run (page #’s) st-node page byte offset

8 (300-307) 300 14400

16 (410-425) 410 43200

8 (430-437) 430 57600

5 (500-504) 500 65536

This allows ASE to rapidly determine which page needs to be read for the required byte offset without having to scan through the chain. Depending on how "fragmented" the text chain is (i.e. how many runs are used) and the size of the text chain itself, the st-node may require more than 152 bytes. Rather than use the 152 bytes on each page and force ASE to read a significant portion of the text chain simply to read the st-node, the first 152 bytes are stored on the first page while the remainder is stored in its own page chain (hence the slight increase in storage requirements for ASE 12.0 for text data vs. 11.9 and prior systems).

It goes without saying, then, that Adaptive Server Enterprise 12.0+ should be considerably faster at replicating text/image data than preceding versions. Thanks to the st-node index, the Replication Agent's read of the text chain will be faster, and the DSI delivery of text will be faster, as neither one will be forced to repeatedly re-read the first pages in the text chain simply to get to the byte offset where it is currently reading or writing text.

The first page in the chain – pointed to by the 16-byte textptr – is called the First Text Page or FTP. It is somewhat unique in that when a text chain is updated, it is never deleted (unless the data row is deleted). This is surprising but true – even setting the text value explicitly to null still leaves this page allocated, simply empty. The textptr is a combination of the page number of the FTP plus a timestamp. The FTP is important to replication because it is on this page that the TIPSA contains a pointer back to the data row it belongs to. So, while the data row contains a textptr to point to the FTP, the FTP contains the Row ID (RID) back to the row. Should the row move (i.e. get a new RID), the FTP TIPSA must be updated. The performance implications of this at the primary server are fairly obvious (consequently, movements of data rows containing text columns should be minimized).

The FTP value and TIPSA pointers can be derived using the following SQL:

-- Get the FTP. Pretty simple: since it is the first page in the chain and the text pointer in the row
-- points to the first page, all we have to do is retrieve the text pointer
select [pkey columns], FTP = convert(int, textptr(text_column))
  from table
 where [conditions]

-- Getting the TIPSA and the row from the TIPSA is just a bit harder, as straightforward functions for
-- our use are not included in the SQL dialect.
dbcc traceon(3604)
go
dbcc page(dbid, FTP, 2)
go
-- Look at the last 64 bytes, specifically the 6 bytes beginning at offset 1998. The first 4 bytes are
-- the page id (depending on platform, the byte order may be reversed) followed by the last 2 bytes,
-- which are the row id on the page. For APL tables, you can then do a dbcc page on that page and use
-- the row offset table to determine the offset within the page and read the pkey values.

As you can see, determining the FTP is fairly easy, while finding the row from the TIPSA resembles more of a nonclustered index lookup operation, which the dataserver internally can handle extremely well.

Standard DML Operations

Text and image data can be directly manipulated using standard SQL DML insert/update/delete commands. As we were also taught, however, this mode of manipulation logs the text values as they are inserted or updated and is extremely slow. The curious might wonder how a 500K text chunk is logged in a transaction log with a fixed log row size. The answer is that the log will contain the log record for the insert followed by subsequent log records with up to 450 bytes of text data each – the final number of log records depending on the size of the text and the session's textsize setting (i.e. set textsize 65536).
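To make this concrete, a minimal sketch of fully logged text DML is shown below using the pubs2..blurbs table (the au_id value and text are purely illustrative); each statement is written to the transaction log together with the text data itself, subject to the session's textsize setting:

set textsize 65536
go
-- fully logged insert of a text value
insert blurbs (au_id, copy)
values ('486-29-1786', 'A short biography that will be logged along with the insert...')
go
-- fully logged update of the text value
update blurbs
   set copy = 'A revised biography, again logged as a series of text log records...'
 where au_id = '486-29-1786'
go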

SQL Support for Text/Image

In order to speed up text/image updates and retrievals, as well as to provide the capability to insert text data larger than permissible by the transaction log, Sybase added two other verbs to the Transact-SQL dialect – readtext and writetext. Both use the textptr and a byte offset as input parameters to determine where to begin reading or writing the text chunk. In addition, the writetext command supports a no-log mode which signals that the text chunk is not to be logged in the transaction log. Large amounts of text can simply be inserted or updated through repetitive calls to writetext, specifying the byte offset to be where the previous writetext terminated.
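A hedged sketch of this usage against the pubs2..blurbs table follows (the key value is illustrative); by default writetext is minimally logged, while adding "with log" fully logs the text:

declare @ptr varbinary(16)
select  @ptr = textptr(copy)
  from  blurbs
 where  au_id = '486-29-1786'
-- minimally logged replacement of the entire value (add "with log" to fully log it)
writetext blurbs.copy @ptr 'Replacement biography text...'
-- read 100 bytes starting at byte offset 50
readtext blurbs.copy @ptr 50 100
go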

Of special consideration from a replication viewpoint is that the primary key for the row to which the text belongs is never mentioned in the writetext function. The textptr is used to specifically identify which text column value is to be changed instead of the more normal where clause structure with primary key values. Hold this thought until the section on Replication Agent processing below.

Programming API Support

Anyone familiar with Sybase is also familiar (if only in name) with the Open Client programming interface - which is divided into the simple/legacy DB-Lib (Database Library) API interface and the more advanced CT-Lib (Client Library) interface. Using either, standard SQL queries – including DML operations – can be submitted to the ASE database engine. Of course, this is one way to actually modify the text or image data – but as we have all heard, DML is extremely slow at updating text/image and forces us to log the text as well (which may not be supportable). Consequently, both support API calls to read/write text data to ASE very similar to the readtext/writetext functions described above. For example, in CT-Lib, ct_send() is used to issue SQL statements to the dataserver while ct_get_data() and ct_send_data() are used to read and write text respectively. Similar to writetext, ct_send_data() supports a parameter specifying whether the text data is to be logged. Note that while we have discussed these functions as if they followed the readtext/writetext implementation, in reality, the API functions basically set the stage for the SQL commands rather than the other way around. In any case, similar to writetext, the sequence for inserting a text chunk using the CT-Lib interface would look similar to:

ct_send()        -- send the insert statement with dud data for text (init pointer)
ct_send()        -- retrieve the row to get the textptr just init'd
ct_send_data()   -- send the first text chunk
ct_send_data()   -- send the next text chunk
ct_send_data()   -- send the next text chunk
…
ct_send_data()   -- send the last text chunk

The number of calls depends on how large a temporary buffer the programmer wishes to use to read the text (probably from a file) into memory and pass to the database engine. A somewhat important note is that the smaller the buffer, the more likely the text chain will be fragmented and require multiple series of runs.

Of all the methods currently described, the ct_send_data() API interface is the fastest method to insert or update text in a Sybase ASE database.

RS Implementation & Internals

Now that we know how text is stored and how it can be manipulated, we can begin applying this knowledge to understand the issues with replicating text.

sp_setreptable Processing

If not the single most common question, the question "Why does sp_setreptable take soooo long when executed against tables containing text or image columns?" certainly ranks in the top ten questions asked of TSE. The answer is truthfully – to fix an oversight that ASE engineering "kinda forgot". If you remember from our previous discussion, the FTP contains the RID for the data row in its TIPSA. The idea is that simply by knowing what text chain you were altering, you would also know what row it belongs to. This is somewhat important. If a user chose to use writetext or ct_send_data(), a lock should be put on the parent row to avoid data concurrency issues. However, ASE engineering chose instead to control locking via locking the FTP itself. In that way (lazily) they were protected in that updates to the data row also would require a lock on the FTP (and would block if someone was performing a writetext), and concurrent writetexts would block as well. Unfortunately for Replication Server engineering, this meant that ASE never maintained the TIPSA data row RID if the RID was never initialized – which frequently was the case, especially in databases upgraded from releases prior to ASE 12.0. In order to support replication, the TIPSA must be initialized with the RID for each data row. Consequently, sp_setreptable contains an embedded function that scans the table and, for each data row that contains a valid textptr, updates the column's FTP TIPSA with the RID. Since a single data row may contain more than one text or image column, this may require more than one write operation. To prevent phantom reads and other similar issues, this is done within the scope of a single transaction, effectively locking the entire table until the process completes. The code block is easily located in sp_setreptable by the line:

if (setrepstatus(@objid, @setrep_flags) != 1)

Unfortunately, as you can imagine, this is NOT a quick process. On a system with 500,000 rows of data containing text data (i.e. 500,000 valid text pointers), it took 5 hours to execute sp_setreptable (effectively 100,000 textptrs/hour – the usual caveat that your time may vary is applicable). An often-used metric is that the time required is about the same as that to build a new index (assuming a fairly wide index key so the number of I/O's is similar).


Key Concept #31: The reason sp_setreptable takes a long time on tables containing text/image columns, is that it must initialize the First Text Page’s TIPSA structure to contain the parent row’s RID.

There is a semi-supported method around this problem provided that pre-existing text values in a database will never be manipulated via writetext or ct_send_data(). That method is to use the legacy sp_setreplicate procedure which does not support text columns and then call sp_setrepcol as normal to set the appropriate mode (i.e. replicate_if_changed). This executes immediately and supports replication of text data manipulated through standard DML operations (insert/update/delete) as well as new text values created with the writetext and ct_send_data methods and slow bcp operations.
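A hedged sketch of that workaround, again using pubs2..blurbs (remember that pre-existing text values must then never be touched via writetext or ct_send_data()):

-- legacy marking: returns immediately, no TIPSA initialization scan
exec sp_setreplicate 'blurbs', 'true'
go
-- then set the desired text replication mode as normal
exec sp_setrepcol 'blurbs', 'copy', 'replicate_if_changed'
go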

Replication Agent Processing

Now, the nagging question – “Why on earth is initializing the FTP TIPSA with the RID so critical??” Some may already have guessed. If a user specifies a non-logged writetext operation and only modifies the text data (i.e. no other columns in row changed), then it would be impossible for the Replication Server to determine which row the text belonged to at the replicate. Remember, replicated databases have their own independent allocation routines, consequently, even in Warm Standby, there is no way to guarantee that because a particular text chain starts at page 23456 at the primary that the identical page will be used at the replicate. This is especially true in non-Warm Standby architectures such as shared primary or corporate rollup scenarios in which the page more than likely will be allocated to different purposes (perhaps an OAM page in one, while a text chain in the other).

As a result, the Replication Server MUST be able to determine the primary keys for any text column modified. As you could guess, this task falls to the Replication Agent. While we have used the term "NOLOG" previously, as those with experience know, in reality there is no such thing as an "unlogged operation" in Sybase. Instead, operations are considered "minimally logged" – which means that while the data itself is not logged, the space allocations for the data are logged (required for recovery). In addition to logging the space allocations for text data, the text functions internal to ASE check to see what the replication status is for the text column any time it is updated. If the text column is to be replicated, ASE inserts a log row in the transaction log containing the normal logging information (transaction id, object id, etc.) as well as the textptr.

The Replication Agent reads the log record, extracts the textptr and parses the page number for the text chain. Then it simply reads the FTP TIPSA for the RID (itself a combination of a page number and row id) along with table schema information (column names and datatypes as normal) and reads the parent row from the data page. If the text chain was modified with a writetext, the Replication Agent tells the Replication Server what the primary keys were by first sending a rs_datarow_for_writetext function with all of the columns and their values.

Key Concept #32: The Replication Agent uses the FTP TIPSA RID to locate the parent row and then constructs a replicated function rs_datarow_for_writetext to send with the text data to identify the row at the replicate.

In either case – text modified via DML or writetext – similar to transaction logging of text data, in order to send data to the Replication Server, the Replication Agent must break up the text into multiple chunks and send them via multiple rs_writetext "append" calls. An example of this from a normal logged insert of data is illustrated in the LTL block below (note the text status markers on the text/image columns and the series of rs_writetext functions that follow the rs_insert).

distribute @origin_time='Apr 15 1988 10:23:23.001PM',
    @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000001,
    @tran_id=0x000000000000000000000001
    begin transaction 'Full LTL Test'
distribute @origin_time='Apr 15 1988 10:23:23.002PM',
    @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000002,
    @tran_id=0x000000000000000000000001
    applied 'ltltest'.rs_insert yielding after
    @intcol=1, @smallintcol=1, @tinyintcol=1, @rsaddresscol=1, @decimalcol=.12,
    @numericcol=2.1, @identitycol=1, @floatcol=3.2, @realcol=2.3, @charcol='first insert',
    @varcharcol='first insert', @text_col=hastext always_rep,
    @moneycol=$1.56, @smallmoneycol=$0.56, @datetimecol='4-15-1988 10:23:23.001PM',
    @smalldatetimecol='Apr 15 1988 10:23:23.002PM',
    @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899,
    @imagecol=hastext rep_if_changed, @bitcol=1
distribute @origin_time='Apr 15 1988 10:23:23.003PM',
    @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000003,
    @tran_id=0x000000000000000000000001
    applied 'ltltest'.rs_writetext append first last changed with log textlen=30
    @text_col=~.!!?This is the text column value.
distribute @origin_time='Apr 15 1988 10:23:23.004PM',
    @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000004,
    @tran_id=0x000000000000000000000001
    applied 'ltltest'.rs_writetext append first changed with log textlen=119
    @imagecol=~/!"!gx"3DUfw@4ª»ÌÝîÿðÿ@îO@Ý@y@f9($&8~'ui)*7^Cv18*bhP+|p{`"]?>,D *@4ª
distribute @origin_time='Apr 15 1988 10:23:23.005PM',
    @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000005,
    @tran_id=0x000000000000000000000001
    applied 'ltltest'.rs_writetext append
    @imagecol=~/!!7Ufw@4ª"ÌÝîÿðÿ@îO@Ý@y@f
distribute @origin_time='Apr 15 1988 10:23:23.006PM',
    @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000006,
    @tran_id=0x000000000000000000000001
    applied 'ltltest'.rs_writetext append last
    @imagecol=~/!!Bîÿðÿ@îO@Ý@y@f9($&8~'ui)*7^Cv18*bh
distribute @origin_time='Apr 15 1988 10:23:23.007PM',
    @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000007,
    @tran_id=0x000000000000000000000001
    applied 'ltltest'.rs_update yielding
    before @intcol=1, @smallintcol=1, @tinyintcol=1, @rsaddresscol=1, @decimalcol=.12,
        @numericcol=2.1, @identitycol=1, @floatcol=3.2, @realcol=2.3, @charcol='first insert',
        @varcharcol='first insert', @text_col=notrep always_rep, @moneycol=$1.56, @smallmoneycol=$0.56,
        @datetimecol='Apr 15 1988 10:23:23.002PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM',
        @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899,
        @imagecol=notrep rep_if_changed, @bitcol=1
    after @intcol=1, @smallintcol=1, @tinyintcol=1, @rsaddresscol=1, @decimalcol=.12,
        @numericcol=2.1, @identitycol=1, @floatcol=3.2, @realcol=2.3, @charcol='updated first insert',
        @varcharcol='first insert', @text_col=notrep always_rep, @moneycol=$1.56, @smallmoneycol=$0.56,
        @datetimecol='Apr 15 1988 10:23:23.002PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM',
        @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899,
        @imagecol=notrep rep_if_changed, @bitcol=0

A couple of points are illustrated above:

• The base function (insert/update) contains the replication status and also whether or not the column contains data. In the last example, “notrep” refers to the fact that the text chain is empty.

• The text replication is passed through a series of rs_writetext append first, append, append, …., append last functions with each specifying the number of bytes.

As you could guess, even when the text is not logged, the Replication Agent can simply read the text chain (after all, it has already started to in order to find the RID in the FTP TIPSA).

Key Concept #33: Similar to the logging of text data, text data is passed to the Replication Server by “chunking” the data and making multiple calls until all the text data has been sent to the Replication Server.

Changes in ASE 15.0.1

Because of customer complaints about the impracticality of marking large pre-existing text columns for replication, ASE implemented a different method in ASE 15.0.1 that does not involve updating the TIPSA. Instead, ASE 15.0.1 provides the option of creating an index on the text pointer value in the base table. As a result, when the Replication Agent is scanning the log and sees a text chain allocation, it can perform an internal query of the table via the text pointer index to find the data row belonging to the text chain. This can be enabled using the following syntax:

-- Warm Standby and MSA syntax with DDL replication
sp_reptostandby <db_name> [,'ALL' | 'NONE' | 'L1'] [, 'use_index']

-- Standard table replication marking
sp_setreptable <table_name> [, true | false] [, owner_on | owner_off | null] [, use_index]

-- Standard text/image column replication marking
sp_setrepcol <tab_name> [, column_name] [, do_not_replicate | replicate_if_change | always_replicate] [, use_index]

As you can see, the only difference between these and pre-ASE 15.0.1 systems is the final parameter of 'use_index' (or null if using the pre-ASE 15.0.1 implementation). This implementation has advantages and disadvantages:

• Advantages

  o The speed of this index creation obviously depends on the size of the table as well as the settings for 'number of sort buffers' and parallel processing.
  o On extremely large tables, this still is likely to complete in hours vs. days.
  o Read-only queries can still execute as create index only uses a shared table lock.


• Disadvantages

  o On really large tables, more I/O's will need to be performed traversing the index to find the data row, whereas in the TIPSA method the row's page pointer is located on the first text page.
  o Additional storage space is required to store the text pointer index.
  o Normal DML operations (such as inserts, updates and deletes) may incur extra processing to maintain the index (except updates where the text column is not modified, in which case the text pointer index would be considered a 'safe' index).

As a result, if you are expecting a large number of text operations and can absorb the upfront cost of the TIPSA method, you may wish to use it instead of the text pointer index. In addition to these considerations, the text/image marking precedence is column, then table, then database. As a result, if the database is marked 'use_index' but a specific table is marked using the TIPSA method, the table has precedence and will use the TIPSA method.
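For illustration, marking the pubs2..blurbs table and its text column with the index-based method might look like the following (ASE 15.0.1+ only; object names are simply those of pubs2):

-- table-level marking using the text pointer index
exec sp_setreptable 'blurbs', 'true', null, 'use_index'
go
-- column-level marking, also using the index method
exec sp_setrepcol 'blurbs', 'copy', 'replicate_if_changed', 'use_index'
go
-- or, for a Warm Standby/MSA database, mark the whole database with the index method
exec sp_reptostandby 'pubs2', 'ALL', 'use_index'
go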

RS & DSI Thread Processing

As far as the Replication Server is concerned, text data is handled no differently than any other data, except of course that the DIST thread needs to associate the multitude of rows with the subscription on the DML function (rs_insert) or as designated by the rs_datarow_for_writetext. You may have wondered previously why the rs_datarow_for_writetext doesn't simply contain only the primary key columns vs. the entire row. There actually are two reasons: 1) the DBA may have been lazy and not actually identified the primary key (used a unique index instead); and 2) subscriptions on non-primary key searchable columns would otherwise be useless. The latter is probably the more important of the two – without all of the columns, if a site subscribed to data from the primary based on a searchable column (i.e. state in pubs2..authors), the site would probably never receive any text data. However, by providing all data, the DIST thread can check for searchable columns within the data row to determine the destination for the text values.

The bulk of the special handling for text data within the Replication Server is within the DSI thread. First, the DSI thread treats text as a large transaction. In itself, this is not necessarily odd as often text write operations result in a considerable number of rows in the replication queues. However, the biggest difference is how the DSI handles the text from a replicated function standpoint.

Replicated Text Functions

As we discussed earlier, when a text row is inserted using regular DML statements at the primary, the primary log will contain the insert and multiple inserttext log records. The replication agent, as we saw from above, translates this into the appropriate rs_insert and rs_writetext commands. At the replicate, we are lacking something fairly crucial – the textptr. Consequently, the DSI first sends the rs_insert as normal and then follows it with a call to rs_init_textptr – typically an update statement for the text column setting it to a temporary string constant. It then follows this with a call to rs_get_textptr to retrieve the textptr for the text chain allocation just created. Once it receives the textptr, the DSI uses the CT-LIB ct_send_data() function to actually perform the text insert. From a timeline perspective, this looks like the below

Rep Agent LTL (inbound):                      DSI commands (to replicate):

  distribute rs_insert                          rs_insert
  ...                                           ...
  distribute rs_writetext append first          rs_init_textptr
  distribute rs_writetext append                rs_get_textptr (returns the textptr)
  distribute rs_writetext append last           rs_writetext (text chunks via ct_send_data)
                                                rs_writetext

Figure 87 – Sequence of calls for replicating text modified by normal DML.


For text inserted at the primary using writetext or ct_send_data, the sequence is a little different. As we discussed before, because the textreq function within the ASE engine is able to determine if the text is to be replicated – even when a non-logged text operation is performed – ASE will put a log record in the transaction log. The Replication Agent, in reading this record, retrieves the RID from the TIPSA and then creates an rs_datarow_for_writetext function. After that, the normal rs_writetext functions are sent to the Replication Server. The DSI simply does the same thing. It first sends the rs_datarow_for_writetext to the replicate. It is then followed by the rs_init_textptr and rs_get_textptr functions as above.

The role of rs_datarow_for_writetext is actually twofold. Earlier, we discussed the fact that it is used to determine the subscription destinations for the text data. For rows inserted with writetext operations, it is also used to provide the column values to the rs_init_textptr and rs_get_textptr function strings so the appropriate row for the text can be identified at the replicate and have its textptr initialized.

The sequence of calls for replicating text modified by writetext or ct_send_data is illustrated below:

Rep Agent LTL (inbound):                      DSI commands (to replicate):

  distribute rs_datarow_for_writetext           rs_datarow_for_writetext
  ...                                           ...
  distribute rs_writetext append first          rs_init_textptr
  distribute rs_writetext append                rs_get_textptr (returns the textptr)
  distribute rs_writetext append last           rs_writetext (text chunks via ct_send_data)
                                                rs_writetext

Figure 88 – Sequence of calls for replicating text modified by writetext or ct_send_data().

This brings the list of function strings to 4 for handling replicated text. Thankfully, if using the default function classes (rs_sqlserver_function_class or rs_default_function_class), these are generated for you. However, what if you are using your own function class?? If using your own function class, you will not only need to create these four function strings, but you will also need to understand the following:

• Text function strings have column scope. In other words, you will have to create a series of function strings for each text/image column in the table. If you have 2 text columns, you will need two definitions for rs_get_textptr, etc.

• The textstatus modifier available for text/image columns in normal rs_insert, rs_update and rs_delete, as well as rs_datarow_for_writetext and rs_init_textptr, is crucial to avoid allocating text chains when no text data was present at the primary.

In regards to the first bullet, the text function strings for each text column are identified by the column name after the function name. In the following paragraphs, we will discuss these functions in a little more detail.

Text Function Strings

Consider the pubs2 database. In that database, the blurbs table contains biographies for several of the authors in a column named “copy”. If we were to create function strings for this table, they might resemble the below:

create function string blurbs.rs_datarow_for_writetext;copy
for sqlserver2_function_class
output language ' '

Note the name of the column in the function string name definition. As noted earlier, the rs_datarow_for_writetext is sent when a writetext operation was executed at the primary. In the default function string classes, this function is empty for the replicate – the rs_get_textptr function is all that will be necessary. However, in the case of a custom function class, you may want to have this function perform something – for example insert auditing or trace information into an auditing database.
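For instance, a hedged sketch of such an auditing variant is shown below; the text_audit table, its columns and the custom class name are purely illustrative:

create function string blurbs.rs_datarow_for_writetext;copy
for sqlserver2_function_class
output language
'insert text_audit (au_id, audit_time, audit_note)
 values (?au_id!new?, getdate(), "copy column modified via writetext at the primary")'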


Typically the next function sent is the rs_init_textptr, which might look like the below:

create function string blurbs.rs_textptr_init;copy
for sqlserver2_function_class
output language
'update blurbs set copy = "Temporary text to be replaced" where au_id = ?au_id!new?'

This, at first, appears to be a little strange. However, remember, we need a valid text pointer before we start using writetext operations. But since we haven't sent any text yet….kind of a catch-22 situation. Consequently, we simply use a normal update command to insert some temporary text into the column, knowing that the real text will begin at an offset of 0 and therefore will write over top of it. Note that in the examples in the book, it sets the column to a null value. This can be problematic. Although setting a text column to null is supposed to allocate a text chain, in earlier versions of SQL Server there was no guarantee that setting the text column to null would do so (in fact, it seemed that ~19 bytes of text was the guideline for System 10.x). In addition, there is a little known (thankfully) option to sp_changeattribute - dealloc_first_txtpg - which asynchronously deallocates text pages with null values. As a result, text replication may fail as the text pointer may get deallocated before the RS retrieves it - or may get deallocated between the time RS allocates it and the first text characters are sent to the ASE. Any time you get an invalid textpointer error or a zero rows error for the textpointer, it is a good idea to check the RS commands being sent (using trace "on","DSI","DSI_BUF_DUMP"), validating that the text row exists and that the table attribute for dealloc_first_txtpg is not set. Consequently, to ensure that the text chain is indeed allocated when needed, rather than initializing the textpointer using an update setting textcol=null, you may want to use an update setting textcol to some arbitrary string.
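As a hedged troubleshooting sketch of the checks just described (the table name is illustrative):

-- in the Replication Server: dump the SQL the DSI is actually sending
trace 'on', 'DSI', 'DSI_BUF_DUMP'
go
-- ... reproduce the failure and inspect the RS errorlog, then ...
trace 'off', 'DSI', 'DSI_BUF_DUMP'
go

-- in the replicate ASE: make sure asynchronous deallocation of first text pages is turned off
exec sp_changeattribute 'blurbs', 'dealloc_first_txtpg', 0
go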

After initializing the textptr, the next function Replication Server sends is the rs_get_textptr function:

create function string blurbs.rs_get_textptr;copy
for sqlserver2_function_class
output language
'select copy from blurbs where au_id = ?au_id!new?'

Those who have worked with SQL text functions may be surprised at the lack of a textptr() function call in the output mask as in “select textptr(copy) from …”. This is deliberate. Those familiar with CT-Lib programming know that when a normal select statement without the textptr function is used, it is the pointer itself that is bound using ct_bind() and ct_fetch() calls. The textptr() function solely exists so that those using the SQL writetext and readtext commands can pass it a valid textptr. The CT-Lib API essentially has it built-in as it is only with the subsequent ct_get_data() or ct_send_data() calls that the actual text is manipulated. Since Replication Server uses CT-Lib API calls to manipulate text, the textptr() function is then unnecessary.

Of special note, it is often the lack of a valid textptr – or the return of more than one – that frequently will cause a Replication Server DSI thread to suspend. If this should happen, check the queue for the proper text functions as well as checking the RSSD for a fully defined function string class. The error could be transient, but it also could point to database inconsistencies where the parent row is actually missing.

Finally, the text itself is sent using multiple calls to rs_writetext. The rs_writetext function can perform the text insert in three different ways. The first is the more normal writetext equivalent as in:

create function string blurbs.rs_writetext;copy
for rs_sqlserver2_function_class
output writetext use primary log

In this example, RS will use ct_send_data() API calls to send the text to the replicate using the same log specification that was used at the primary. While this is the simplest form of the rs_writetext functions, it is probably the most often used as it allows straightforward text/image replication between two systems that provide ct_send_data() for text manipulation (which is also why text is one of the biggest problems in replicating through gateways, which typically do not). An alternative is the RPC mechanism, which can be used to replicate text through an Open Server:

create function string blurbs.rs_writetext;copy
for gw_function_class
output rpc
'execute update_blurbs_copy
    @copy_chunk = ?copy!new?,
    @au_id = ?au_id!new?,
    @last_chunk = ?rs_last_text_chunk!sys?,
    @writetext_log = ?rs_writetext_log!sys?'

This also could be used to replicate text from a source database to a target in which the text has been split into multiple varchar chunks. Note that in this case, two system variables are used to flag whether this is the last text chunk and whether it was logged at the primary. The former could be used if the target is buffering the data to ensure uniform record lengths (i.e. 72 characters) and to handle white space properly. When the last chunk is received, the Open Server could simply close the file – or, if a dataserver, it could update the master record with the number of varchar chunks. Note that the Replication Server handles splitting the text into chunks of 255 bytes or less, avoiding datatype issues.
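A hedged sketch of what the replicate-side procedure named in that RPC function string might look like follows; the chunk table (copy_chunks), master table (blurbs_master) and parameter datatypes are all assumptions made for illustration:

create procedure update_blurbs_copy
    @copy_chunk    varchar(255),   -- one <= 255 byte chunk from ?copy!new?
    @au_id         varchar(11),    -- key of the parent row
    @last_chunk    int,            -- rs_last_text_chunk system variable (1 on the final chunk)
    @writetext_log int             -- rs_writetext_log system variable (1 if logged at the primary)
as
begin
    -- append this chunk with the next sequence number for the row
    insert copy_chunks (au_id, chunk_seq, chunk_data)
    select @au_id, isnull(max(chunk_seq), 0) + 1, @copy_chunk
      from copy_chunks
     where au_id = @au_id

    -- when the last chunk arrives, stamp the master record with the chunk count
    if @last_chunk = 1
        update blurbs_master
           set chunk_count = (select count(*) from copy_chunks where au_id = @au_id)
         where au_id = @au_id
end
go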

The final method for rs_writetext is in fact to prevent replication via no output:

create function string blurbs.rs_writetext;copy
for rs_sqlserver2_function_class
output none

This disables text replication regardless of the setting of sp_setrepcol.

Text Function Modifiers

The second aspect of text replication that takes some thought is the role of the text variable modifiers. While other columns support the usual old and new modifiers for function strings, as in ?au_lname!new?, text does not support the notion of a before and after image. The main reason for this is that while the text rows may be logged, unlike normal updates to tables, the before image is not logged when text is updated. Additionally, if the primary application opts not to log the text being updated, the after image isn't available from the log either. While it is true that the text does get replicated, so that in a sense an "after image" does exist, remember that text is replicated in chunks, so a single cohesive after image is not available. Even if it were, the functionality would be extremely limited as the support for text datatypes is extremely reduced.

However, text columns do support two modifiers: new and text_status. Before you jump and say "wait a minute, didn't you just say…", the answer is sort of. In the previous paragraph, we were referring to old and new as they apply to the before and after images captured from the transaction log. The new text modifier instead refers to the current chunk of text contents without regard to whether it holds the old or new values. For example, if the column is marked "always_replicate" and a primary transaction updates a column in the table other than the text column (and minimal column replication is not on), then the text column will be replicated. In this scenario, the "new" chunks are really the "old" values, which are still the same. The whole purpose of "new" in this sense was to provide an interface into the text chunks as they are provided through the successive rs_writetext commands. An example of this can be found near the end of the previous section when discussing the RPC mechanism for replicating text to Open Servers (which could then write it to a file). In that example (repeated below), the "new" variable modifier was used to designate the text chunk string vs. the column's text status.

create function string blurbs.rs_writetext;copy
for gw_function_class
output rpc
'execute update_blurbs_copy
    @copy_chunk = ?copy!new?,
    @au_id = ?au_id!new?,
    @last_chunk = ?rs_last_text_chunk!sys?,
    @writetext_log = ?rs_writetext_log!sys?'

For non-RPC/stored procedure mechanisms, text columns also support the text_status variable modifier, which specifies whether the text column actually contains text or not. The values for text_status are:

Hex      Dec   Meaning
0x0000   0     Text field contains a NULL value, and the text pointer has not been initialized.
0x0002   2     Text pointer is initialized.
0x0004   4     Real text data will follow.
0x0008   8     No text data will follow because the text data is not replicated.
0x0010   16    The text data is not replicated but it contains NULL values.

During normal text replication, these modifiers are not necessary. However, if using custom function strings, these status values allow you to customize behavior at the replicate – for example, avoiding initializing a text chain when no text exists at the primary. Consider the following:

create function string blurbs_rd.rs_update
for my_custom_function_class
with overwrite
output language
'if ?copy!text_status? < 2
    -- do nothing since no text was modified
else if ?copy!text_status? = 2 or ?copy!text_status? = 4
    insert into text_change_tracking (xactn_id, key_val)
    values (?rs_origin_xactn_id!sys?, ?au_id!new?)
else if ?copy!text_status? = 8
    -- text is not replicated
else if ?copy!text_status? = 16
    insert into text_change_tracking (xactn_id, key_val, text_col)
    values (?rs_origin_xactn_id!sys?, ?au_id!new?, "(text was deleted or set to null at the primary)")
'

The above function string – or one similar – could be used as part of an auditing system that would only allocate a text chain when necessary – and also signal when the primary text chain may have been eliminated via being set to null.

Performance Implications

As mentioned earlier, the throughput for text replication is much, much lower than for non-text data. In fact, during a customer benchmark in which greater than 2.5GB/hr was sustainable for non-text data, only 600MB/hr was sustainable for text data (or 4x worse). The reason for this degradation is somewhat apparent from the above discussions.

Replication Agent Processing

It goes without saying that if the text or image data isn’t logged, then the Replication Agent has to read it from disk – and more than likely physical reads. While the primary transaction may have only updated several bytes by specifying a single offset in the writetext function, the Replication Agent needs to read the entire text chain.

As it reads the text chain, if the original function was a writetext or ct_send_data, it first has to read the row’s RID from the FTP TIPSA, read the row from the base table and construct the rs_datarow_for_writetext function as well. Then as it begins to scan the text chain, it begins to forward the text chunks to the Replication Server. While reading the text chain, all other Rep Agent activity in the transaction log is effectively paused. In highly concurrent or high volume environments, this could result in the Replication Agent getting significantly behind. As mentioned earlier, it might be better to simply place tables containing text or image data in a separate database and replicate both.

Replication Server Processing

Within the Replication Server itself, replicating text can have performance implications. First, it will more than likely fill the SQT cache – and also be the most likely victim of a cache flush meaning it will have to be read from disk. Consequently, not only will the stable queue I/O be higher due to the large number of rs_writetext records required, but also during the transaction sorting, it is almost guaranteed that it will have to be re-read from disk.

The main impact within the Replication Server however, is at the DSI thread. Consider the following points:

• Text transactions can't be batched.

• The DSI has to get the textptr before the rest of the text can be processed. This requires more network interaction than most other types of commands.

• Each rs_writetext function is sent via successive calls to ct_send_data(). While this is the fastest way to handle text, it is not fast. Consider the fact that in ASE versions prior to ASE 12.0, the database engine would have to scan the text pages to find the byte offset. Consequently, processing a single rs_writetext is slower than an rs_insert or other similar normal DML function.

Net Impact

Replicating text will always be considerably slower than regular data. If not that much text is crucial to the application, then replicating text may not have that profound of an impact on the rest of the system. However, if a lot of text is expected, then performance could be severely degraded. At this juncture, application developers have really only three choices:

1. Replicate the text and endure the performance degradation.

2. Use custom function strings to construct a list of changed rows and then, asynchronously to replication, have an extraction engine move the text/image data.

3. Don't replicate text/image at all.

Which one is best is determined by the business requirements. For most workflow automation systems, the text is irrelevant and therefore simply can be excluded from replication. However, for high availability architectures involving a Warm Standby, text replication is required.


Asynchronous Request Functions

Just exactly why were Asynchronous Request Functions invented, anyway? It is an even toss-up as to which replication topic is least understood – text replication, Parallel DSI's, or asynchronous request functions. Even those who understand what request functions do often don't understand the impact that they can have on replication performance. In this section, we will be taking a close look at Asynchronous Request Functions and the performance implications of using them.

Purpose

During normal replication, it is impossible for a replicated data item to be re-replicated back to the sender or sent on to other sites (without the old LTM "–A" mode or the current send_maint_xacts_to_replicate configuration for the Replication Agent). However, in some cases this might be necessary. There are many real-life scenarios in which a business unit needs to submit a request to another system and have the results replicated back. While it is always possible to have the first system simply execute a stored procedure that is empty of code as a crude form of messaging, the problem with this is that the results are not replicated back to the sender. The reason is simple – the procedure would be executed at the target by the maintenance user, whose transactions are filtered out. It is also possible to configure the Replication Agent to not filter out the maintenance user, but that could lead to the "endless loop" replication problem. The obvious solution is asynchronous request functions – although, because they are often overlooked, they may not seem so obvious. In the next couple of sections, we discuss several scenarios of real-life situations in which asynchronous request functions make sense.
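For reference only, a hedged sketch of the Rep Agent option mentioned above is shown below – note that this is precisely the configuration the text warns can lead to endless-loop replication, and the parameter spelling may vary by ASE version:

use pubs2
go
exec sp_config_rep_agent 'pubs2', 'send maint xacts to replicate', 'true'
go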

Key Concept #34: Asynchronous Request Functions were intended to let a replicate system asynchronously request that the primary perform some changes and then re-replicate those changes back to the replicate.

Web – Internal Requests

Let’s assume we are working for a large commercial institution such as a bank or a telephone utility company. As part of our customer service (and to stay competitive), we have created a web site for our customers to view online billing/account statements or whatever. However, to protect our main business systems from the ever-present hackers and to ensure adequate performance for internal processes, we have separated the web-supported database from the database used by internal applications (a very, very good idea that is rarely implemented). In addition, to make this site work for us and to reduce the number of customer service calls handled by operators, we would like the customer to be able to change their basic account information (name, mailing address) as well as perform some basic operations (online bill pay, transfer funds). Sounds pretty normal right???

The problem with this is, how do you handle the name changes, etc.??? In some systems, you can’t – you have to provide a direct interface to the main business systems. However, with Replication Server, you simply implement each of the customer’s actions as “request functions”, in which the request for a name change, bill payment, whatever is forwarded to the main business system, processed and then the results replicated back. You could easily picture this as being something similar to:

(figure: a Web Database behind an App Server forwards Account Requests to the internal Business Systems, which replicate Account Transactions back to the Web Database)

Figure 89 – Typical Web/Internal Systems Architecture

In fact, this is the way most commercial bank web sites work; the architecture is extremely viable and reduces the risk to mission-critical systems by isolating the main business systems from the load and security risks of web users.

Corporate Change Request

In many large systems, some form of corporate-controlled data exists which can only be updated at the corporate site. A variation of this is a sort of change nomination process in which the change nomination is made to the headquarters and, due to automated rules, the change is made. One example in which this applies is a budget programming system. As lower levels submit their budget requests, the corporate budget is reduced and the budgeted items are replicated back to subscribing sites. At the headquarters system, rules could be in place, such as whether or not the amount exceeds certain dollar thresholds based on the type of procurement.

This scenario is a bit different than most as the local database would not be strictly executing a request function. More than likely, a "local change" would be enacted – i.e. a record saved in the database with a "proposed" status. Once the replicated record is received back from headquarters, it simply overwrites the existing record. In addition, due to the hierarchical nature of most companies, a request from a field office for a substantial funding item may have to be forwarded through intermediates – in effect, the request function is replicated on to other, more senior organizations due to approval authority rules.

(figure: Field offices send Budget Requests & Expenditures to Regional sites, which forward Budget Requests and Total Expenditures to Corporate; Approved Requests and Budgeted Amounts are replicated back down the hierarchy)

Figure 90 – Typical Corporate Change Nomination/Request Architecture

Update Anywhere

Whoa!!! This isn’t supposed to be able to be done with Sybase Replication Server. For years we have been taught the sanctity of data ownership and woe to the fool who dared to violate those sacred rules as they would be forever cursed with inconsistent databases.

Not. Consider the fact that you and your spouse are both at work…only you happen to be traveling out of the area. Now, picture a bad phone bill (or something similar) in which you both call to change the address, account names or something – but provide slightly different information (i.e. work phone number). The problem is that by being in two different locations and using the same toll-free number, you were probably routed to different call centers with (gasp) different data centers. The fledgling Sybase DBA's answer is that this can't be done. However, keep in mind that the goal is to have all of the databases consistent – which of the two sets of data is the most accurate portrayal of the customer information is somewhat irrelevant. With that in mind, look at the following architecture.


(figure: six sites – New York, Los Angeles, Dallas, Washington DC, Chicago and San Francisco – each submit request functions (Request #1, Request #2) to a central Arbitrator site, which replicates the responses (Response "A", Response "B") back to all sites in commit order)

Figure 91 – Update Anywhere Request Function Architecture

No matter what order request 1 or 2 occur in, the databases will all have the same answer. The reason? We are exploiting the commit sequence assurance of Replication Server. In this case, it is the commit sequence of the request functions at the “arbitrator”. If request #2 commits first, then it will get response A and request #1 will get response B. Since commit order is guaranteed via Replication Server, then every site will have the response (A) from request 2 applied ahead of the response (B) from request 1.

Implementation & Internals

Now that we have established some of the reasons why a business might want to use Asynchronous Request Functions, the next thing to consider is how they are implemented. Frequently, another reason administrators don't implement request functions is a lack of understanding of how to set them up. In this section, we will explore this and how the information gets to the Replication Server.

Replicate Database & Rep Agent

Perhaps before discussing what happens internally, a good idea might be to review the steps necessary to create an asynchronous request function.

Implementing Asynchronous Request Functions

In general, the steps are:

1. If not already established, make sure the source database is established as a primary database for replication (i.e. has a Rep Agent, etc.).

2. Create the procedure to function as the asynchronous request function. This could be an "empty" procedure – or it could have logic to perform "local" changes (i.e. set a status column to "pending").

3. Mark the procedure for replication in the normal fashion (sp_setrepproc).

4. Create a replication definition for the procedure, specifying the primary database as the target (or recipient) desired and not the source database actually containing the procedure.

5. Make sure the login names and passwords are in synch between the servers for users who have permission to execute the procedure locally (including those who can perform DML operations if the proc is embedded in a trigger).

6. Ensure that the common logins have permission to execute the procedure at the recipient database.

A bit of explanation might be in order for the last three. Regarding step #4, the typical process of replicating a procedure from a primary to a replicate involves creating a replication definition and subscription as normal similar to:

At the PRS – HQ.funding (my_proc_name exists here):

    create function replication definition my_proc_name
    with primary at HQ.funding
    deliver as 'hq_my_proc_name'
    (…param list…)
    searchable parameters (…param list…)

At the RRS – NY.funding (hq_my_proc_name exists here):

    create subscription my_proc_name_sub
    for my_proc_name
    with replicate at NY.funding

Figure 92 – Applied (Normal) Procedure Replication Definition Process

This illustrates a normal replicated procedure from HQ to NY. For request functions, the picture changes slightly to:

At the PRS – HQ.funding (ny_req_my_proc_name exists here):

    create function replication definition ny_my_proc_name
    with primary at HQ.funding
    deliver as 'ny_req_my_proc_name'
    (…param list…)
    searchable parameters (…param list…)

At the RRS – NY.funding (ny_my_proc_name exists and is executed here):

    (no subscription)

Figure 93 – Asynchronous Request Function Replication Definition Process

In this illustration, NY is sending the request function to HQ and the resulting changes are replicated back. Note that in the above example, the "with primary at" clause specifies the recipient (HQ in this case) and not the source (NY), and that the replication definition was created at the primary PRS for the recipient. One way to think of it is that an asynchronous request function replication definition functions as both a replication definition and a subscription.

A couple of points that many might not consider in implementing request functions:

• A single replicated database can submit request functions to any number of other replicated databases. Think of a shared primary configuration of 3 or more systems. Any one of the systems could send a request function to any of the others.

• While a single site can send request functions to any number of sites, a single request function can only be sent to a single recipient site. This restriction is due to the fact a single procedure needs to have a unique replication definition and that definition can only specify a single “with primary at” clause.

• In order to send a request function to another system, a route must exist between the two replicated systems.
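Pulling the steps together, a hedged end-to-end sketch is shown below; all object names are hypothetical, and the NY.funding and HQ.funding connections follow the figures above:

-- at the source database (NY.funding): the request procedure, optionally performing a "local" change
create procedure ny_change_address
    @cust_id  int,
    @new_addr varchar(80)
as
begin
    update customer
       set addr_status = 'pending'
     where cust_id = @cust_id
end
go
exec sp_setrepproc 'ny_change_address', 'function'
go

-- at the PRS: note that "with primary at" names the recipient (HQ.funding), not the source (NY.funding),
-- and that no subscription is created
create function replication definition ny_change_address
with primary at HQ.funding
deliver as 'hq_change_address'
(@cust_id int, @new_addr varchar(80))
searchable parameters (@cust_id)
go

-- at the recipient (HQ.funding): hq_change_address must exist, and the NY login that executes
-- ny_change_address must exist with the same password and have execute permission on it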

Replication Agent Processing

Essentially, there is nothing unique about Replication Agent processing for request functions. As with any stored procedure execution, when a request function procedure is executed, an implicit transaction is begun. While described in general terms in the LTL table located in the Replication Agent section much earlier, the full LTL syntax for “begin transaction” is:

distribute begin transaction 'tran name' for 'username'/encrypted_password

Consequently, the username and encrypted password are packaged into the LTL for the Replication Server. The reason for this is as you probably guessed – the fact that the Replication Server executes the request function at the destination as the user who executed it at the primary (more on this in the next section). As a result, Replication Agent processing for request functions is identical to the processing for an applied function.

Replication Server Processing

Since the source database processing is identical to applied functions, it is within the Replication Server that all of the magic for request functions happens. This happens in two specific areas – the inbound processing and the DSI processing.


Inbound Processing

As discussed earlier, within the inbound processing of the Replication Server, not much happens as far as row evaluation until the DIST thread. Normally, this involves matching replicated rows with replication definitions, normalizing the columns and checking for subscriptions. In addition, for stored procedure replication definitions, this process also involves determining if the procedure is an applied or a request function. Remember: the name of a replication definition for a procedure is the same as the procedure name, and due to the unique naming constraint for replication definitions, there will only be one replication definition with the same name as the procedure. Consequently, determining whether the procedure is a request function is easily achieved simply by checking to see if the primary database for the replication definition is the same as the current source connection (i.e. the connection to which the SQM belongs). If not, then the procedure is a request function. Following the SQM, the DIST/SRE fails to find a subscription and simply needs to read the "primary at" clause to determine the "primary" database that is intended to receive the request function. The DIST/SRE then writes the request function to the outbound queue, marking it as a request function.

DSI Processing

Within the outbound queue processing of a request function, the only difference is in the DSI processing. When a request function is processed by a DSI, the following occurs:

• The DSI-S stops batching commands and submits all commands up to the request function.

• The DSI-E disconnects from the replicate dataserver and reconnects as the username and password from the request function transaction record.

• The DSI-E executes the request function. If more than one request function has been executed in a row by the same user, all are executed individually.

• The DSI-E disconnects from the replicate and reconnects as either the maintenance user or a different user. The latter is applicable when back-to-back request functions are executed by different users at the primary.

Once the request function(s) have been delivered, the DSI resumes “normal” processing of transactions as the maintenance user until the next request function is encountered.

Recipient Database Processing

The second difference in request function processing takes place at the replicate database. If you remember from our earlier discussion, the Replication Agent filters log records based on the maintenance user name returned from the LTL “get maintenance user” command. Since the DSI applies the request function by logging in as the same user at the primary, then any modification performed by the request function execution is eligible for replication back out of the recipient database. If the procedure listed in the “deliver as” clause of the request function replication definition is itself marked for replication, then the procedure invoked by the request function will be replicated as an applied function. If not, then any individual DML statements on tables marked for replication and/or sub-procedures marked for replication will be replicated as normal. A couple of points for consideration:

• The destination of the modifications replicated out of the recipient is not limited to the site that originally made the request function call. Since at this point normal replication processing is in effect, normal subscription resolution specifies which sites receive the modifications made by the request function.

• The “deliver as” procedure itself (or a sub-procedure) could be a request function in which case the request is “forwarded up the chain” while the original request function serves as “notification” to the immediate supervisory site that the subordinate is making a request.

Key Concept #35: An Asynchronous Request Function will be executed at the recipient by the same user/password combination as the procedure was executed by at the originating site. Because it is not executed by the maintenance user, changes made by the request function are then eligible for replication.
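To make the mechanics more concrete, a minimal sketch of what a request function replication definition might look like is shown below. The server, database, procedure and parameter names are hypothetical, and the exact clause layout can vary slightly by Replication Server version; the key point is that the “with primary at” clause names the database intended to execute the request, not the database where the procedure was originally invoked.

-- hypothetical request function replication definition; the procedure named
-- upd_credit_limit_req is executed at a remote (field) site, but the repdef's
-- "primary" is the corporate database that is intended to execute the request
create function replication definition upd_credit_limit_req
with primary at CORP_DS.corp_db          -- recipient that will execute the request
deliver as 'upd_credit_limit'            -- procedure actually invoked at CORP_DS.corp_db
(@cust_id int, @new_limit money)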

Performance Implications

By now, you have begun to realize some of the power – and possibilities – of request functions. However, they do have a downside – they degrade replication performance. Consider the following:

• Replication command batching/transaction grouping is effectively terminated when a request function is encountered (largely due to the reconnection issue).


• Replication Server must first disconnect/reconnect as the request function user, establish the database context, execute the procedure, and then disconnect/reconnect as the maintenance user. Ignoring the procedure execution times, the two disconnect/reconnects could consume a considerable portion of time when a large number of request functions are involved.

• In the typical implementation, the request functions at the originator are often empty, while at the recipient there is a sequence of code. Consequently, at the originator, transactions that follow the request function appear to execute immediately. However, at the recipient, they will be delayed until the request function completes execution.

Normally the latter is not much of an issue, but some customers have attempted to use request functions as a means of implementing “replication on demand,” in which a replicate periodically executes a request function that, at the primary, flips a “replicate_now” bit (or something similar). If the number of rows affected is very large, then this procedure’s execution could take significantly longer than expected.

In summary, request functions will impede replication performance by “interrupting” the efficient delivery of transactions. Obviously, the degree to which performance is degraded will depend on the number and frequency of the request functions. This should not deter Replication System Administrators from using request functions, however, as they provide a very neat solution to common business problems.


Multiple DSI’s

Multiple DSI or Parallel DSI – which is which or are they the same??? The answer to this question takes a bit of history. Prior to version 11.0, Parallel DSI’s were not available in Replication Server. However, many customers were already hitting the limit of Replication Server capabilities due to the single DSI thread. Accordingly, several different methods of implementing multiple DSI’s to the same connection were developed and implemented so widely that it was even taught in Sybase’s “Advanced Application Design Using Replication Server” (MGT-700) course by late 1995 and early 1996.

This does not mean the two methods are similar, as there is one very key difference between them. Parallel DSI’s guarantee that transactions at the replicate will be applied in the same order as at the primary. Multiple DSI’s do not – in fact, they exploit this to achieve higher throughput.

WARNING: Because the safeguards ensuring commit order are deliberately bypassed, Multiple DSI’s are not fully supported by Sybase Technical Support. If you experience product bugs such as stack traces, dropped LTL, etc., then Sybase Technical Support will be able to assist. However, if you experience data loss or inconsistency then Sybase Technical Support will not be able to assist in troubleshooting.

Concepts & Terminology

Okay, if you’ve read this far, then the above warning didn’t deter you. Before discussing Multiple DSI’s, however, a bit of terminology needs to be established so that we each understand what is meant. Throughout the rest of this section, the following definitions are used for the following terms:

Parallel DSI – The internal implementation present in the Replication Server product that uses more than one DSI thread to apply replicated transactions. Transaction commit order is still guaranteed, regardless of the number of threads or the serialization method chosen.

Multiple DSI – A custom implementation in which multiple physical connections are created to the same database, in effect implementing more than one DSI thread. Transaction commit order is not guaranteed and must be controlled by design.

Serialized Transactions – Transactions that must be applied in the same order to guarantee the same database result and business integrity. For example, a deposit followed by a withdrawal. Applying these in the opposite order may not yield the same database result, as the withdrawal will probably be rejected due to a lack of sufficient funds.

Commit Consistent – Transactions that, applied in any order, will always yield the same results. For example, transactions at different Point-Of-Sale (POS) checkout counters, or transactions originating from different field locations viewed from the corporate rollup perspective.

Key Concept #36: If using the Multiple DSI approach, you must ensure that your transactions are “commit consistent” or employ your own synchronization mechanism to enforce proper serialization when necessary.

Performance Benefits

Needless to say, Multiple DSI’s can achieve several orders of magnitude higher throughput than Parallel DSI’s. One customer processing credit card transactions reported achieving 10,000,000 transactions per hour. If you think this is unrealistic, in late 1995 a U.S. Government monitored test demonstrated a single Replication Server (version 10.5) replicating 4,000,000 transactions per 24 hour period to three destinations – each transaction a stored procedure with typical embedded selects and averaging 10 write operations (40,000,000 write operations total) against SQL Server 10.0 with only 5 DSI’s. That’s a total of 12,000,000 replicated procedures and 120,000,000 write operations processed by a single RS in a single day against a database engine with known performance problems!!! So 10,000,000 an hour with RS 11.x could be believable. Such exuberance, however, needs to be tempered with the cold reality that in order to achieve this performance, a number of design changes had to be made to facilitate the parallelism, and extensive application testing had to be done to ensure commit consistency. It cannot be overstated – Multiple DSI’s can be a lot of work – you have to do the thinking that Replication Server Engineering has done for you with Parallel DSI’s.


In order to best understand the performance benefits of Multiple DSI’s over Parallel DSI’s, you need to look at each of the bottlenecks that exist with Parallel DSI’s and see how Multiple DSI’s overcome them. While these will be discussed in greater detail later, the performance benefits of Multiple DSI’s stem from the following:

No Commit Order Enforcement – by itself, this is the source of the biggest performance boost as transactions in the outbound queue are not delayed due to long running transactions (i.e. remember the 4 hour procedure execution example) or just simply waiting for their “turn” to commit.

Not Limited to a Single Replication Server – The Multiple DSI approach lends itself extremely well to involving multiple Replication Servers in the process – achieving an MP configuration not currently available within the product itself.

Independent of Failures – If a transaction fails with Parallel DSI, activity halts – even if the transactions that follow it have no dependence on the transaction that failed (i.e. corporate rollups). As a consequence, Multiple DSI’s prevent large backlogs in the outbound queue reducing recovery time from transaction failures.

Cross-Domain Replication – Parallel DSI’s are limited to replicating to destinations within the same Replication domain as the primary. Multiple DSI’s have no such restriction and in fact, extend easily to support large-scale cross-domain replication architectures (different topic outside scope of this paper).

Implementation

While the Sybase Education course MGT-700 taught at least three methods for implementing Multiple DSI’s, including altering the system function strings, the method discussed in this section will focus on that of using multiple maintenance users. The reason for this is the ease and speed of setup and the least impact on existing function definitions (i.e. you don’t end up creating a new function class). Implementing Multiple DSI’s is a sequence of steps:

1. Implementing multiple physical connections

2. Ensuring recoverability and preventing loss

3. Defining and implementing parallelism controls

Implementing multiple physical connections

The multiple DSI approach uses independent DSI connections for delivery. Due to the unique index on the rs_databases table in the RSSD, the only way to accomplish this is to fool the Replication Server into thinking it is actually connecting to multiple databases instead of one. Fortunately, this is easy to do. Since Replication Server doesn’t check the name of the server it connects to, all we need to do is “alias” the real dataserver in the Replication Server’s interfaces file. For example, let’s assume we have an interfaces file similar to the following (Solaris):

CORP_FINANCES
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000

Based on our initial design specifications, we decide we need a total of 6 Multiple DSI connections. Given that the first one counts as one, we simply need to alias it five additional times.

CORP_FINANCES
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000

CORP_FINANCES_A
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000

CORP_FINANCES_B
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000

CORP_FINANCES_C
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000

CORP_FINANCES_D
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000

CORP_FINANCES_E
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000


Once this is complete, the Multiple DSI’s can simply be created by creating normal replication connections to CORP_FINANCES.finance_db, CORP_FINANCES_A.finance_db, CORP_FINANCES_B.finance_db, etc. However, before we do this, there is some additional work we will need to do to ensure recoverability (discussed in the next section).
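For orientation, a minimal sketch of what the eventual connection creation might look like appears below (the full procedure is covered under “Detailed Instructions for Creating Connections”); the database and maintenance user names are hypothetical, and each alias gets its own maintenance user so that recoverability can later be handled per connection:

-- one connection per interfaces alias, each with its own maintenance user
create connection to CORP_FINANCES_A.finance_db
    set error class rs_sqlserver_error_class
    set function string class rs_sqlserver_function_class
    set username maint_user_a
    set password maint_user_a_pwd
go
-- repeat for CORP_FINANCES_B through CORP_FINANCES_E, changing only the
-- alias and the maintenance user name/password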

To get a clearer picture of what this accomplishes: as mentioned, Replication Server now thinks it is replicating to n different replicate databases instead of one. Because of this, it creates separate outbound queues and DSI threads to process each connection. The difference between this and Parallel DSI’s is illustrated in the following diagrams.


Figure 94 – Normal Parallel DSI with Single Outbound Queue & DSI threads


Figure 95 – Multiple DSI with Independent Outbound Queues & DSI threads

In the above drawings, only a single replication server was demonstrated. However, in Multiple DSI’s each of the connections could be from a different replication server. Consider the following – the first being the more normal multiple replication server implementation using routing to a single replication server, while the second demonstrates Multiple DSI’s - one from each Replication Server.



Figure 96 – Multiple Replication Server Implementation without Multiple DSI’s

While the RRS could use Parallel DSI’s, as we have already discussed, long transactions or other issues could degrade performance. In addition, only a single RSI thread is available between the two Replication Servers involved in the routing. While this is normally sufficient, if a large number of large transactions or text replication is involved, it may also be a bottleneck. Additionally, this has an inherent fault in that if any one of the transactions from any of the source sites fails, all of the sites stop replicating until the transaction is fixed and the DSI is resumed.

In contrast, consider a possible Multiple DSI implementation:


Figure 97 – Multiple Replication Server Implementation Using Multiple DSI’s

In this case, each RS could still use Parallel DSI’s to overcome performance issues within each and in addition, since they are independent, a failure of one does not cause the others to backlog.

A slight twist of the latter ends up with a picture that demonstrates the ability of Multiple DSI’s to provide a multi-processor (MP) implementation.



Figure 98 – MP Replication Achieved via Multiple DSI’s

Note that the above architecture really only helps outbound processing performance. All subscription resolution, replication definition normalization, etc. is still performed by the single replication server servicing the inbound queue. However, for systems with high queue writes, extensive function string utilization or other requirements demonstrating a bottleneck in the outbound processing, the MP approach may be viable.

Ensuring Recoverability and Preventing Loss

While the multiple independent connections do provide a lot more flexibility and performance, they do present a problem – recoverability. The problem is simply this: with a single rs_lastcommit table and commit order guaranteed, Parallel DSI’s are assured of restarting from that point without incurring any lost or duplicate transactions. With Multiple DSI’s, the same is not true. Simply because the last record in the rs_lastcommit table refers to transaction id 101 does not mean that transaction 100 was applied successfully – or that 102 has not already been applied. Consider the following picture:

[Figure content: four outbound queues – DS2_a.my_db (tran oqids 31, 35, 39, 43 …), DS2_b.my_db (32, 36, 40, 44 …), DS2_c.my_db (33, 37, 41, 45 …), DS2_d.my_db (34, 38, 42, 46 …) – all sharing a single rs_lastcommit whose last row records tran oqid 41. Plausible scenarios: 1 – c committed after a, b & d (long xactn); 2 – a, b, d suspended first; 3 – a, b, d rolled back due to deadlocks.]

Figure 99 – Multiple DSI’s with Single rs_lastcommit Table

Consider the three scenarios proposed above. In each of the three, you would have no certainty that tran OQID 42 should be next. As a result, it is critical that each Multiple DSI has its own independent set of rs_lastcommit and rs_threads tables as well as the associated procedures (rs_update_lastcommit).

Unfortunately, a DSI connection does not identify itself, consequently there are only three choices available:

1. Use a separate function class for each DSI. Within the class, call altered definitions of rs_update_lastcommit to provide distinguishable identity. For example, add a parameter that is hard-coded to the DSI connection (i.e. “A”), or call a variant of the procedure such as rs_update_lastcommit_A.

2. Exploit the ASE permission chain and use separate maintenance users for each DSI. Then create separate rs_lastcommit, etc. owned by each specific maintenance user.


3. Multiple maintenance users with changes to the rs_lastcommit table to accommodate connection information and corresponding logic added to rs_update_lastcommit to set column value based on username.

While the first one is obvious – and obviously a lot of work, as maintaining function strings for individual objects could become a burden – the second takes a bit of explanation. The third one is definitely an option and is perhaps the easiest to implement. The problem is that with high volume replication, the single rs_lastcommit table could easily become a source of contention. In addition to rs_lastcommit, a column would have to be added to rs_threads, as it has no distinguishable value either – along with changes to the procedures which manipulate these tables (rs_update_lastcommit, rs_get_thread_seq, etc.). However, it does have the advantage of being able to handle identity columns and other maintenance user actions requiring “dbo” permissions. While separate maintenance user logins are in fact used, each is aliased as dbo within the database. The modifications to the rs_lastcommit and rs_threads tables (and their corresponding procedures such as rs_update_lastcommit, rs_get_lastcommit, etc.) would be to add a login name column. Since this is system information available through the suser_name() function, the procedure modifications would simply involve adding suser_name() to the where clause. For example, the original rs_lastcommit table, rs_get_lastcommit and rs_update_lastcommit are as follows:

/* Drop the table, if it exists. */
if exists (select name from sysobjects
           where name = 'rs_lastcommit' and type = 'U')
begin
    drop table rs_lastcommit
end
go

/*
** Create the table.
** We pad each row to be greater than a half page but less than one page
** to avoid lock contention.
*/
create table rs_lastcommit (
    origin              int,
    origin_qid          binary(36),
    secondary_qid       binary(36),
    origin_time         datetime,
    dest_commit_time    datetime,
    pad1 binary(255),
    pad2 binary(255),
    pad3 binary(255),
    pad4 binary(255),
    pad5 binary(4),
    pad6 binary(4),
    pad7 binary(4),
    pad8 binary(4)
)
go

create unique clustered index rs_lastcommit_idx on rs_lastcommit(origin)
go

/* Drop the procedure to update the table. */
if exists (select name from sysobjects
           where name = 'rs_update_lastcommit' and type = 'P')
begin
    drop procedure rs_update_lastcommit
end
go

/* Create the procedure to update the table. */
create procedure rs_update_lastcommit
    @origin         int,
    @origin_qid     binary(36),
    @secondary_qid  binary(36),
    @origin_time    datetime
as
    update rs_lastcommit
        set origin_qid       = @origin_qid,
            secondary_qid    = @secondary_qid,
            origin_time      = @origin_time,
            dest_commit_time = getdate()
        where origin = @origin
    if (@@rowcount = 0)
    begin
        insert rs_lastcommit (origin, origin_qid, secondary_qid, origin_time,
            dest_commit_time, pad1, pad2, pad3, pad4, pad5, pad6, pad7, pad8)
        values (@origin, @origin_qid, @secondary_qid, @origin_time, getdate(),
            0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
    end
go

/* Drop the procedure to get the last commit. */
if exists (select name from sysobjects
           where name = 'rs_get_lastcommit' and type = 'P')
begin
    drop procedure rs_get_lastcommit
end
go

/* Create the procedure to get the last commit for all origins. */
create procedure rs_get_lastcommit
as
    select origin, origin_qid, secondary_qid from rs_lastcommit
go

Note that the last procedure, rs_get_lastcommit, normally retrieves all of the rows in the rs_lastcommit table. The reason for this is that the oqid is unique to the source system – but if there are multiple sources, as can occur in a corporate rollup scenario, there may be duplicate OQID’s. Consequently, the oqid and the database origin id (from RSSD..rs_databases) are stored together. During recovery, as each transaction is played back, the oqid and origin are used to determine if the row is a duplicate.

If using the multiple login/altered rs_lastcommit approach, you simply need to add a where clause to each of the above procedures and extend the primary key/index constraints. For rs_lastcommit, this becomes (modifications noted in the comments):

/* Drop the table, if it exists. */
if exists (select name from sysobjects
           where name = 'rs_lastcommit' and type = 'U')
begin
    drop table rs_lastcommit
end
go

/*
** Create the table.
** We pad each row to be greater than a half page but less than one page
** to avoid lock contention.
*/
-- modify the table to add the maintenance user column
create table rs_lastcommit (
    maint_user          varchar(30),
    origin              int,
    origin_qid          binary(36),
    secondary_qid       binary(36),
    origin_time         datetime,
    dest_commit_time    datetime,
    pad1 binary(255),
    pad2 binary(255),
    pad3 binary(255),
    pad4 binary(255),
    pad5 binary(4),
    pad6 binary(4),
    pad7 binary(4),
    pad8 binary(4)
)
go

-- modify the unique index to include the maintenance user
create unique clustered index rs_lastcommit_idx on rs_lastcommit(maint_user, origin)
go

/* Drop the procedure to update the table. */
if exists (select name from sysobjects
           where name = 'rs_update_lastcommit' and type = 'P')
begin
    drop procedure rs_update_lastcommit
end
go

/* Create the procedure to update the table. */
create procedure rs_update_lastcommit
    @origin         int,
    @origin_qid     binary(36),
    @secondary_qid  binary(36),
    @origin_time    datetime
as
    -- add maint_user qualification to the where clause
    update rs_lastcommit
        set origin_qid       = @origin_qid,
            secondary_qid    = @secondary_qid,
            origin_time      = @origin_time,
            dest_commit_time = getdate()
        where origin = @origin
          and maint_user = suser_name()
    if (@@rowcount = 0)
    begin
        -- add the maintenance user login to the insert statement
        insert rs_lastcommit (maint_user, origin, origin_qid, secondary_qid, origin_time,
            dest_commit_time, pad1, pad2, pad3, pad4, pad5, pad6, pad7, pad8)
        values (suser_name(), @origin, @origin_qid, @secondary_qid, @origin_time, getdate(),
            0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
    end
go

/* Drop the procedure to get the last commit. */
if exists (select name from sysobjects
           where name = 'rs_get_lastcommit' and type = 'P')
begin
    drop procedure rs_get_lastcommit
end
go

/* Create the procedure to get the last commit for all origins. */
create procedure rs_get_lastcommit
as
    -- add the maint_user to the (previously nonexistent) where clause
    select origin, origin_qid, secondary_qid
        from rs_lastcommit
        where maint_user = suser_name()
go

Similar changes will need to be made to the rs_threads table and associated procedure calls as well. It is important to avoid changing the procedure parameters. Fortunately, all retrieval and write operations against the rs_lastcommit table are performed through stored procedure calls (similar to an API of sorts). By not changing the procedure parameters, and because all operations occur through the procedures, we do not need to make any changes to the function strings (reducing maintenance considerably). Why this is necessary at all is discussed in the section describing the Multiple DSI/Multiple User implementation.

Note that at the same time, we could alter the table definition to accommodate max_rows_per_page or datarows locking and eliminate the row padding (thereby reducing the amount of data logged in the transaction log for rs_lastcommit updates). However, other than the reduction in transaction log activity, this will gain little in the way of performance. It is a useful technique to remember, though, as ASE 12.5 supports larger page sizes (i.e. 16KB vs. 2KB), which invalidates the normal rs_lastcommit padding. So if implementing RS 12.1 or earlier on ASE 12.5 you may need to modify these tables anyhow.
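As a rough illustration of that idea, a minimal sketch of a padding-free rs_lastcommit using datarows locking follows. This is an assumption on our part, not the shipped script; the pad columns exist solely to avoid lock contention on allpages-locked tables, so dropping them is only sensible when row-level locking is used.

-- hedged sketch: rs_lastcommit without pad columns, relying on
-- datarows locking (instead of row padding) to avoid contention
create table rs_lastcommit (
    origin              int,
    origin_qid          binary(36),
    secondary_qid       binary(36),
    origin_time         datetime,
    dest_commit_time    datetime
) lock datarows
go
create unique clustered index rs_lastcommit_idx on rs_lastcommit(origin)
go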

While the third alternative is useful for handling identity columns and simple to implement, the second alternative above may provide slightly greater performance by eliminating any contention on the rs_lastcommit table. By using separate maintenance users, you can exploit the way ASE does object resolution and permission checking. It is a little known (but documented) fact that when you execute a SQL statement in which the object’s ownership is not qualified, ASE will first look for an object of that name owned by the current user (as defined in sysusers). If one is not found, it then searches for one owned by the database owner – dbo. So if “fred” is a user in the database and there are two tables – 1) fred.authors and 2) dbo.authors – and fred issues “select * from pubs2..authors”, authors will be resolved to fred.authors. On the other hand, if mary issues “select * from pubs2..authors”, since no mary.authors exists, authors will be resolved to dbo.authors. Consequently, by using separate maintenance users and individually owned rs_lastcommit, etc. tables, we have the following:



Figure 100 – Multiple Maintenance Users with Individual rs_lastcommits
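Viewed at the SQL level, the ownership-chain resolution that makes this work can be sketched as follows (the fred/mary logins come from the earlier example; the maintenance user name is hypothetical):

-- as user fred (who owns fred.authors): resolves to fred.authors
select * from pubs2..authors

-- as user mary (who owns no authors table): resolves to dbo.authors
select * from pubs2..authors

-- the same rule applied to Multiple DSI: when maint_user_a executes the
-- unqualified "update rs_lastcommit ...", ASE resolves it to
-- maint_user_a.rs_lastcommit, so each connection transparently maintains
-- its own recovery table
update rs_lastcommit set dest_commit_time = getdate() where origin = 101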

This then addresses the problems in the scenario we discussed earlier and changes the situation to the following:

[Figure content: the same four queues (DS2_a–DS2_d, tran oqids 31–46) and the same plausible scenarios as Figure 99, but now each connection has its own rs_lastcommit – DS2_a.rs_lastcommit records oqid 39, DS2_b.rs_lastcommit oqid 44, DS2_c.rs_lastcommit oqid 41 and DS2_d.rs_lastcommit oqid 34 – so each DSI knows exactly where it left off.]

Figure 101 – Multiple DSI’s with Multiple rs_lastcommit tables

Now, no matter what the problem, each of the DSI’s recovers to the point where it left off.

Key Concept #37: The Multiple DSI approach uses independent DSI connections set up via aliasing the target dataserver.database. However, this leads to a potential recoverability issue with RS system tables that must be handled to prevent data loss or duplicate transactions.

Detailed Instructions for Creating Connections

Now that we know what we need to do to implement the multiple DSI’s and how to ensure recoverability, the next stage is to determine exactly how to achieve it. Basically, it comes down to a modified rs_init approach or performing the steps manually (as may be required for heterogeneous or OpenServer replication support). Each of the below requires the developer to first create the aliases in the interfaces file.

Manual Multiple DSI Creation

Despite how it sounds, the manual method is fairly easy, but it does require a bit more knowledge about Replication Server. The steps are:

1. Add the maintenance user logins (sp_addlogin). Create as many as you expect to have Multiple DSI’s plus a few extra.

2. Grant the maintenance user logins replication_role. Do not give them sa_role. If you do, then in any database the maintenance user will map to the “dbo” user rather than the desired maintenance user – consequently reintroducing the rs_lastcommit problem.

3. Add the maintenance users to the replicated database. If identity values are used, one may have to be aliased to “dbo”. If following the first implementation (modifying rs_lastcommit), all may be aliased to dbo.

4. Grant all permissions on tables/procedures to replication_role. While you could grant permissions to individual maintenance users, by granting permissions to the role you reduce the work necessary to add additional DSI connections later (a sketch of steps 1–4 follows this list).

5. Make a copy of $SYBASE/$SYBASE_RS/scripts/rs_install_primary. Alter the copy to include the first maintenance user as owner of all the objects. Use isql to load the script into the replicate database. Repeat for each maintenance user.

6. Create connections from Replication Server to the replicate database. If the database will also be a primary database and data is being replicated back out, pick one of the maintenance users to be the “maintenance user” and specify the log transfer option

create connection to data_server.database
    set error class [to] rs_sqlserver_error_class
    set function string class [to] rs_sqlserver_function_class
    set username [to] maint_user_name
    [set password [to] maint_user_password]
    [set database_param [to] 'value']
    [set security_param [to] 'value']
    [with {log transfer on, dsi_suspended}]
    [as active for logical_ds.logical_db |
     as standby for logical_ds.logical_db [use dump marker]]

7. If the replicate is also a primary, add the maintenance user to the Replication Server (create user) and grant that maintenance user connect source permission in the Replication Server. For all other maintenance users, alter the connection and set replication off (if desired).

8. Configure the Replication Agent as desired.
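As referenced in step 4, a hedged sketch of what steps 1–4 might look like for one maintenance user follows (the login, database and object names are hypothetical):

-- step 1: add the maintenance user login
exec sp_addlogin maint_user_a, 'maint_user_a_pwd'
go
-- step 2: grant replication_role (but not sa_role)
exec sp_role 'grant', replication_role, maint_user_a
go
-- step 3: add the maintenance user to the replicated database
use finance_db
go
exec sp_adduser maint_user_a
go
-- step 4: grant object permissions once, to the role, rather than per user
grant all on orders to replication_role
grant execute on upd_order_status to replication_role
go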

Modified rs_init Method

The modified rs_init method is the easiest and ensures that all steps are completed (none are accidentally forgotten). It is very similar to the above in results, but with fewer manual steps.

1. Make a copy of $SYBASE/$SYBASE_RS/scripts/rs_install_primary (save it as rs_install_primary_orig). Alter the rs_install_primary to include the first maintenance user as owner of all the objects.

2. Run rs_init for replicate database. Specify the first maintenance user. Repeat steps 1-2 until all maintenance users created. If using the modified rs_lastcommit approach, you can simply repeat step 2 until done.

3. If identity values are used, one may have to be aliased to “dbo” (drop the user and add an alias).

4. (Same as above.) Grant all permissions on tables/procedures to replication_role. While you could grant permissions to individual maintenance users, by granting permissions to the role, you reduce the work necessary to add additional DSI connections later.

5. Use sp_config_rep_agent to specify the desired maintenance user name and password for the Replication Agent. Note that all the maintenance users have probably been created as Replication Server users. This is not a problem, but it can be cleaned up if desired.

6. Rename the rs_install_primary script to a name such as rs_install_primary_mdsi. Rename the original back to rs_install_primary. This will prevent problems for future replication installations not involving multiple DSI’s.


Single rs_lastcommit with Multiple Maintenance Users

If, for maintenance or other reasons, you opt not to have multiple rs_lastcommit tables and instead wish to use a single table, you will have to do the following (note this is a variance to either of the above, so replace those instructions as appropriate):

1. Make a copy of rs_install_primary. Depending on manual or rs_init method, edit the appropriate file and make the following changes:

a. Add a column for the maintenance user suid() or suser_name() to all tables and to the procedure logic. This includes adding the column to tables such as rs_threads, which otherwise has no distinguishing value. The procedure logic should select suid() or suser_name() to supply the column value.

b. Adjust all unique indexes to include the suid() or suser_name() column.

2. Load the script according to the applicable manual or rs_init instructions above.

Single rs_lastcommit with Single Maintenance User

This method employs the use of function string modifications and really is only necessary if the developers really want job security due to maintaining function strings. The steps are basically:

1. Make a copy of rs_install_primary and save it as rs_install_primary_orig. Modify the original as follows:

a. Add a column for the DSI to each table as well as a parameter to each procedure. This includes tables such as rs_threads and rs_lastcommit and their associated procedures.

b. Adjust all unique indexes to include the DSI column.

2. Load the script using rs_init as normal. This will create the first connection.

3. Create a function string class for the first DSI (inherited from the default class). Modify the system functions for rs_get_thread_seq, rs_update_lastcommit, etc. to specify the DSI. Repeat for each DSI.

4. Alter the first connection to use the first DSI’s function string class.

5. Create multiple connections from Replication Server to the replicate database for the remaining DSI’s using the create connection command. Specify the appropriate function string class for each.

6. Rename the rs_install_primary script to a name such as rs_install_primary_mdsi. Rename the original back to rs_install_primary. This will prevent problems for future replication installations not involving multiple DSI’s.

7. Monitor replication definition changes during the lifecycle. Manually adjust function strings if inheritance does not provide appropriate support.

Defining and Implementing Parallelism Controls

The biggest challenge with Multiple DSI’s is to design and implement the parallelism controls in such a way that database consistency is not compromised. The main mechanism for implementing parallelism is through the use of subscriptions, and in particular the subscription where clause. Each aliased database connection (Multiple DSI) subscribes to different data – either at the object level or through the where clause. As a result, two transactions executed at the primary might be subscribed to by different connections and therefore have a different order of execution at the replicate than they had at the primary. The following rules MUST be followed to ensure database consistency:

1. Parallel transactions must be commit consistent.

2. Serial transactions must use the same DSI connection.

3. If rules 1 and 2 cannot be met, you must implement your own synchronization point to enforce serialization.

Parallel Subscription Mechanism.

In many cases, this is not as difficult to achieve as you would think. The key, however, is to make sure that the where clause operations for any one connection are mutually exclusive from every other connection. This can be done via a variety of mechanisms, but is usually determined by two aspects: 1) the number of source systems involved; and 2) the business transaction model.

Single Primary Source

In some cases, a single primary source database provides the bulk of the transactions to the replicate. As a result, it is the transactions from this source database that must be processed in parallel using the Multiple DSI’s. In this situation, each of the Multiple DSI’s subscribes to different transactions or different data through one of the following mechanisms:


Data Grouping – In this scenario, different DSI’s subscribe to a different subset of tables. This is most useful when a single database is used to process several different types of transactions. The transactions affect a certain small number of tables unique to that data. An example of this might be a consolidated database in which multiple stations in a business flow all access the same database. For example, a hospital’s outpatient system may have a separate appointment scheduling/check-in desk, triage treatment, lab tests and results, pharmacy, etc. If each “group” of tables that support these functions are subscribed to by different DSI’s, they will be applied in parallel at the replicate.

Data Partitioning – In this scenario, different DSI’s subscribe to different sets of data from the same tables, typically via a range or discrete list. An example of the former may be that a DSI may subscribe to A-E or account numbers 10000-20000. An example of a discrete list might be similar to a bank in which one DSI subscribes to checking accounts, the other credit card transactions, etc.

User/Process Partitioning – In this scenario, different DSI’s subscribe to data modified by different users. This is most useful in situations where individual users’ transactions need to be serialized, but are independent of each other. Probably one of the more frequently implemented, this includes situations such as retail POS terminals, banking applications, etc.

Transaction Partitioning – In this scenario, different DSI’s subscribe to different transactions. Typically implemented in situations involving a lot of procedure-based replication, this allows long batch processes (i.e. interest calculations) to execute independent of other batch processes without either “blocking” the other through the rs_threads issue.

The first two and last are fairly easy to implement and typically do not require modification to existing tables. However, the user/process partition might. If the database design incorporates an audit function to record the last user to modify a record and user logins are enforced, then such a column could readily be used as well.

However, in today’s architectures, frequently users are coming through a middleware tier (such as a web or app server) and are using a common login. As a result, a column may have to be added to the main transaction tables to hold the process id (spid) or similar value. In many cases, the spid itself could be hard to develop a range on as load imbalance and range division may be difficult to achieve. For example, a normal call center may start with only a few users at 7:00am, build to 700 concurrent users by 09:00am and then degrade slowly to a trickle from 4:00pm to 06:00pm. If you tried to divide the range of users evenly by spid, you would end up with some DSI’s not doing any work for a considerable period (4 hours) of the workday. On the other hand, the column could store the mod() of the spid (i.e. @@spid%10) – remembering that the result of mod(n) could be zero through n-1 (i.e. mod(2) yields 0 & 1 as remainders). Note that as of ASE 11.9, global variables are no longer allowed as input parameter defaults to stored procedures.
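As a hedged sketch of this idea (the table, column and connection names are hypothetical, and the alias names follow the earlier CORP_FINANCES example), the application could stamp each row with @@spid modulo the number of DSI connections, and each aliased connection would then subscribe to one remainder value:

-- at the primary: add a routing column and stamp it per transaction
alter table orders add dsi_slot tinyint null
go
-- the application (or a trigger) sets the value on insert/update, e.g.
update orders set dsi_slot = @@spid % 6 where order_num = @order_num

-- at the Replication Server: one subscription per aliased connection,
-- mutually exclusive on dsi_slot (dsi_slot must be a searchable column
-- in the hypothetical orders_repdef)
create subscription orders_sub_a
for orders_repdef
with replicate at CORP_FINANCES_A.finance_db
where dsi_slot = 0
go
-- repeat for CORP_FINANCES_B ... _E with dsi_slot = 1, 2, and so on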

Multiple Primary Sources

Multiple primary source system situations are extremely common to distributed businesses needing a corporate rollup model. Each of the regional offices would have its own dedicated DSI thread to apply transactions to the corporate database. As mentioned earlier, this has one very distinct advantage over normal replication in that an erroneous transaction from one does not stop replication from all the others by suspending the DSI connection. When multiple primary source systems are present, establishing parallel transactions is fairly easy due to the following:

No code/table modifications - Since each source database has its own dedicated DSI, from a replication standpoint, it resembles a straightforward 1:1 replication.

Guaranteed commit consistency - Transactions from one source system are guaranteed commit consistent with respect to all others. This is true even in cases of two-phased commit distributed transactions affecting several of the sources. Since in each case an independent Rep Agent, inbound queue processing and OQID’s are used for the individual components of a 2PC transaction, it would be impossible for even a single Replication Server to reconstruct the components into a single transaction for application at the replicate.

Parallel DSI support – While this doesn’t appear to add benefit if the multiple DSI’s are from a single source, in the case of multiple sources, it can help with large transactions (due to large transaction threads) and medium volume situations through tuning the serialization method (none vs. wait_for_commit), etc.

Handling Serialized Transactions

In single source systems, it is common that a small number of transactions still need to be serialized no matter which parallelism strategy you choose. For example, if a bank opts for partitioning by account number, probably 80-90% of the transactions are fine. However, the remaining 10-20% include transactions such as account transfers that need to be serialized. For example, when a typical customer transfers funds from a savings to a checking account, if the transaction is split across connections due to the account numbers, the replicate system may be inconsistent for a period of time. While this may not affect some business rules, if an accurate picture of fund balances is necessary, this could cause a problem similar to the typical isolation level 3/phantom read problems in normal databases. Consequently, after defining the parallelism strategy, a careful review of business transactions needs to be conducted to determine which ones need to be serialized.

Once determined, the handling of serialized transactions is pretty simple – simply call a replicated procedure with the parameters. While this may necessitate an application change to call the procedure vs. sending a SQL statement, the benefits in performance at the primary are well worth it. In addition, because it is a replicated procedure, the individual row modifications are not replicated – consequently, the Multiple DSI’s that subscribe to those accounts do not receive the change. Instead, another DSI reserved for serialized transactions (it may be more than one DSI – depending on design) subscribes to the procedure replication and delivers the proc to the replicate.
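A minimal sketch of that approach follows, assuming a hypothetical accounts table and transfer procedure. The procedure is marked for replication with sp_setreplicate; a function replication definition and a subscription on the dedicated “serialized” connection would complete the picture.

-- hypothetical serialized business transaction packaged as a procedure
create procedure xfer_funds
    @from_acct int,
    @to_acct   int,
    @amount    money
as
begin
    update accounts set balance = balance - @amount where acct_num = @from_acct
    update accounts set balance = balance + @amount where acct_num = @to_acct
end
go
-- mark the procedure for replication so the proc call (not the individual
-- row modifications) is what gets replicated
exec sp_setreplicate xfer_funds, 'true'
go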

The above is a true serialized transaction example. For the most part, serializing the transactions simply means ensuring that all the related ones are forced to use the same DSI. At that stage, the normal Replication Server commit order guarantee ensures that the transactions are serialized with respect to one another. The most common example is to have transactions executed by the same user serialized – or those impacting the same account serialized. For example, consider a hospital bill containing billable items for Anesthesia and X-ray. As long as the bill invoice number is part of the subscription and the itemization, then by subscribing by invoice, the transaction is guaranteed to arrive at the replicate as a complete bill – and within a single transaction.

However, there may not be a single or easily distinguishable set of attributes that can be easily subscribed to for ensuring transaction serialization within the same transaction. If such is the case, then the rs_id column becomes very useful. During processing, the primary database can simply assign an arbitrary transaction number (up to 2 billion before rollover) and store it in a column added similar to the user/spid mod() column described earlier. By using bitmask subscription, the load could be evenly balanced across the available Multiple DSI’s.

Serialization Synchronization Point

There may be times when it is impossible to use a single procedure call to replicate a transaction that requires serialization, and the normal parallel DSI serialization is counter to the transaction’s requirements. This normally occurs when a logical unit of work is split into multiple physical transactions – possibly even executed by several different users. A classic case – even without parallel DSI – is when the transaction involves a worktable in one database and then a transaction in another database (pending/approved workflow). As another example, a stored procedure call at the primary may generate a work table in one database using a select/into and then call a sub-procedure to further process and insert the rows. Of course, since the two transactions originate from two different databases, are read by two different Rep Agents, and are delivered by two different DSI connections, the normal transactional integrity of the transaction is inescapably lost. Similarly, even when user/process id is used for the parallelism strategy, Multiple DSI connections will wreak havoc on transactional integrity and serialization – simply because there is no way to guarantee that the transaction from one connection will always arrive after the other.

The answer is “Yes”. The question: “Is there a way to ensure transactions are serialized?” However, the technique is a bit reminiscent of rs_threads. If you remember, rs_threads imposes a modified “dead man’s latch” to control commit order. A similar mechanism could be constructed to do the same thing through the use of stored procedures or function string coding. The core logic would be:

Latch Create – Basically some way to ensure that the latch was clear to begin with. Unlike rs_threads, where the sequence is predictable, in this case it is not; consequently a new latch should be created for each serialized transaction.

Latch Wait – In this case, the second and successive transactions, if occurring ahead of the first transaction, need to sense that the first transaction has not taken place and wait.

Latch Set – As each successive transaction begins execution, the transaction needs to set and lock the latch.

Latch Block – Once the previous transactions have begun, the following transactions need to block on the latch so that as soon as the previous transactions commit, they can begin immediately.

Latch Release – When completed, each successive transaction needs to clear its lock on the latch. The last transaction should destroy the latch by deleting the row.

This is fairly simple for two connections, but what if 3 or more are involved? Even more complicated, what if several had a specific sequence for commit? For example, let’s consider the classic order entry system in which the following tables need to be updated in order: order_main, order_items, item_inventory, order_queue. Normally, of course, the best approach would be to simply invoke the parallelism based on the spid of the person entering the order. However, for some obscure reason, this site can’t do that – and wants to divide the parallelism along table lines. So, we would expect 4 DSI’s to be involved – one for each of the tables. The answer is we would need a latch table and procedures similar to the following at the replicate:

-- latch table
create table order_latch_table (
    order_number   int not null,
    latch_sequence int not null,
    constraint order_latch_PK primary key (order_number)
) lock datarows
go

-- procedure to set/initialize order latch
create procedure create_order_latch
    @order_number int,
    @thread_num   rs_id
as
begin
    insert into order_latch_table values (@order_number, 0)
    return (0)
end
go

-- procedure to wait, block and set latch
create procedure set_order_latch
    @order_number int,
    @thread_seq   int,
    @thread_num   rs_id
as
begin
    declare @cntrow int
    select @cntrow = 0

    -- make sure we are in a transaction so the block holds
    if @@trancount = 0
    begin
        rollback transaction
        raiserror 30000 "Procedure must be called from within a transaction"
        return (1)
    end

    -- wait until it is time to set the latch
    while @cntrow = 0
    begin
        waitfor delay "00:00:02"
        select @cntrow = count(*)
            from order_latch_table
            where order_number = @order_number
              and latch_sequence = @thread_seq - 1
            at isolation read uncommitted
    end

    -- block on the latch so follow-on execution begins immediately
    -- once the previous transaction commits
    update order_latch_table
        set latch_sequence = @thread_seq
        where order_number = @order_number

    -- the only way we got here is if the latch update worked --
    -- otherwise we would still be blocked on the previous update.
    -- In any case, that means we can exit this procedure and allow
    -- the application to perform the serialized update
    return (0)
end
go

-- procedure to clear order latch
create procedure destroy_order_latch
    @order_number int,
    @thread_num   rs_id
as
begin
    delete order_latch_table where order_number = @order_number
    return (0)
end
go

It is important to note that the procedure body above is for the replicate database. At the primary, the procedure will more than likely have no code in the procedure body as there is no need to perform serialization at the primary (transaction is already doing that). In addition, it is possible to combine the “create” and “set” procedures into a single procedure that would first create the latch if it did not already exist.
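For illustration, the primary-side counterparts might be nothing more than empty, replicated stubs; this is a hedged sketch, with the signatures kept identical to the replicate versions so the function replication definitions line up:

-- primary-side stub: no body logic, since the primary transaction itself
-- already provides the serialization
create procedure create_order_latch
    @order_number int,
    @thread_num   rs_id
as
    return (0)
go
exec sp_setreplicate create_order_latch, 'true'
go
-- set_order_latch and destroy_order_latch would be declared the same way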

The way this works is very simple - but does require the knowledge of which threads will be applying the transactions. For example, consider the following pseudo-code example:

Begin transaction
    Insert into tableA
    Update tableB
    Insert into tableC
    Insert into tableC
    Update tableB
Commit transaction

Now, assuming tables A-C will use DSI connections 1-3 and need to be applied in a particular order (i.e. A inserts a new financial transaction, while B updates the balance and C is the history table), the transaction at the primary could be changed to:

Begin transaction
    Exec SRV_create_order_latch @order_num, 1
    Insert into tableA
    Exec SRV_set_order_latch @order_num, 1, 2
    Update tableB
    Exec SRV_set_order_latch @order_num, 2, 3
    Insert into tableC
    Insert into tableC
    Exec SRV_set_order_latch @order_num, 3, 2
    Update tableB
    Exec SRV_destroy_order_latch @order_num, 1
Commit transaction

Note that the SRV prefix on the procedures in the above is to allow the procedure replication definition to be unique vs. other connections. The “deliver as” name would not be prefaced with the server extension. Also, note that the first “set latch” is sent using the second DSI. If you think about it, this makes sense as the first statement doesn’t have to wait for any order - it should proceed immediately. In addition, the procedure execution calls above could be placed in triggers, reducing the modifications to application logic - although this would require the trigger to set the latch for the next statement, changing the above to:

Begin transaction
    Insert into tableA
        Exec SRV_create_order_latch @order_num, 1
        Select @seq_num = latch_sequence from order_latch_table where order_number = @order_num
        Exec SRV_set_order_latch @order_num, @seq_num, 2
    Update tableB
        Select @seq_num = latch_sequence from order_latch_table where order_number = @order_num
        Exec SRV_set_order_latch @order_num, @seq_num, 3
    Insert into tableC
        Select @seq_num = latch_sequence from order_latch_table where order_number = @order_num
        Exec SRV_set_order_latch @order_num, @seq_num, 3
    Insert into tableC
        Select @seq_num = latch_sequence from order_latch_table where order_number = @order_num
        Exec SRV_set_order_latch @order_num, @seq_num, 2
    Update tableB
        Select @seq_num = latch_sequence from order_latch_table where order_number = @order_num
        Exec SRV_set_order_latch @order_num, @seq_num, 3
Commit transaction

In which the indented calls are initiated by the triggers on the previous operation. Note that the above also uses variables for passing the sequence. This is simply due to the fact that the trigger is generic and can’t tell what number of operations preceded it. As a result, the local version of the latch procedures would have to have some logic added to track the sequence number for the current order number and each “set latch” would have to simply add one to the number.

-- latch table
create table order_latch_table (
    order_number   int not null,
    latch_sequence int not null,
    constraint order_latch_PK primary key (order_number)
)
lock datarows
go

-- procedure to set/initialize order latch
create procedure SRV_create_order_latch
    @order_number int,
    @thread_num rs_id
as
begin
    insert into order_latch_table values (@order_number, 1)
    return (0)
end
go

-- procedure to wait, block, and set latch
create procedure SRV_set_order_latch
    @order_number int,
    @thread_seq int,
    @thread_num rs_id
as
begin
    update order_latch_table
       set latch_sequence = latch_sequence + 1
     where order_number = @order_number
end
go

-- procedure to clear order latch
create procedure SRV_destroy_order_latch
    @order_number int,
    @thread_num rs_id
as
begin
    delete order_latch_table
     where order_number = @order_number
    return (0)
end
go

However, you should also note that the destroy procedure never gets called - it would be impossible for a trigger to know when the transaction has ended. A modification to the replicate version of the rs_lastcommit procedure could perform the cleanup at the end of each batch of transactions.
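One possible approach - a sketch only, which assumes order_latch_table is given an additional last_update datetime column that the latch procedures refresh on each call, and that a 30-minute retention comfortably exceeds any in-flight transaction - is to append a purge to the replicate's rs_update_lastcommit procedure (the procedure that maintains the rs_lastcommit table):

-- sketch: purge latch rows that have been idle for 30 minutes
delete order_latch_table
 where last_update < dateadd(mi, -30, getdate())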

Design/Implementation Issues

In addition to requiring manual implementation for synchronization points, implementing multiple DSI’s has other design challenges.

Multiple DSI’s & Contention

Because Multiple DSI's mimic the Parallel DSI serialization method "none", they could experience considerable contention between the different connections. However, unlike Parallel DSI's, the retry after a deadlock is not the "kinder, gentler" approach of applying the offending transactions in serial and printing a warning. Instead, only the transaction that was rolled back is retried; and since the commit order (i.e. thread 2 vs. thread 1) is not known, the wrong victim may be rolled back and the transaction attempted again and again until the DSI suspends due to exceeding the retry limit. For example, in a 1995 case study using 5 Multiple DSI connections for a combined 200 tps rate, 30% of the transactions deadlocked at the replicate. Of course, in those days, the number of transactions per group was not controllable and attempts to use the byte size were rather cumbersome. In the final implementation, transaction grouping was simply disabled and the additional I/O cost of rs_lastcommit endured.

As a result, it is even more critical to tune the connections similar to the Parallel DSI/ dsi_serialization_method=none techniques discussed earlier. Namely:

• Set dsi_max_xacts_in_group to a low number (3 or 5)
• Use datapage or datarow locking on the replicate tables
• Change clustered indexes or partition the table to avoid last page contention
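For example (connection and table names are placeholders; the connection is suspended and resumed so the change takes effect):

suspend connection to RDS_A.rdb
go
alter connection to RDS_A.rdb set dsi_max_xacts_in_group to '3'
go
resume connection to RDS_A.rdb
go

In the replicate database itself, the hot tables can be switched to row-level locking with a command such as alter table order_history lock datarows.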

Identity Columns & Multiple DSI

As partially discussed before, identity columns can cause a problem with Multiple DSI's. If the parallelism strategy chosen is based on the table/table-subset strategy, then simply aliasing one of the DSI connections as "dbo" and ensuring that all transactions for that table use that DSI connection is a simple strategy. Parallel DSI's may also have to be implemented for that DSI connection as well.

However, if not - for example, with the more classic user/process strategy - the real solution is simply to define the column at the replicate as a "numeric" vs. an "identity". This should not pose a problem as the identity - with the exception of Warm Standby - does not have any valid context in any distributed system. Think about it. If not a Warm Standby, define the context of the identity!! It doesn't have any - and in fact, if identities are used at multiple sites - field sites replicating to a corporate rollup, for example - the identity would have to be combined with the site identifier (the source server name from rs_source_ds) to ensure that "duplicate" rows do not occur.
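A sketch of the idea (table and column names are hypothetical):

-- primary database: the value is generated locally
create table orders (
    order_id   numeric(10,0) identity,
    site_id    varchar(30)   not null,
    order_date datetime      not null
)
go

-- replicate database: same column as a plain numeric, so the replicated
-- value is stored as-is and combined with the originating site identifier
create table orders (
    order_id   numeric(10,0) not null,
    site_id    varchar(30)   not null,
    order_date datetime      not null,
    constraint orders_PK primary key (site_id, order_id)
)
go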

Multiple DSI’s & Shared Primary

Again, as we mentioned before, you need to consider the problem associated with Multiple DSI's if the replicate is also a primary database. Since the DSI connections use aliased user names, the normal Replication Agent processing for filtering transactions based on the maintenance user name will fail - consequently re-replicating data distributed through the Multiple DSI's. However, as mentioned, it is extremely simple to prevent this by configuring the connection parameter "dsi_replication" to "off".
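For example (the connection name is a placeholder; the connection is suspended and resumed so the change takes effect):

suspend connection to RDS_A.rdb
go
alter connection to RDS_A.rdb set dsi_replication to 'off'
go
resume connection to RDS_A.rdb
go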

However, the re-replication of data modifications may be desirable. For instance, in large implementations, the replicate may be an intermediate in the hierarchical tree. Or, it could be viewed as a slight twist on the asynchronous


request functions described earlier. Only in this case, normal table modifications could function as asynchronous requests. For example, an order entry database could insert a row into a "message queue" table for shipping. At the shipping database, the replicated insert triggers inserts into the "pick" queue, and the status is replicated back to the order entry system. And so on.

Business Cases

Despite their early use as a mechanism to implement parallelism prior to Parallel DSI's, Multiple DSI's still have applicability in most of today's business environments. By now, you may be getting the very correct idea that Multiple DSI's can contribute much more to your replication architecture than just speed. In this section we will take a look at ways that Multiple DSI's can be exploited to get around normal performance bottlenecks as well as to address particular business problems.

Long Transaction Delay

In several of the previous discussions, we illustrated how a long running transaction – whether it be a replicated procedure or several thousand individual statements within a single transaction – can cause severe delays in applying transactions that immediately followed them at the primary. For example, if a replicated procedure requires 4 hours to run, then during the 4 hours that procedure is executing, the outbound queue will be filling with transactions. As was mentioned in one case, this could lead to an unrecoverable state if the transaction volume is high enough that the remaining time in the day is not enough for the Replication Server to catch up.

Multiple DSI’s can deftly avoid this problem. While in Parallel DSI’s, the rs_threads table is used to ensure commit order, no such mechanism exists for Multiple DSI’s. Consequently, while one DSI connection is busy executing the long transaction, other transactions can continue to be applied through the other DSI connections. This is particularly useful in handling overnight batch jobs. Normal daily activity could use a single DSI connection (it still could use parallel DSI’s on that connection though!), while the nightly purge or store close out procedure would use a separate DSI connection. Consider the following illustration:

Figure 102 - Multiple DSI Solution for Batch Processing
(diagram: an OLTP system replicating to a Data Warehouse over separate DSI connections for Customer Trades, Mutual Fund Trades, Closing Trade Position, and Batch Interest Payments)

The approach is especially useful for sites where Replication Server can normally keep up with the transaction volume even during peak processing, but falls behind rapidly due to close-of-business processing and overnight batch jobs.
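Mechanically, an arrangement like this is typically built by adding an aliased entry for the replicate dataserver to the interfaces file and creating an additional Replication Server connection to that alias. The following is a sketch only - server, database, and user names are hypothetical, and the interfaces entry format varies by platform:

# interfaces file: RDS_BATCH resolves to the same host/port as RDS
RDS_BATCH
    master tcp ether rds_host 5000
    query tcp ether rds_host 5000

create connection to RDS_BATCH.rdb
    set error class rs_sqlserver_error_class
    set function string class rs_sqlserver_function_class
    set username rdb_maint2
    set password rdb_maint2_pwd
go

Subscriptions for the nightly batch procedures and tables would then name RDS_BATCH.rdb as the replicate connection, while normal daily activity continues to flow through the original RDS.rdb connection.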

Commit Order Delay

Very similarly, large volumes of transactions that are independent of each other end up delaying one another simply due to commit order. Consider the average Wal-Mart on a Friday night, with 20+ checkout lanes. If the transactions are being replicated, transactions from the express lane would have to wait for the others to execute at the replicate and commit in order, even though the transactions are completely independent and commit consistent. Again, because the transactions are commit consistent, Multiple DSI's allow this problem to be overcome through such techniques as dedicating a single DSI connection to each checkout counter. Similarly, in many businesses, there are several different business processes involved in the same database. Again, these could use separate DSI connections to avoid being delayed by a high volume of activity for another business process. Consider the following:

Figure 103 - Multiple DSI Solution for Separate Business Processes
(diagram: an Airport system replicating to Airline Headquarters over separate DSI connections for Passenger Ticketing, Airfreight Shipments, Aircraft Servicing Costs, and Flight Departures)

A flight departure is an extremely time-sensitive piece of information, yet very low volume compared to passenger check-in and ticketing activities. During peak travel times, a flight departure could have to wait for several hundred passenger-related data records to commit at the replicate prior to being received. During peak processing, a delay of 30


minutes would not be tolerable, as this is the reporting interval required for flight "following" (tracking) from a business sense (i.e. delay the next connecting flight because this one left 45 minutes late) - or simply for timely notification back at headquarters that a delayed flight has finally taken off.

Contention Control

Another reason for Multiple DSI's is to allow better control of the parallelism and consequently reduce contention by managing transactions explicitly. For example, with normal Parallel DSI, a typical online daemon process (such as a workflow engine) will log in using a specific user id. At the primary, there would be no contention within its transactions simply because there is only a single thread of execution. However, with Parallel DSI enabled, considerable contention may occur at the replicate as its transactions are indiscriminately split among the different threads - particularly where aggregates and the like are maintained at the replicate. With multiple queuing engines involved, the problem only gets worse. By using Multiple DSI's, all of the transactions for one user (i.e. a queuing engine) could be directed down the same connection - minimizing the contention between the threads.

Another example of this is present in high volume OLTP situations such as investment banking, in which a small number of accounts (investment funds) incur a large number of transactions during trading and compete with small transactions from a large user base investing in those funds. However, it also can happen in retail banking from a different perspective. Granted, any single account probably does not get much activity. And when it does, it is dispersed between different transactions over (generally) several hours. However, given the magnitude of the accounts, if even a small percentage of them experience timing-related contention, it could translate to a large contention issue during replication. 1% of 1,000,000 is 10,000 - which is still a large number of transactions to retry when an alternative exists. In the example below, however, every transaction that affected a particular account would use the same connection and as a result would be serialized vs. concurrent - and much less likely to experience contention.

Figure 104 - Multiple DSI Approach to Managing Contention
(diagram: a Branch Bank replicating to Headquarters over separate DSI connections for Acct_num mod 0, Acct_num mod 1, Acct_num mod 2 (etc.), and Cross_Acct Transfers)
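Expressed as subscriptions, such per-account routing might look like the following sketch. It assumes the primary table carries a precomputed routing column (e.g. acct_group, populated as acct_num mod 3 by the application or a trigger, since subscription where clauses cannot evaluate expressions), and that HQ_DSI_0 and HQ_DSI_1 are aliased interfaces entries for the same headquarters dataserver:

create subscription acct_sub_grp0
    for account_txn_rd
    with replicate at HQ_DSI_0.bankdb
    where acct_group = 0
go

create subscription acct_sub_grp1
    for account_txn_rd
    with replicate at HQ_DSI_1.bankdb
    where acct_group = 1
go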

One of the advantages to this approach is that, where warranted, Parallel DSI's can still be used. While this is nothing different from other Multiple DSI situations, in this case it takes on a different aspect as different connections can use different serialization methods. For example, one connection in which considerable contention might exist could use "wait_for_commit" serialization, while others use "none".

Corporate Rollups

One of the most logical places for a Multiple DSI implementation is the corporate rollup. No clearer picture of commit consistency can be found. The problem is that Parallel DSI's are not well equipped to handle corporate rollups. Consider the following:

• If one DSI suspends, they all do - which means they all begin to back up, not just the one with the problem. As a result, the aggregate of transactions in the backlog may well exceed possible delivery rates.

• Single Replication Server for delivery. While transactions may be routed from several different sources, it places the full load for function string generation and SQL execution on a single process.

• Large transaction issues. Basically, as stated before, a system becomes essentially single threaded with a large transaction due to commit order requirements. Given several sites executing large transactions, the end result is that corporate rollups have extreme difficulty completing large transactions in time for normal daily processing.

• Limited Parallelism. At a maximum, Parallel DSI only supports 20 threads. While this has proven conclusively to be sufficient for extremely high volume at even half of that, with extremely large implementations (such as nation-wide/global retailers) it still can be too few.

• Mixed transaction modes. "Follow-the-sun" type operations limit the benefits of "single_transaction_per_source", as the number of sources concurrently performing POS activity may be fairly low while others are performing batch operations. Consequently, establishing Parallel DSI profiles is next to impossible as the transaction mix is constantly changing.

Multiple DSI’s can overcome this by involving multiple Replication Servers, limiting connection issues to only that site and allowing large transaction concurrency (within the limits of contention at replicate, of course). In fact, extremely large-scale implementations can be developed. Consider the following:


Figure 105 - Large Corporate Rollup Implementation with Multiple DSI's
(diagram: Field Offices replicating both to a Regional Rollup and directly to the Corporate Rollup)

In the above example, each source maintains its own independent connection to the corporate rollup as well as to the intermediate (regional) rollup. This also allows a field office to easily "disconnect" from one reporting chain and "connect" to the other simply by changing the route to the corporate rollup as well as the regional rollup and changing the aliased destination to the new reporting chain (note: while this may not require dropping subscriptions, it still may require some form of initialization or materialization at the new intermediate site). While not occurring on a regular basis (hopefully), this reduces the IT workload significantly when re-organizations occur.

Asynchronous Requests

In addition to parallel performance, another benefit of Multiple DSI's is as a substitute for asynchronous request functions. As stated earlier, request functions have the following characteristics:

• Designed to allow changes to be re-replicated back to the originator or other destinations.
• Can incur significant performance degradation in any quantity due to reconnection and transaction grouping rules.
• Require synchronization of accounts and passwords.

Multiple DSI’s natively allow the first point but by-pass the last two quite easily. The replicated request functions could simply be implemented as normal procedure replication with the subscription being an independent connection to the same database. In this way, transaction grouping for the primary connection is not impeded, and the individual maintenance user eliminates the administrative headache of keeping the accounts synchronized.

Cross Domain Replication

Although a topic better addressed by itself, perhaps one of the more useful applications of Multiple DSI's is as a mechanism to support cross-domain replication. Normally, once a replication system is installed and the replication domain established, merging it with other domains is a difficult task of re-implementing replication for one of the domains. However, this may be extremely impractical as it disables replication for one of the domains during the process - and is a considerable headache for system developers as well as for those on the business end of corporate mergers who need to consider such costs as part of the overall merger costs.

The key to this is that a database can participate in multiple domains simply by being "aliased" in the other domain the same way as in the Multiple DSI approach - because, in a sense, it is simply a twist on Multiple DSI's: each domain would have a separate connection. Consider the following:


Figure 106 - Multiple DSI Approach to Cross-Domain Replication
(diagram: two replication domains - DS1/DS2 and DS3/DS4 - in which DS1.db1 and DS3.db1 also participate in the other domain through aliased connections DS1a.db1 and DS3a.db1, delivering to DS2.db2 and DS4.db2)

Once the concept of Multiple DSI's is understood, cross-domain replication becomes extremely easy. However, it is not without additional issues that need to be understood and handled appropriately. As this topic is much better addressed on its own, not a lot of detail will be provided here; however, consider the following:

Transaction Transformation - Typically the two domains will be involved in different business processes. For example, Sales and HR. If integrating the two, the integration may involve considerable function string or stored procedure coding to accommodate the fact that a $5,000 sale in one translates to a $500 commission to a particular employee in the other.

Number of Access Points - If the domains intersect at multiple points, replication of aggregates could cause data inconsistencies as the same change may be replicated twice. This is especially true in hierarchical implementations.

Messaging Support - Replicating between domains may require adding additional tables simply to form the intersection between the two. For example, if Sales and Shipping were in two different domains, replicating the order directly - particularly with the amount of data transformation that may need to take place - may be impractical. Instead “queue” or “message” tables may have to be implemented in which the “new order received” message is enqueued in a more desirable format for replication to the other domain.

While some of this may be new to those who've never had to deal with it, any form of workflow automation in particular involves data distribution concepts that are foreign to - and in direct conflict with - academic teachings. Since cross-domain replication is a very plausible means of beginning to implement workflow, some of these concepts need to be understood. However, it is crucial to establish that cross-domain replication should not be used as a substitute for a real message/event broker system where the need for one is clearly established. Whether in a messaging system or accomplished otherwise (e.g. via replication), workflow has the following characteristics:

Transaction Division - While an order may be viewed as a single logical unit of work by the Sales organization, due to backorders or product origination, the Shipping department may have several different transactions on record for the same order.

Data Metamorphism - To the Sales system, it was a blue shirt for $39.95 to Mr. Ima Customer. To Shipping, it is a package 2x8x16 weighing 21 ounces to 111 Main Street, Anytown, USA.

Transaction Consolidation - To Sales, it is an order for Mrs. Smith containing 10 items. To credit authorization, it is a single debit for $120.00 charged to a specific credit card account.

And so forth. Those familiar with Replication Server's function string capabilities know that a lot of different requirements can be met with them. However, as the above points illustrate, cross-domain replication may involve data transformation rules an order of magnitude more difficult - spanning multiple records - which are not supportable by function strings alone. While "message tables" could be constructed to handle simpler cases, doing so increases I/O in both systems and may require modifications to existing application procedure logic, etc. Hence the advent and forte of Sybase Real Time Data Services and Unwired Orchestrator.


Integration with EAI

One if by Land, Two if by Sea... Often, system developers confuse replication and messaging - assuming they are mutually exclusive, or that messaging is some higher form of replication that has replaced it. Both assumptions are equally wrong, and for good reason: remove the guaranteed commit-order processing, provide transaction-level transformations/subscriptions, and Sybase's RS becomes a messaging system. In fact, Sybase's RS is a natural extension to messaging architectures, to the extent that any corporation with an EAI strategy that already owns RS should take a long and serious look at how to integrate RS into their messaging infrastructure (i.e. build an adapter for it). Several years ago, Sybase produced the "Sybase Enterprise Event Broker", which did just that - used Replication Server as a means to integrate older applications with messaging systems. Today, SEEB has been replaced with RepConnector (a component of Real Time Data Services), which consequently is the second-generation product for replication/messaging integration.

The assumption for this section is that the reader is familiar with basic EAI implementations and architectures.

Replication vs. Messaging

Messaging is billed as "application-to-application" integration while replication is often viewed as "database-to-database" integration. The confusion then usually arises as different people proselytize one solution over the other - completely ignorant of the fact that they are entirely different solutions targeted at different needs. In order to straighten this out, let's take a closer look at the characteristics of each solution.

Characteristic             Replication Server                            EAI Messaging
--------------             ------------------                            -------------
Focus                      Enterprise/corporate data sharing at the      Enterprise/Internet B2B integration at the
                           data element level                            message/logical unit of work level
Unit of Delivery           Transaction composed of individual row        Complete message - essentially an intact
                           modifications                                 logical transaction
Serialization              Guaranteed serialization to ensure            Optional - usually not serialized; the desire
                           database consistency                          is to ensure workflow
Subscription Granularity   Row/column value                              Message type, addressees, content, etc.
Event Triggers             DML operation/procedure execution             Time expiration, message transmission
Schema Transparency        Row level with limited denormalization -      Complete transparency (requires an
                           similar data structures                       integration server)
Speed/Throughput           High volume / low (NRT) latency               Medium throughput / hours-to-minutes latency
Implementation Complexity  Low to medium, with singular corporate        Medium to complex, with coordinated
                           administration & support                      specifications / disjoint administration & support
Application Transparency   Transparent with isolated issues; primary     Requires rewrite to form messages; primary
                           transaction unaltered (direct to database)    transaction is asynchronous and may be
                                                                         extensively delayed
Interfaces                 LTL, SQL, RPC                                 EDI, XML, proprietary

While the above would seem to suggest that EAI represents a “better” data distribution mechanism, the real answer is it depends on your requirements. If you want a simpler implementation with NRT latency and high volume replication to an internal system, Replication Server is probably the better solution. However, if flexibility is key – or, if the target


system is not strictly under internal control (i.e. a packaged application or a partner system), EAI is the only choice. In general, EAI extends basic messaging with business level functionality. The following table illustrates how EAI extends basic messaging to include business level drivers.

Replication Server                            EAI Messaging
------------------                            -------------
Guaranteed Delivery                           Guaranteed Delivery
                                                • Time limit
                                                • Non-repudiation (return receipt)
                                                • Delivery Failure
N/A                                           Message Prioritization
                                                • Relative priority
                                                • Time constraints
N/A                                           Perishable Messages
                                                • Time expiration
                                                • Subsequent Message
Transmission Encryption/System                Message Security
Authentication via SSL                          • Sender/user Authenticity
                                                • Privacy
ANSI SQL Interface                            Standards (EDI, XML)
                                                • Protocol Translation
                                                • Custom Protocol Definition
SQL Transactions                              Message Format Distribution
                                                • Message Structures
DML (insert/update/delete),                   Flexible Event Detection
Procedure Executions                            • Failure Events (Non-Events)
                                                • Threshold Events
                                                • State Change Events
                                                • User Requested Events
Row/Column value subscriptions                Message Filters
                                                • Conditions on Events
Individual DB connections                     Addressee Groups
                                                • Hierarchical
                                                • Channels
                                                • Broadcast
Definable actions (stop, retry, log)          Exception Processing
                                                • Corrupted/Incomplete
                                                • Rules (Expiration, Time limit, etc.)
                                                • Actions (Retry, Log, Event)

Now then, let’s consider the classic architectures and when which of these solutions might be a better fit.

Scenario                                    RS   MSG   Rationale
--------                                    --   ---   ---------
Standby System                              X          Transaction serialization
Internal system to packaged application     ?    X     Schema transparency, interface specification -
such as PeopleSoft                                     possibly use both if internal system - use RS to
                                                       signal the EAI solution
Two packaged applications                        X     Schema transparency, interface specification
Corporate Roll-ups/Fan-Out                  X          Little if any translation required (ease of
                                                       implementation); transaction serialization from
                                                       individual nodes
Shared Primary/Load Balancing               X          Little if any translation required (ease of
                                                       implementation); transaction serialization from
                                                       individual nodes
Internal to External (customer/partner)          X     Schema transparency, control restrictions,
                                                       protocol differences
Enterprise Workflow                         ?    X     Possibly use RepConnector to integrate RS with
                                                       EAI - rationale is that business viewpoint
                                                       differences drive large schema differences plus
                                                       the use of packaged applications (i.e. PeopleSoft
                                                       Financials)

The real difference between the two and the need for EAI is apparent in a workflow environment. While RS supports some basic workflow concepts (request functions, data distribution, etc.), it is hampered by the need for similar data structures or extensive stored procedure interfaces to map the data at each target location. To see how complex workflow situations can get, let's take the simple online or catalog retail example.

Different Databases/Visualization

Within different business units in the workflow, the “data” is visualized quite differently. Consider the basic premise of a customer ordering a new PC.

Order Processing Database - It's an HP Vectra PC costing $$$ for Mr. Jones along with a fancy new printer.
HR Database - $$$ in sales at 10% commission for Jane Employee.
Shipping Database - It's 3 boxes weighing 70 lbs to Mulberry St.

Obviously, you could conceive of more - Financials, etc. However, the point is that a single transaction - which may be represented as a single record in the Order Processing database (and a single SKU) - has different elements of interest to different systems. HR really only cares about the dollar figure and the transaction date for payroll purposes, while Shipping cares nothing about the customer nor the financial aspects of the transaction - in fact the single record becomes three in its system. Those familiar with replication know it would be a simple task to use function strings and procedure calls to perform this integration from a Replication Server perspective. However, that would require - in a sense - modifying the application (although this is highly arguable, as adding a few stored procedures that are strictly used as an RS API is no different than message processing).

Different Companies

Additionally, the workflow often requires interaction with external parties - such as credit card clearing houses and suppliers (hint: neither buy.com nor amazon.com REALLY has that "Pentagon"-size inventory). Interactions with external parties have their own set of special issues.

• Still want guaranteed transaction delivery (but the transaction may be changed)
• Mutually untrusted system access
• Complicated by different protocols, structures (EDI 820 messages, fpML messages), etc.

In addition to the external party complexities that Replication Server really can't address, the other aspect of external party interaction is that it often requires a "challenge/response" message before the workflow can continue. For example, the store needs to debit the credit card and receive an acknowledgement prior to the original message continuing along the path to HR and Shipping.

Different Transactions

Additionally, a single business transaction in a workflow environment may be represented by different transactions at different stages of the workflow. As noted above, some stages of the workflow may become synchronous (i.e. credit card debit) before the workflow can continue. The list of transaction operations below is not couched in the terms of


any one EAI product - but is useful when considering the metamorphosis a single business transaction can undergo in a workflow system.

Transaction spawning - a Shipping Request spawns a Stock Order - for example, if the purchase depletes the stock of an item below a threshold, an automatic re-ordering of the product from the supplier is spawned.

Transaction decomposition/division - one order becomes multiple shipments (due to backorder or multiple/independent suppliers). In this sense the order is not complete until each item is complete.

Transaction multiplication - one order drives Accounting, Marketing, Shipping, and so on. In a sense this is multiplication in that for each business transaction, N other messages/transactions will result in the various workflow systems.

Transaction state - one order moves from Booked to Recognized Revenue. In this case, one transaction from the order entry system spawns a transaction to the financial system as well as to order fulfillment. In the financial system, the revenue is treated as "booked" but not credited yet. Once the order has been shipped, the order fulfillment department - in a sense - issues a response message to the order entry system stating the order is complete. Additionally, the shipping department's response also updates the state of the financial system - causing the credit card to actually be debited as well as changing the state of the revenue to "recognized".

The important aspect to keep in mind is that through each of these systems, a transaction identifier is needed to associate the appropriate responses - for retail, this is the order number/item number combination. Additionally, workflow messaging may require challenge/response messaging (as discussed earlier) as well as message merging (merging an airline reservation request, rental car request and hotel reservation request into a single trip ticket for a traveler) over an extended period of time. Consequently, the life span of a message within a messaging system can be appreciable - unlike database replication, in which the message has an extremely short duration (recovery configuration settings aside).

Integrating Replication & Messaging

Having seen that the two are distinctly different solutions, the next question that arises is whether they are complementary. In other words, does it make sense to use both solutions simultaneously in an integrated system? The answer is a resounding "YES". The single largest benefit of integrating replication and messaging systems when both are needed (i.e. a Warm Standby within a workflow environment) is that legacy applications may be included in the EAI strategy without the cost of re-writing existing 3-tier applications - and without the response time impact to front-end systems of adding messaging onto the transaction time. Additionally, existing systems can have extended functionality added without a major re-write. For example, today we expect an email from any online retailer worthy of the name when our order is shipped. This becomes a simple task for RS, RepConnector and EAServer, as a single column update of the status in the database - via a subscription on the shipment status field - could invoke a component in EAServer to extract the rest of the order details, construct an email message and pass it to the email system for delivery. Similarly, RS could use an RPC to add a job to an OpenServer or EAServer based queuing mechanism vs. having the systems constantly polling from a database queue.
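As a sketch of the ship-notification example (the replication definition, connection, and column names are hypothetical, and the RepConnector-backed connection would be defined per the RTDS documentation), the subscription that routes only "shipped" status changes to the messaging side might look like:

create subscription ship_notify_sub
    for order_status_rd
    with replicate at RTDS_CONN.order_events
    where ship_status = 'shipped'
go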

Performance Benefits of Integration

The chief performance benefit of integrating the two solutions comes from eliminating the CPU/process intensive polling mechanism that is commonly used to integrate existing database systems into a new messaging architecture. Any polling mechanism that attempts to detect database changes outside of scanning the transaction log involves one of two techniques: timestamp tracking or shadow tables.

Timestamp tracking involves adding a datetime column to every table in the database. This column is then modified with each DML operation. At a simplistic level, the polling mechanism simply selects the rows that have been modified since the last poll period (a sketch of such a polling pass follows the list below). This technique has a multitude of problems:

1. An isolation level 3 read is required – which could significantly impact contention on the data as the shared/read locks are held pending the read completion. Isolation level 3 is required to avoid row movement (deferred update/primary key change, etc.) from causing a row to be read twice.

2. Deleted rows are missed entirely (they aren't there anymore - so there is no way to detect the modification via the date).

3. Multiple updates to the same row between polling cycles are lost. This could mean the loss of important business data, such as the daily high for a stock price.
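For reference, the polling pass these problems stem from might look like the following sketch (table, column, and variable names are hypothetical):

-- polling pass: pick up rows touched since the last poll
declare @last_poll datetime, @this_poll datetime
select @last_poll = last_poll_time from poll_control    -- hypothetical control table
select @this_poll = getdate()

set transaction isolation level 3   -- needed to avoid re-reading rows that move
select order_id, status, last_modified
  from orders
 where last_modified >  @last_poll
   and last_modified <= @this_poll

update poll_control set last_poll_time = @this_poll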

The second implementation is a favorite of many integration techniques – including heterogeneous Replication Agents where log scanning is not supported. This implementation has a number of considerations (not necessarily problems, but could have system impact):


1. Lack of transactional integrity – each table is treated independently of the parent transaction. Consequently a transaction tracking table is necessary to tie individual row modifications together in the concept of a transaction. Additionally, each operation (i.e. inserts into different tables) would have to be tracked ordinally to ensure RI was maintained as well as serialization within the transaction.

2. Lack of before/after images – if all that is recorded is the after image, then again, there would be issues with deletes – additionally critical information for updates would be lost. As a result, the shadow table would have to track before/after values for each column.

3. Extensive I/O for distribution. A single insert becomes:
   a. Insert into real table(s)
   b. Insert into shadow table(s)
   c. Insert into transaction tracking table
   d. Distribution mechanism reads transaction tracking table
   e. Distribution mechanism reads shadow table(s)
   f. Distribution mechanism deletes rows from shadow table(s)
   g. Distribution mechanism deletes rows from transaction tracking table

This last consideration may not be that much of a concern on a lightly or moderately loaded system. However, if the system is nearing capacity, this activity could bring it to its knees. Additionally, as the distribution mechanism reads or removes records from the shadow tables, it could cause contention with source transactions that are attempting to insert rows.
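A minimal sketch of the shadow-table approach (all names hypothetical, including the transaction tracking table) shows where the extra I/O comes from:

-- shadow table capturing before/after images plus ordering information
create table orders_shadow (
    xact_id     numeric(10,0) not null,   -- ties rows to the tracking table
    op_seq      numeric(10,0) identity,   -- preserves order within the transaction
    operation   char(1)       not null,   -- I/U/D
    order_id    int           not null,
    old_status  varchar(20)   null,
    new_status  varchar(20)   null
)
go

-- trigger doubling every insert on the real table
create trigger orders_ins_shadow on orders for insert
as
    insert into orders_shadow (xact_id, operation, order_id, old_status, new_status)
    select t.xact_id, 'I', i.order_id, null, i.status
      from inserted i, xact_tracking t      -- hypothetical tracking table keyed by spid
     where t.spid = @@spid
go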

As a consequence – ignoring the cost/development benefits of an integrated solution – integrating Replication Server with a messaging system could achieve greater overall performance & throughput than simply forcing a messaging solution. The key areas of improved performance would be:

• Reduced latency for event detection – Replication Agents work in Near-Real Time whereas a polling agent would have a polling cycle – possibly taking several minutes to detect a change.

• Reduced I/O load on primary system – by scanning directly from the transaction log, the I/O load - and associated CPU load – of timestamp scanning or maintaining shadow tables are eliminated for ASE systems. Shadow tables may still be necessary for heterogeneous systems.

• Reduced contention.

The conclusion is fairly straight-forward. For any site with existing applications that does not wish to undertake a massive recoding effort - particularly if the system is already involved in replication (i.e. Warm Standby) - integrating replication with messaging may improve performance & throughput over using both individually and suffering the impact that a database adapter could inflict.

Messaging Conclusion

This section may have appeared out of context with the rest of this paper. However, it was included to illustrate the classic point that sometimes better performance and throughput is a system-wide consideration and a shift in architecture may achieve more for overall system performance than merely tweaking RS configuration parameters.

Key Concept #38: A corollary to “You can’t tune a bad design” is “A limited architecture may be limiting your business”.


Sybase Incorporated
Worldwide Headquarters
One Sybase Drive
Dublin, CA 94568, USA
Tel: 1-800-8-Sybase
www.sybase.com

Copyright © 2000 Sybase, Inc. All rights reserved. Unpublished rights reserved under U.S. copyright laws. Sybase and the Sybase logo are trademarks of Sybase, Inc. All other trademarks are property of their respective owners. ® indicates registration in the United States. Specifications are subject to change without notice. Printed in the U.S.A.