henrietta dombrovskaya - a new approach to resolve object-relational impedance mismatch @ postgres...
TRANSCRIPT
A New Approach to Resolve Object-Relational Impedance Mismatch
Henrietta Dombrovskaya Enova Chicago IL [email protected]
What is Object-Relational Impedance Mismatch and why It is
bad? Why should we care?
We should care because we care
about..
2013 Henrietta Dombrovskaya – Enova 2
Database Application Performance
Everybody wants their application to function efficiently, right? After all – why choose to use databases in the first place? Because The DBMS is specialized software designed to manage data in the most efficient way. Nevertheless, the most common complaint of application developers is …. 2013 Henrietta Dombrovskaya – Enova 3
THE DATABASE IS
SLOW
WHY??? 2013 Henrietta Dombrovskaya – Enova 4
Let’s see…
Where do we usually start to look, if we want
to see, why the database performance is bad?...
2013 Henrietta Dombrovskaya – Enova 5
Our plan of actions
Look at the most long-running queries and try to optimize them
Look at the queries, which are executed most often, and make sure all of them are tuned and/or running fast enough
Everything looks good, why is the app still slow?
2013 Henrietta Dombrovskaya – Enova 6
Helpful reports:
2013 Henrietta Dombrovskaya – Enova 7
Our performance monitoring tool (pgBadger) produces several classes of reports, among them: Longest-running queries (slowest queries)
Most frequently running queries Queries, which take up the most time
(top offenders).
Based on these reports - is a database
really slow? The top 50 most offensive queries, i.e. the queries
which eat up the most of the execution time, almost exclusively are the queries, which are very short (fast), but executed too many times For example
SELECT * FROM loans WHERE id=?
is executed 8,500,000 times during the day, each time taking several milliseconds, with total execution time about 2.5 hours.
2013 Henrietta Dombrovskaya – Enova 8
You may say – that’s how many times the
screen is refreshed, but…
UK: account home controller as accessed about 50,000 times during the day, and the customer support tool even less
However, the pgBadger log shows total almost 1.5 million executions of different versions of the
SELECT sum_accounts statement and similar number of SELECT* FROM loans. Statistics show that some application controllers produce
over 1,000 database calls for each screen refresh.
2013 Henrietta Dombrovskaya – Enova
How could this possibly be happening?! Let’s take a step back…
2013 Henrietta Dombrovskaya – Enova 10
First came a program…
• Once upon a time there was a program… • Until mid-60’s – only sequential access storage was available; a program was
considered a stable object, which may process different input data, and produce different output
Input Output • Direct access storage appeared in the early 70s, allowing concurrent data access and
selective updates. At the same time developers started to face rising complexity and costs of software development, concurrency control and data integrity problems, and a need of specialized data-management software became apparent.
• DBMS emerged as specialized programs for centralized data management
2013 Henrietta Dombrovskaya – Enova
Program
11
DBMS
Since Then We Have…
-Imperative programming languages, which tells, how to do things
for (i:=1, i++, n) do … end;
and - Declarative data manipulating languages, which define
what to do: SELECT first_name, last_name FROM people WHERE id=101
2013 Henrietta Dombrovskaya – Enova 12
Database provides: • Independent data storage • Complex data structures • Data integrity • Concurrent access and data consistency Application provides: • User interface • Application control flow 2013 Henrietta Dombrovskaya – Enova 13
So… Now everybody is happy, right? …not really…
2013 Henrietta Dombrovskaya – Enova 14
Why? Because…
Both imperative programming languages and declarative query languages work perfectly to accomplish the tasks the were designed to accomplish. The problems start, when we try to make them to work together.
2013 Henrietta Dombrovskaya – Enova 15
What is Impedance Mismatch?
These tools were designed to operate with objects of different size (granularity): variables vs. tables And different means of identification: address vs. set of attribute values More precisely: • Object incompatibility - database objects are “too big” for the
imperative programming language • This means that classes and methods for tables and sets should be
integrated with the database to work effectively (methods should be executed by the DB engine)
• But most of the architectures do not allow such integration, which leads to reimplementation of database operations at the application layer
2013 Henrietta Dombrovskaya – Enova 16
When We Have an OO Application…
• During one screen rendering: objects may belong to different classes with different access methods
• Object-Relational Mapping (ORM): maps a database object to the in-memory application object: – solves a problem of abstraction from details of data storing – does not provide effective means of manipulation with data sets
This particular case of impedance mismatch is called ORIM – object-relational impedance mismatch
2013 Henrietta Dombrovskaya – Enova 17
ORIM Definition
“A set of conceptual and technical difficulties that is encountered when a relational database management system (RDBMS) is being used by a program written in an object-oriented programming language or style, particularly apparent when objects or class definitions are mapped in a straightforward way to database tables or relational schema”
Ambler, S., Agile Database Techniques: Effective Strategies for the Agile Software Developer
2013 Henrietta Dombrovskaya – Enova 18
Development Standards & Performance
Internal company standards and/or industry standards may prevent from creating efficient database application.
Typical OO design recommendations - clearly separate the following layers of classes:
• End-user interface • Business logic • Relationships • Interaction with the database The standard allows limited interaction between adjacent
levels, therefore equivalents of database operations may be implemented within several different processes, sometimes on different machines.
2013 Henrietta Dombrovskaya – Enova 19
Translation:
• As a database developer you would assume, that to get all payments for all loans for a given customer, you need to join a loans table with the payments table
• But for an application developer loans and payments belong to different classes, therefore they are accessed by different sets of methods
• Which means, that the join operation is replicated “manually”..
Let’s look at some examples…
2013 Henrietta Dombrovskaya – Enova 20
What We Have at Enova?
Enova is a Ruby on Rails shop that uses the ActiveRecord Object Relational Mapping (ORM) library to communicate with a Postgres database. With ActiveRecord a database table or view is mapped into a class and an object instance is tied to a single row in the table. ActiveRecord is the default “model” component of the model-view-controller web-application framework Ruby on Rails, and is also a stand-alone ORM package for other Ruby applications.
2013 Henrietta Dombrovskaya – Enova 21
How ActiveRecord Works
2013 Henrietta Dombrovskaya – Enova 22
What This Means for Application/
Database Interaction Due to the lack of awareness of the underlying database interaction on the part of the object methods, one controller performs multiple trips to the database For example…
2013 Henrietta Dombrovskaya – Enova 23
Account Presenter
def initialize(customer) @customer = customer @customer_extra = customer.customer_extra @person = customer.person @address = customer.person.try(:address) @company = customer.person.try(:company) @bank_account = customer.bank_account(true) @debit_card = customer.debit_card @customer_paydate = customer.customer_paydate(true) @paydate_schedule = customer.customer_paydate.try(:paydate_schedule) @customer_source = customer.customer_source end
2013 Henrietta Dombrovskaya – Enova 24
Corresponding Application Log
SELECT * FROM customers WHERE (customers.id = 12470535) SELECT * FROM people WHERE (people.id= 61657007 AND (type = 'CustomerPerson')) AND ( (people.type = 'CustomerPerson' ) ); SELECT addresses.*, people_addresses.serial_number FROM addresses INNER JOIN people_addresses ON addresses.id = people_addresses.address_id
WHERE (people_addresses.person_id = 61657007 AND (eff_end_date is NULL)); 2013 Henrietta Dombrovskaya – Enova 25
But Wait, There’s More!
SELECT * FROM approvals WHERE (customer_id = 12470535) ORDER BY processed_on desc LIMIT 1 SELECT * FROM customers WHERE (customers.id = 12470535) SELECT * FROM loans WHERE (loans.id = 25563928) SELECT * FROM loans WHERE (customer_id = 12470535 and status_cd in ('applied','approved','on_hold')) ORDER BY funding_date DESC LIMIT 1 SELECT * FROM loans WHERE (customer_id = 12470535 and status_cd in ('applied','approved',E 'on_hold')) ORDER BY funding_date DESC LIMIT 1 SELECT * FROM loans WHERE (customer_id = 12470535 and status_cd in (‘issued','issued_pmt_proc')) ORDER BY funding_date DESC LIMIT 1 SELECT * FROM loans WHERE (customer_id = 12470535 and status_cd in ('applied','approved',E'on_hold')) ORDER BY funding_date DESC LIMIT 1 SELECT * FROM customers WHERE (customers.person_id = 61657007) SELECT count(*) AS count_all FROM loans WHERE (loans.customer_id = 12470535 AND (status_cd in ('applied','approved','on_hold','issued','issued_pmt_proc') and loan_type_cd = 'installment')) SELECT * FROM loans WHERE (customer_id = 12470535 and status_cd in ('applied','approved','on_hold')) ORDER BY funding_date DESC LIMIT 1 SELECT * FROM loans WHERE (customer_id = 12470535 and status_cd in ('applied','approved','on_hold')) ORDER BY funding_date DESC LIMIT 1 2013 Henrietta Dombrovskaya – Enova 26
What We Should See Instead…
2013 Henrietta Dombrovskaya – Enova
SELECT * FROM customers WHERE (customers."id" = 12470535) SELECT * FROM people WHERE (people."id" = 61657007 AND (type = 'CustomerPerson')) AND ( (people."type" = 'CustomerPerson' ) );
SELECT addresses.*, people_addresses.serial_number FROM addresses INNER JOIN people_addresses ON addresses.id = people_addresses.address_id WHERE (people_addresses.person_id = 61657007 AND (eff_end_date is NULL));
27
Why Does This Happen?
… because the app is written in an imperative way:
“Fist we need to select all loans, then for each loan we need to select…..”
2013 Henrietta Dombrovskaya – Enova 28
There Are Some Ways To Improve
Eager loading generates almost correct SQL: SELECT customers."id”
,customers."created_by” <..>,customers."person_id" <..>,customers_extra."estimated_monthly_living_expense" <..>,people."type"
<...>,addresses."id"<...> ,companies."id" <...>
FROM customers LEFT OUTER JOIN customers_extra ON customers_extra.customer_id = customers.id LEFT OUTER JOIN people ON people.id = customers.person_id AND people."type" = 'CustomerPerson’ LEFT OUTER JOIN people_addresses ON people_addresses.person_id = people.id LEFT OUTER JOIN addresses ON addresses.id = <…>
WHERE (customers."id" = 17674188)
2013 Henrietta Dombrovskaya – Enova 29
But Then We Apply the Next Method
has_many :bank_accounts, :foreign_key => 'person_id' do def default_for_customer(customer, reload = false) @bank_account_cache = {} if @bank_account_cache.nil? || reload @bank_account_cache[customer.brand.code] ||= find(:first, :joins => "INNER JOIN payment_instruments.defaults pid ON pid.payment_instrument_id = payment_instruments.bank_accounts.payment_instrument_id", :conditions => ["pid.customer_id = ?", customer], :readonly => false) end And we execute SELECT again! 2013 Henrietta Dombrovskaya – Enova 30
If We Continue With Existing Frameworks
… this problem (ORIM) will never be solved, and we will continue to loose money on timeouts!
2013 Henrietta Dombrovskaya – Enova 31
New Approach to Resolve Object-Relational
Impedance Mismatch Our new approach allows drastically reduce the number of calls to the database, and get exactly the data we need. An application view can be rendered using one or two calls to the database instead of 500-900 call in the old app. In addition, it allows to optimize queries execution without any changes to the application and/or models.
2013 Henrietta Dombrovskaya – Enova 32
The old methods are unaware of the data which is already available – make them
data-aware!
Isn’t this contrary to the standard OO approach? Yes, but…
This is the only way to improve the App/DB interaction.
2013 Henrietta Dombrovskaya – Enova 33
Logic Split Methodology
Disassemble method into atomic steps,
Identify ones, which require data retrieval
Using the knowledge about database objects relationships, construct a single query
Execute
Use retrieved data in other steps
2013 Henrietta Dombrovskaya – Enova 34
Example: Amount_Outstanding
2013 Henrietta Dombrovskaya – Enova
Definitions
AccountsOutstanding= AccountsUncollected + FeesOutstanding + InterestOutstanding + PrincipalAccounts + AccountsDue In turn: • AccountsUncollected =
uncollected_principal + uncollected_installment_principal
2013 Henrietta Dombrovskaya – Enova Financial 36
Under the Hood: Database Calls
SELECT vl.value AS account ,SUM(CASE vl.value WHEN pt.debit_account_cd THEN pt.amount ELSE 0 END) -‐ SUM(CASE vl.value WHEN pt.credit_account_cd THEN pt.amount ELSE 0 END) AS sum FROM payment_transactions pt JOIN valuelists vl ON vl.type_cd ='transaction_account' AND vl.value IN (pt.debit_account_cd, pt.credit_account_cd) AND loan_id=?
… and then a value for specific account is selected. 2013 Henrietta Dombrovskaya – Enova 37
Drawbacks
The method itself would allow retrieving all the information related to one loan “in one shot”. However, because the application developers are unaware of the underlying layers, there appears to be no difference:
– Whether we obtain the values of all account balances one-by-one by following normal Object-Oriented method logic, in an imperative way
– Or if we obtain them all, simultaneously
Now for some results:
2013 Henrietta Dombrovskaya – Enova 38
PGBadger Log
2013 Henrietta Dombrovskaya – Enova 39
Modified Method
2013 Henrietta Dombrovskaya – Enova 40
Under the Hood: Database Calls
SELECT loan_id , sum(CASE WHEN debit_account_cd = 'uncollected_principal' THEN pt.amount ELSE 0 END -‐CASE WHEN credit_account_cd='uncollected_principal' THEN pt.amount ELSE 0 END) AS uncollected_principal <...> , sum(CASE WHEN debit_account_cd = 'uncollected_nsf_fees' THEN pt.amount ELSE 0 END -‐ CASE WHEN credit_account_cd = 'uncollected_nsf_fees' THEN pt.amount ELSE 0 END) AS uncollected_nsf_fees , sum(CASE WHEN debit_account_cd = 'installment_principal' THEN pt.amount ELSE 0 END -‐ CASE WHEN credit_account_cd = 'installment_principal' THEN pt.amount ELSE 0 END) AS installment_principal FROM payment_transactions_committed pt INNER JOIN loans l ON l.id=pt.loan_id WHERE customer_id={?} GROUP BY loan_id
2013 Henrietta Dombrovskaya – Enova 41
Execution Statistics From Dark Testing: Old
2013 Henrietta Dombrovskaya – Enova 42
… and New
2013 Henrietta Dombrovskaya – Enova 43
So… now everybody happy?... …not really… What do application developers say at this point?.... Wait! What about the business logic?! 2013 Henrietta Dombrovskaya – Enova 44
What About the Business Logic?
We need some business logic to know how to execute joins and selects – just enough to minimize the number of db calls.
Selected results transformations and manipulations do not have to be executed on the database side.
2013 Henrietta Dombrovskaya – Enova 45
One More Time
ü Disassemble method into atomic steps, ü Identify ones which require data retrieval ü Using knowledge about database objects
relationships, construct a single query ü Execute ü Use retrieved data in other steps
2013 Henrietta Dombrovskaya – Enova 46
Let’s review another example – account balance calculation for OEC loans.
2013 Henrietta Dombrovskaya – Enova 47
Account_Balance Method for OEC
Ø Obtain account principal balance Ø Obtain outstanding fees and interest as of
next payment due date Ø Calculate the interest credit (unearned
interest) for the number of days left before the payment due date
Ø Obtain existing customer balance Ø Calculate the total account balance using the
values obtained on steps 1-4.
2013 Henrietta Dombrovskaya – Enova 48
Traditional Object-Oriented Approach
Create Account_Balance method which would call the following methods:
Ø Principal_Balance, Ø Interest_Amount, Ø Fees_Amount, Ø Customer_Balance Ø Interest_Credit
Each of these methods would interact with a database independently.
2013 Henrietta Dombrovskaya – Enova 49
Drilling Down Into Each Of the Steps 1. Principal balance as described for account_outstanding requires a
single database call. 2. Outstanding interest and fees require one database call each. 3. Interest credit calculation:
3.1. Obtain the daily interest rate for this customer 3.2. Obtain base amount, which is used to calculate the total interest 3.3. Obtain the number of days, for which the interest should be credited:
3.3.1. Obtain the next payment due date 3.3.2. Calculate number of days based on obtained date and todays’ date
3.4. Calculate amount of credit, based on results from the previous three steps
4. Customer balance can be obtained using one database call, same as steps 1-3.
2013 Henrietta Dombrovskaya – Enova 50
Combining Steps With Data Retrieval
Those steps are: 1,2,3, 3.1, 3.2, 3.3.1, 4, which means that the database accessing method should execute the following task: For a given loan, retrieve payment transactions, which show principal balance, current interest, fees and customer balance, also retrieve loan’s daily interest rate and next payment due date. 2013 Henrietta Dombrovskaya – Enova 51
SELECT statement
SELECT l.id AS loan_id ,sum( CASE WHEN debit_account_cd = ‘principal’ AND t.acct_date<= v_current_date THEN t.amount ELSE 0 END - CASE WHEN credit_account_cd =’principal’ AND t.acct_date<= v_current_date THEN t.amount ELSE 0 END ) AS amount_payable ,sum (CASE WHEN t.debit_account_cd=’fees_provisional’ THEN t.amount ELSE 0 END - CASE WHEN t.credit_account_cd= ‘fees_provisional’ THEN t.amount ELSE 0 END ) AS fees_provisional ,sum (CASE WHEN t.debit_account_cd=’ interest_provisional’ THEN t.amount ELSE 0 END - CASE WHEN t.credit_account_cd =’interest_provisional’ THEN t.amount ELSE 0 END ) AS interest_provisional ,st.end_date AS next_closing_date ,l.daily_rate AS interest_rate ,sum (CASE WHEN t.debit_account_cd ='customer_balance'
THEN -t.amount ELSE 0 END - CASE WHEN t.credit_account_cd ='customer_balance' THEN -t.amount ELSE 0 END ) AS customer_balance FROM loans l LEFT OUTER JOIN payment_transactions_committed t ON l.id=t.loan_id LEFT OUTER JOIN statements st ON l.id=st.loan_id WHERE l.id={?} GROUP BY l.id ,l.daily_rate ,st.end_date
2013 Henrietta Dombrovskaya – Enova 52
Ruby Method: account_balance
def account_balance amount_payable = amount_outstanding + amount_charged_off amt = amount_payable - customer_balance provisional_interest = interest_provisional provisional_fees = fees_provisional if provisional_interest > 0.0 amt = amt + provisional_fees + [ provisional_interest - [unearned_interest, 0].max, 0 ].max else amt = amt + provisional_fees end amt = amt.round_near return [0, amt].max end
2013 Henrietta Dombrovskaya – Enova 53
Ruby Method: unearned_interest
def unearned_interest amt = -1 * (principal_amount + [(customer_balance - interest_provisional - fees_provisional), 0].max) next_closing_date = Date.parse(self.next_closing_date) return 0 unless next_closing_date interest_period = [next_closing_date - Date.today + 1, 0].max interest_rate = oec_daily_rate.to_f interest = amt * interest_period * interest_rate interest = interest.round_down interest < 0.0 ? 0 : interest end
2013 Henrietta Dombrovskaya – Enova 54
How It Works
This methods is used as a callback in a more complex method, which returns Loan_Summary where most of the values can be selected directly from the database. For compliance reasons all calculations are executed in the Ruby model. We need a “virtual column” as one of the fields of the data output; the Ruby callback can be inserted into the data set here. The callback produces required value, and the whole data set is passed to the upper-level method for future processing and then to the web application. 2013 Henrietta Dombrovskaya – Enova 55
2013 Henrietta Dombrovskaya – Enova 56
LOAN_ID, FUNDING_DATE< FUNDING_AMOUNT FEE STATE APR LOAN_STATUS AMOUNT_DUE ACCOUNT_BALANCE PRINCIPAL_AMOUNT CUSTOMER_BALANCE INTEREST_PROVISIONAL <…> LANGUAGE_CD
ACCOUNT_BALANCE METHOD
Note, that all data elements can still be retrieved with a single select statement, which won’t be possible within the standard ORM framework.
On a Larger Scale
In our new application a single Postgres function retrieves summaries for all customer loans. Here is some execution statistics:
2013 Henrietta Dombrovskaya – Enova Financial 57
Number of Database Calls/Customer
2013 Henrietta Dombrovskaya – Enova Financial 58
1 2 3 4 5 6 7 8 9 Old App avg_calls 121 196 373 449 632 765 819 923 1129 New App avg_calls 4 4 4 4 4 4 4 4 4
0
200
400
600
800
1000
1200
Old App avg_calls New App avg_calls
Execution Time/Customer
2013 Henrietta Dombrovskaya – Enova Financial 59
0
1
2
3
4
5
6
7
8
9
10
1 2 3 4 5 6 7 8 9
New App avg_time Old App avg_time
Now
• We save both time
• And money!
2013 Henrietta Dombrovskaya – Enova 60
Were we the first to notice the problem? Definitely NOT!
2013 Henrietta Dombrovskaya – Enova 61
Related Work
The problem of object-relational impedance mismatch is a constant discussion topic, when it comes to developing an efficient application. In recent years multiple attempts were made to try to resolve this issue with no significant outcome. 2013 Henrietta Dombrovskaya – Enova Financial 62
First Came HORA
HORA (Hybrid Object-Relational Architecture) approach was first introduced in 1993 and became a foundation of multiple ORM-based systems, such as:
• Java Persistence • ActiveJDBC • ADO.NET • Ruby on Rails ActiveRecord • Many others.
All implementations have the same ORIM issues. 2013 Henrietta Dombrovskaya – Enova Financial 63
Hybernate
Hybernate is a high-performance Object/Relational persistence and query service, considered one of the most flexible and powerful Object/Relational solutions on the market. • Mapping from Java classes to database tables
and from Java data types to SQL data types. • Significantly reducing development time • Allowing the application developers concentrate
on the business side of the project. 2013 Henrietta Dombrovskaya – Enova Financial 64
Hybernate Developers Claim
• In contrast to other solutions it does not hide “the power of SQL” from developers.
• This claim is true in some sense, since the
solution, indeed allows us to write queries similar to SQL queries.
2013 Henrietta Dombrovskaya – Enova Financial 65
The Caveat
• Creation of complex queries using Hybernate is
not an easy task • Similar to other ORM systems, Hybernate
prompts for solutions, which seem more natural for application developers
2013 Henrietta Dombrovskaya – Enova Financial 66
Newer Version of Active Record
• Allows Eager loading and some customer queries, but has the same limitations as Hybernate
• Eager loading may cause an excessive application memory usage
2013 Henrietta Dombrovskaya – Enova Financial 67
Agile Technology
• acknowledges the existence of ORIM ; • not only acknowledges the technical impedance
mismatch, but also a cultural impedance mismatch:
“The object-oriented paradigm is based on proven software engineering principles. The relational paradigm, however, is based on proven mathematical principles”.
• while raising awareness of the problem, the solutions, proposed by Agile data technology refers mostly to the database schema changes and/or more careful design and data refactoring, but it offers little in terms of dealing with inefficient queries.
2013 Henrietta Dombrovskaya – Enova Financial 68
What’s next?
2013 Henrietta Dombrovskaya – Enova 69
Future Work
Rewriting a whole application using our new methodology of splitting logic between an application and a database; we are in the middle of the process. Problems:
– Large amount of the legacy code, where most of the business logic is embedded in Ruby classes.
– Virtually no business specifications, which means that we have to extract the business logic from the existing code.
– The legacy application is evolving and existing models are being modified all the time.
– The human factor
Because of this….
2013 Henrietta Dombrovskaya – Enova Financial 70
Moving In Two Directions:
ü Continue to rewrite the larger parts of our applications, shooting for having only a couple of database queries per screen rendering;
ü While doing this, clarifying our technology, making it more transparent and easier to use by application developers.
2013 Henrietta Dombrovskaya – Enova Financial 71
Aknowledgements
Many thanks to: • Enova CIO/CTO Fred Lee • Team Raven, and everybody who worked on the Aperture
project during it’s lifetime:
• Sheldon Strauch and the entire DB Dev team • Chad Slaughter and Kurt Stephens • Ben Heilman and Donny Jekels 2013 Henrietta Dombrovskaya – Enova Financial 72
– Cheetana Yogeesh – Allan Zoppa – Ana Lebon – Kevin Glowacz – Neha Bhardwaj
– Jef Jonjevic – Marc Groulx – Richard Lee – Preeti Dhiman – Luda Baklanova