Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Salesforce API Series: Fast Parallel Data Loading with the Bulk API

February 26, 2014

Upload: salesforce-developers

Post on 26-Jan-2015

DESCRIPTION

Can you load 20 million records into Salesforce in under an hour? If not, this webinar is for you.

You want to load tons of data into Salesforce. No problem, right? Just use the Bulk API and turn on parallel loading. Think again. Unless you carefully plan the big data loads that you want to break up into parallel operations to achieve maximum throughput, those loads can turn out more like slow, serial loads.

In this webinar, Sean and Steve will teach you how to realize awesome throughput in your parallel data loads on the Salesforce1 Platform. After learning from the webinar's demos and code samples, you'll be able to apply your new deep knowledge of platform internals to measure load performance, recognize problems that slow your loads down, and work around these roadblocks.

Key Takeaways

§  Learn what parallelism is and how significant optimizing it is for performance

§  Learn how to architect an integration or load tool to optimize parallelism, and obtain the maximum possible throughput

§  Learn how to manage locks to avoid lock exceptions that can significantly reduce the throughput in your loads and integrations

Intended Audience

§  Salesforce architects or Force.com developers with a working understanding of data loading and integration concepts. A high-level understanding of the Bulk API and Java is also useful.

TRANSCRIPT

Page 1: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Salesforce API Series: Fast Parallel Data Loading with the Bulk API

February 26, 2014

Page 2: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

#forcewebinar

Safe Harbor

Safe harbor statement under the Private Securities Litigation Reform Act of 1995:

This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.

The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our quarterly report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.

Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.

Page 3: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Speakers

Steve Bobrowski, Architect Evangelist, @sbob909

Sean Regan, Architect Evangelist, @sfdcsregan

Page 4: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Follow Developer Force for the Latest News

@forcedotcom / #forcewebinar

Developer Force – Force.com Community

+Developer Force – Force.com Community

Developer Force

Developer Force Group

Page 5: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

How fast can you load data into Salesforce?

Page 6: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

How many records can you load into Salesforce in 1 hour?

Page 7: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Data load throughput

[Chart: data load throughput in records/hour, on a scale from 0 to 25,000,000, with ranges labeled OK, Fast, and Faster]

Page 8: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallel processing

Page 9: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

A parallel processing analogy: digging a ditch

Page 10: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Serial processing

Page 11: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallel processing

Page 12: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Degree of parallelism: the number of processes or threads associated with an operation.

Page 13: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Optimal parallel processing

[Diagram: Serial (one 20M-record load) vs. Parallel (four 5M-record loads) along a Time axis]

Page 14: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Sub-optimal parallel processing

[Diagram: Serial (one 20M-record load) vs. Parallel (four 5M-record loads) along a Time axis]

Page 15: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Locks, exceptions, triggers, relationships, …

[Diagram: Serial (one 20M-record load) vs. Parallel (four 5M-record loads) along a Time axis, annotated "Throughput inhibitors"]

Page 16: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Data load case studies

§  Get hands on with the Salesforce Bulk API

§  Contrast serial data loads vs. parallel data loads

§  Measure degrees of parallelism and throughput

§  Identify and avoid throughput inhibitors

§  Achieve maximum throughput

Page 17: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Prep work

Page 18: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Salesforce Bulk API

§  Asynchronous data loading

§  Optimized for large data sets

§  REST API

§  Powers many tools

§  Use to build custom tools with any programming language (Java, etc.)
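
As a rough illustration of the asynchronous, REST-style flow listed above, the sketch below builds the XML body a client would send to create a Bulk API job. The element names (`operation`, `object`, `concurrencyMode`, `contentType`) and the namespace follow the Bulk API job resource as commonly documented; the helper name `build_job_xml`, the object name `Account`, and the element order are assumptions to verify against the Bulk API developer guide. No network call is made.

```python
import xml.etree.ElementTree as ET

# Namespace used by Bulk API job documents.
ASYNC_NS = "http://www.force.com/2009/06/asyncapi/dataload"


def build_job_xml(sobject, operation="insert", concurrency_mode="Parallel"):
    """Return the XML body for a create-job request (hypothetical helper)."""
    if concurrency_mode not in ("Parallel", "Serial"):
        raise ValueError("concurrency_mode must be 'Parallel' or 'Serial'")
    # Serialize with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", ASYNC_NS)
    job = ET.Element(f"{{{ASYNC_NS}}}jobInfo")
    for tag, text in (
        ("operation", operation),
        ("object", sobject),
        ("concurrencyMode", concurrency_mode),  # element order is an assumption
        ("contentType", "CSV"),
    ):
        ET.SubElement(job, f"{{{ASYNC_NS}}}{tag}").text = text
    return ET.tostring(job, encoding="unicode")


print(build_job_xml("Account", concurrency_mode="Serial"))
```

In a real client this body would be POSTed to the job endpoint (`/services/async/<version>/job`) with the session ID in the `X-SFDC-Session` header; the version segment depends on your org and is left unspecified here.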

Page 19: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Demo schema

Page 20: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Bulk API Loads that …

Realize, Investigate, and Plan

Page 21: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Case Studies

Page 22: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Serial Data Load

Page 23: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Serial load: Expected plan

[Diagram: batches scheduled across 16 threads over time]

•  One job
•  100 batches
•  10,000 records/batch
•  1M total records
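
The plan above (one job, 100 batches of 10,000 records) implies splitting the source data into fixed-size chunks before submitting them. A minimal sketch of that split follows; the function name `split_into_batches` is hypothetical, and a range of integers stands in for real CSV rows.

```python
from itertools import islice


def split_into_batches(records, batch_size=10_000):
    """Yield consecutive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for 1M rows; a real load would read rows from a CSV file.
batches = list(split_into_batches(range(1_000_000)))
print(len(batches), len(batches[0]))  # prints "100 10000"
```

Because the function consumes an iterator, it also works on a file handle or a streaming reader without holding the whole data set in memory.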

Page 24: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Serial load: Job configuration

Page 25: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Serial load: Batch creation

Page 26: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Serial load: Batch run

Page 27: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Demo Serial load

Page 28: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Serial load summary

Concurrency Mode: Serial
Records Loaded: 1 million
Records Failed: 0
Run Time: 52 minutes
Work Completed: 48 minutes
Throughput: 19,500 records per minute
Degree of Parallelism: 0.94
Key Problem: Degree of parallelism explicitly limited to ~1.
Solution: Explore parallel load for increased throughput.
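
The two derived figures in this summary can be approximated from its raw numbers. The sketch assumes that throughput is records loaded divided by run time, and that degree of parallelism is work completed divided by run time; the slides do not spell out either formula, so both are assumptions. The results land close to, but not exactly on, the reported 19,500 records per minute and 0.94, presumably because the slide rounds its inputs.

```python
def throughput_per_minute(records_loaded, run_time_minutes):
    """Records loaded per minute of wall-clock run time (assumed definition)."""
    return records_loaded / run_time_minutes


def degree_of_parallelism(work_completed_minutes, run_time_minutes):
    """Server work time divided by wall-clock run time (assumed definition)."""
    return work_completed_minutes / run_time_minutes


# Serial load figures from the summary: 1M records, 52 min run, 48 min of work.
print(round(throughput_per_minute(1_000_000, 52)))  # 19231
print(round(degree_of_parallelism(48, 52), 2))      # 0.92
```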

Page 29: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallelism vs. Throughput of a Single Job

[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Plotted: Serial]

Serial Run
•  Low degree of parallelism

Page 30: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallel data loads

Page 31: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallel load: Expected plan

[Diagram: batches scheduled across 16 threads over time]

•  One job
•  100 batches
•  10,000 records/batch
•  1M total records

Page 32: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallel load: Job configuration

Page 33: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Things to watch for

§  Locks can significantly affect parallel loads

–  Wasted processing capacity

–  Reduced throughput

–  Failures

§  Retry logic is not all it's cracked up to be

Page 34: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Demo Parallel 1

Page 35: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallel load 1 summary

Concurrency Mode: Parallel
Records Loaded: 125,000
Records Failed: 875,000
Run Time: 10 minutes
Work Completed: 2 hours and 30 minutes
Throughput: 20,000 records per minute
Degree of Parallelism: 15.79
Key Problem: Lock exceptions. Server worked significantly harder but no increase in throughput.
Solution: Run the load in serial mode or manage locks.

Page 36: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallelism vs. throughput of a single job

[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Plotted: Serial, Parallel 1]

Parallel Run 1
•  High degree of parallelism
•  Low throughput due to locks

Page 37: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Time to optimize

§  Let’s make your data load

§  Realize –  Locks inhibit parallelism and throughput

§  Investigate –  What is causing the locks

§  Plan –  Manage the locks

Page 38: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Demo Parallel load 2

Eliminate Locks by Modifying Schema

Page 39: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallel load: Sample results

Concurrency Mode: Parallel
Records Loaded: 1 million
Records Failed: 0
Run Time: 3 minutes and 30 seconds
Work Completed: 1 hour
Throughput: 320,000 records per minute
Degree of Parallelism: 19
Key Problem: None
Solution: n/a

Page 40: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallelism vs. throughput of a single job

[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Plotted: Serial, Parallel 1, Parallel 2]

Parallel Run 2
•  High degree of parallelism
•  High throughput

Page 41: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Locks can be managed by

§  Elimination

§  Ordering load file
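
One way to picture the second option: if child rows that share a parent are scattered through the load file, many batches running in parallel touch the same parent and can contend for its lock. Sorting the file by parent groups those rows so that far fewer batches share a parent. The sketch below counts how many parents are touched by more than one batch before and after sorting; the field name `parent_id`, the helper names, and the tiny data set are all hypothetical.

```python
from collections import defaultdict


def batches_per_parent(rows, batch_size, parent_key="parent_id"):
    """Map each parent ID to the set of batch numbers that touch it."""
    touched = defaultdict(set)
    for index, row in enumerate(rows):
        touched[row[parent_key]].add(index // batch_size)
    return touched


def shared_parents(rows, batch_size):
    """Count parents whose children are spread over more than one batch."""
    per_parent = batches_per_parent(rows, batch_size)
    return sum(1 for batch_ids in per_parent.values() if len(batch_ids) > 1)


# 12 child rows cycling over 4 hypothetical parents: A, B, C, D, A, B, ...
rows = [{"id": i, "parent_id": "ABCD"[i % 4]} for i in range(12)]
ordered = sorted(rows, key=lambda row: row["parent_id"])

print(shared_parents(rows, batch_size=3))     # 4
print(shared_parents(ordered, batch_size=3))  # 0
```

Sorting does not guarantee zero overlap in general: a parent with more children than fit in one batch, or one that straddles a batch boundary, still appears in adjacent batches.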

Page 42: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Demo Parallel load 3

Avoid Locks with Ordered Data

Page 43: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Managing locks … a discussion while we load

§  Master-detail relationships

§  Lookup relationships

§  Roll-up summary fields

§  Triggers

§  Workflow rules

§  Group membership locks*

Page 44: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallel load: Sample results

Concurrency Mode: Parallel
Records Loaded: 1 million
Records Failed: 0
Run Time: 4 minutes
Work Completed: 1 hour
Throughput: 250,000 records per minute
Degree of Parallelism: 16.5
Key Problem: Minimal overhead due to locks
Solution: Remove all unnecessary locks

Page 45: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallelism vs. throughput of a single job

[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Plotted: Serial, Parallel 1, Parallel 2, Parallel 3]

Parallel Run 3
•  High degree of parallelism
•  High throughput

Page 46: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Controlled feed/parallel data loads

Page 47: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Controlled feed load methodology

§  Explicit throttling on parallelism and throughput

–  Parallel extraction and loading

–  Prioritization of asynchronous processing capacity

§  Manage inhibitors in complex jobs

–  Data skews

–  Multiple locks
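
One reading of "explicit throttling on parallelism" is that the client, rather than the server queue, decides how many batches are in flight at any moment. The sketch below caps that number with a fixed-size thread pool. It is an illustration under stated assumptions, not the presenters' tool: `controlled_feed`, `ConcurrencyMeter`, and `fake_load` are hypothetical names, and `fake_load` merely sleeps where a real client would submit a batch and wait for it to complete.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyMeter:
    """Track how many batches are in flight at once (for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.in_flight -= 1
        return False


def controlled_feed(batches, load_batch, max_parallel):
    """Feed batches to load_batch with at most max_parallel running at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(load_batch, batches))


meter = ConcurrencyMeter()


def fake_load(batch):
    # Placeholder for submitting one batch and waiting for it to finish.
    with meter:
        time.sleep(0.01)
        return len(batch)


results = controlled_feed([[0] * 10 for _ in range(20)], fake_load, max_parallel=4)
print(sum(results), meter.peak <= 4)  # prints "200 True"
```

Lowering `max_parallel` trades peak throughput for predictability, which matches the "reduced parallelism, expected throughput" outcome described for the controlled feed run.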

Page 48: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Parallelism vs. throughput of a single job

[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Plotted: Serial, Parallel 1, Parallel 2, Parallel 3, Controlled Feed]

Controlled Feed Run
•  Reduced parallelism
•  Expected throughput

Page 49: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Related wiki article and Architect Core Resources

Page 50: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Recap

§  Make your parallel data loads

§  Realize –  Locks inhibit parallelism and throughput

§  Investigate –  What is causing the locks

§  Plan –  Manage the locks

Page 51: Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Q & A

Steve Bobrowski, Architect Evangelist, @sbob909

Sean Regan, Architect Evangelist, @sfdcsregan