Data array processing with Java
Post on 03-Jul-2015
Data array processing
Vitalii Tymchyshyn, tivv00@gmail.com
http://michaelgr.com/
Data array processing
What will we be talking about
Real life Example
Main problems to be solved
Configurable task processing
Clustering solutions
Grid vs Client-Server
Controlling CPU usage
What will we be talking about
You have a large set of tasks or task stream
Each task is relatively large; its processing is CPU-intensive and requires multiple algorithms
The goal is to maximize overall performance, not to minimize the processing time of a single task
Real life Example
The task is web crawling with content analysis
Complex Artificial Intelligence algorithms are used, and each release changes the resource consumption profile
Some algorithms take single page, others take domain as a whole
Target: 1000s of domains, 100000s of pages per day
Main problems to be solved
Make single-task processing configurable, so that algorithms are independent and easy to extend or replace
Make the solution scalable & solid
Make it use the equipment fully, but without overload
1. Configurable task processing
The most popular way is an IoC container, e.g. Spring
Another option is dataflow – beans do not call each other layer by layer, but are invoked by the container one by one, each taking input and producing output
Let's compare the options with “Hello, world” and check whether the second one has any benefits
“Hello, world” with IoC
Daytime retriever
Greeting text reader
Greeting printer
Answer printer
User input reader
Hello World runner
“Hello, world” graph dataflow
Read greeting text
Check day time
Print greeting
Read user input
Print answer
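The dataflow nodes above can be sketched in Java: nodes are invoked by a container, each consuming a value of one data type and producing another. This is a minimal illustrative runner, not the talk's actual framework; all names and the three-node graph are simplifications.

```java
import java.util.*;
import java.util.function.Function;

// Minimal graph-dataflow sketch: the container fires any node whose
// input data is present and whose output is not yet produced.
public class HelloDataflow {
    static final Map<String, Object> data = new HashMap<>();

    // A node declares its input key, output key, and processing function.
    record Node(String in, String out, Function<Object, Object> f) {}

    static void run(List<Node> graph) {
        boolean progress = true;
        while (progress) {                 // naive scheduler: loop until quiescent
            progress = false;
            for (Node n : graph) {
                if (data.containsKey(n.in) && !data.containsKey(n.out)) {
                    data.put(n.out, n.f.apply(data.get(n.in)));
                    progress = true;
                }
            }
        }
    }

    public static String greet(String user) {
        data.clear();
        data.put("user", user);
        run(List.of(
            new Node("user", "greetingText", u -> "Hello"),           // "Read greeting text"
            new Node("greetingText", "greeting", g -> g + ", world"), // "Print greeting"
            new Node("greeting", "answer", g -> g + "! (answered)")   // "Print answer"
        ));
        return (String) data.get("answer");
    }

    public static void main(String[] args) {
        System.out.println(greet("alice")); // Hello, world! (answered)
    }
}
```

Note the decoupling: nodes share only data types (the map keys), not interfaces, so any node can be replaced, parallelized, or moved to another machine.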
Graph dataflow Pros & Cons
Pros
Full decoupling
Easy parallel processing, clustering & savepoints
Automatic flow management
Single call to get data needed in many places
Data types instead of interfaces
Graph dataflow Pros & Cons
Cons
IoC is still good to use to manage common resources and complex nodes :)
Our graph vs BPM
Lightweight
Connections are data, no central storage
Targeted at small (minutes to hours) automated CPU-intensive jobs, with subtasks from milliseconds to minutes
Highly configurable clustering
Conclusion
With graph dataflow we have algorithm parts as independent blocks
Time to use these blocks to fill our equipment efficiently.
2. Scalability with Clustering
The simple way is:
to have multiple VMs
each fully processes its set of tasks
each task has its working set at hand
2. Scalability with Clustering
But:
Each algorithm's initialization data takes memory even while only one algorithm is running
Algorithm may require only small part of task data
Task processing at some point may be split and processed in parallel
Solution: Clustering
Example: web domain processing
Get domain data
Mark cities (needs world city index)
Detect addresses
Define primary address
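The per-page vs. per-domain split above can be sketched as two algorithm interfaces: page-level algorithms run independently per page, while a domain-level algorithm sees all page results at once. The interface and method names are illustrative assumptions, not the talk's API.

```java
import java.util.*;

// Page-level step (e.g. city marking, address detection): runs per page.
interface PageAlgorithm   { String processPage(String page); }
// Domain-level step (e.g. primary-address selection): needs the whole domain.
interface DomainAlgorithm { String processDomain(List<String> pageResults); }

public class DomainPipeline {
    public static String run(List<String> pages,
                             PageAlgorithm perPage,
                             DomainAlgorithm perDomain) {
        List<String> marked = new ArrayList<>();
        for (String p : pages)
            marked.add(perPage.processPage(p));   // parallelizable, page by page
        return perDomain.processDomain(marked);   // domain-wide step sees all pages
    }

    public static void main(String[] args) {
        // Stand-in algorithms: uppercase as "mark cities", first result as "primary".
        String primary = run(List.of("page about Kyiv", "page about Lviv"),
                p -> p.toUpperCase(Locale.ROOT),
                results -> results.get(0));
        System.out.println(primary); // PAGE ABOUT KYIV
    }
}
```

Because page-level steps depend only on one page, the cluster can fan them out to many nodes and keep only the final domain-level step centralized.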
Example: cluster setup
Primary processor: reads the task & performs primary address detection
City mark processor: needs memory for the city index, works page by page, fast
Address detector: works page by page, slow, but you can have many of them because of the low memory footprint
Cluster in focus
10-20 hardware nodes
FreeBSD OS, jails in use, so no multicast
Oracle Grid Engine (formerly SGE) as cluster process controller
Complex, memory consuming tasks, with JNI (crashes, long GC)
Two faces of cluster Janus
Data cluster
Is good for storing task data
Can be replaced with a central data warehouse, but scalability will suffer
You had better separate it from the computing VM if computing is complex
Can perform Computing cluster functions
Computing cluster
Is good for running tasks from multiple task producers
Can be grid-based or client-server
Multiple small clusters may be better than one large one
Hazelcast
One of few free Data Grids
Has built-in Computing Grid abstractions
Good support from developers
but
Bugs in unusual usage patterns
Simply did not work reliably in our environment
GridGain
May fit like a glove
You'd better not make a mitten out of the glove
Heavy annotation use – problems with runtime configuration
Weakened interfaces – here's a shotgun, there's your foot
Unreliable vendor support
ZooKeeper
The thing we are using now
Low-level primitives, yet there are 3rd-party libraries with higher-level abstractions
Client-Server architecture
Clustered servers for stability and read scalability.
No write scalability
Part of Hadoop
HDFS
Has a Single Point of Failure (the name node)
Name node memory requirements grow linearly with the number of files
Uses TCP (don't forget to tune the OS TCP stack)
Much like a Unix file system
Again, part of Hadoop
Two types of The Time
Wall Time = CPU Time + Wait + Latency
External wait is managed with cooperative multitasking (discussed later)
Latency is vital for interactive services, but has low priority for data processing
Grid vs Client-Server
Grid:
Latency is two times lower
A lot more connections
Everyone is watching everyone
Complex cluster membership change procedure
Client-Server:
More robust
Servers can be clustered
Central point for debugging
No “watching deads” overhead for everyone
Conclusion
Now our tasks are spread on our equipment.
Time to prevent resource overload!
3. Controlling CPU usage
Low load means processing power is not used
High load means that:
Parallel tasks unnecessarily take memory
High system time because of context switches
CPU caches are split across switching tasks
Lower total throughput
Parallel vs Sequential on single CPU
(Diagram: four equal tasks run time-sliced in parallel vs. one after another on a single CPU)
Sequential (LA=1):
Average finish time = (1+2+3+4)/4 = 2.5
Parallel (LA=4):
Average finish time = (4+4+4+4)/4 = 4
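The arithmetic above can be checked directly: with n equal unit-length tasks on one CPU, sequential execution finishes task i at time i, while fully parallel time-slicing finishes all n tasks at time n.

```java
// Average finish time of n equal unit tasks on a single CPU.
public class FinishTime {
    // Sequential (LA=1): the i-th task finishes at t = i.
    public static double sequential(int n) {
        double sum = 0;
        for (int i = 1; i <= n; i++) sum += i;
        return sum / n;
    }

    // Parallel (LA=n): all n tasks finish together at t = n.
    public static double parallel(int n) {
        return n;
    }

    public static void main(String[] args) {
        System.out.println(sequential(4)); // 2.5
        System.out.println(parallel(4));   // 4.0
    }
}
```

So on a CPU-bound workload, running tasks sequentially improves the average finish time without changing total throughput, which is why overloading the CPU with parallel tasks buys nothing.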
Thread-pool:
Multiple tasks are processed at one time by different threads
There should be enough threads to use the CPU while some are blocked
There should not be too many threads, so as not to overload the CPU
Event-based:
One task is processed at given time
There must be no blocking
Any blocking call must be replaced with a callback / event
Popular options to process tasks
Thread-pool vs event-based
Thread-pool:
Pros
Simple procedural model
A lot of libraries & frameworks
Cons
Context-switch storms
Per-thread memory
Average speed
Event-based:
Pros
Optimal pool size
No waiting threads memory overhead
Cons
More complex event-based programming
Few supporting libraries & frameworks
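The thread-pool sizing trade-off above ("enough threads to use the CPU while some are blocked, but not too many") is often expressed with a wait-to-compute ratio heuristic. This sketch assumes that heuristic; the ratio value and method names are illustrative, not from the talk.

```java
import java.util.concurrent.*;

// Sizing a fixed thread pool: roughly cores * (1 + wait/compute),
// so blocked threads don't idle the CPU, yet the pool cannot grow
// into a context-switch storm.
public class PoolSizing {
    public static int poolSize(int cores, double waitToComputeRatio) {
        return (int) Math.max(1, Math.round(cores * (1 + waitToComputeRatio)));
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        // Assume tasks spend about half as much time waiting as computing.
        ExecutorService pool = Executors.newFixedThreadPool(poolSize(cores, 0.5));
        Future<Integer> f = pool.submit(() -> 6 * 7); // a mostly-CPU task
        System.out.println(f.get());                  // 42
        pool.shutdown();
    }
}
```

A purely CPU-bound workload (ratio 0) degenerates to one thread per core, matching the sequential argument from the previous slide.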
Introducing cooperative multitask
It's much like thread-pool variant, but:
Each wait (IO) is signaled to multitask coordinator
Only one thread can be in the no-wait state; another thread exiting a wait will block on a mutex.
If the system is overloaded (the mutex is always taken), new tasks are not started.
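The coordinator described above can be sketched with a single-permit semaphore standing in for the mutex: a thread releases the permit around every blocking wait and re-acquires it afterwards, and new tasks are refused while the permit is taken. Class and method names are illustrative assumptions.

```java
import java.util.concurrent.Semaphore;

// Cooperative-multitasking coordinator sketch: at most one thread
// computes at any moment; waits are signaled so the CPU permit can
// be handed to another runnable thread.
public class Coordinator {
    private final Semaphore cpu = new Semaphore(1, true); // fair: FIFO hand-off

    public void beforeWait() { cpu.release(); }           // entering IO: free the CPU
    public void afterWait() throws InterruptedException {
        cpu.acquire();                                    // exiting IO: block until CPU is free
    }
    public boolean tryStartTask() { return cpu.tryAcquire(); } // overloaded => refuse new tasks
    public void finishTask() { cpu.release(); }

    public static void main(String[] args) {
        Coordinator c = new Coordinator();
        System.out.println(c.tryStartTask()); // true: CPU was free
        System.out.println(c.tryStartTask()); // false: someone is computing
        c.beforeWait();                       // the running task blocks on IO
        System.out.println(c.tryStartTask()); // true: CPU is free again
    }
}
```

Note that `java.util.concurrent.Semaphore` has no ownership, so one thread may release the permit another thread acquired, which is exactly the hand-off this scheme needs.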
Cooperative multitasking features
Still simple procedural model
Controlled CPU usage
Low waiting-thread memory usage, because of no layered calls in graph dataflow
All regular frameworks & libraries are available
Dynamic thread pool size