Storage Informer

Parallelizing Legacy code using Fine Grained Distributed Processing

by on Sep.02, 2009, under Storage

Adding parallel processing to legacy code is a desire of every software company with an existing product that is significant in complexity and needs to run faster. Processor clock rates are not increasing much; instead, multiple cores are being added to chips. The problem of speeding up software is moving from a hardware improvement problem to a software parallelization problem. This is a follow-on to Why Parallel Processing? Why now? What about my legacy code?; please read that posting for more background.

Typically with multi-core processors, the first thought is to use multiple threads in a shared-memory programming paradigm to parallelize a software algorithm. This approach can work very well, especially for software designed from the ground up to be thread safe and thread efficient. Thread safety means that the threads and data structures are written so that there are no race conditions between threads over shared data. Thread efficient is a term I use for the efficiency of the scheme used to avoid race conditions, as well as the code's ability to use the processors, their caches, and the memory bandwidth effectively enough to keep the processors busy. Making legacy code thread safe and thread efficient is often a difficult task, especially for large pre-existing code bases and/or code bases developed over a long period of time. In such code, a top-level understanding of the call chains and architecture is often incomplete. Simple locking can make the code thread safe, but it often leads to long-duration locks that leave the code thread safe yet not thread efficient. Re-coding the data structures and code can be prohibitively expensive and lengthy for the short-term needs of the software's users.

To scale serial programs through parallelization in the most thread-efficient way typically requires a major re-architecture and re-implementation. This is especially true when scaling programs not only to multi-core systems but to many-core systems. Multi-core usually means systems with 4, 8, maybe 16 CPUs; many-core means systems with at least 16 processors, and usually 64 or 128. It is nearly impossible to get significant speedups on many-core systems without coding with such systems in mind, and I do not think the approaches and shortcuts I have used on multi-core systems will allow legacy software to scale on many-core systems. However, on multi-core systems it is possible to get a decent speedup with smaller changes to the code, using clever software engineering approaches.

In this posting I want to discuss one such approach: using Fine Grained Distributed Processing to speed up compute-intensive pieces of serial programs.

Many programs which run for a long time have parts of the code which are a significant bottleneck or percentage of the runtime. These parts can be found by profiling the code and seeing where the time is spent. Once the code that needs to be sped up is found, the first step, before attempting any parallel coding, is to speed up the single-CPU performance as much as possible. Optimizing the single-CPU algorithm and the data structure layout and access is key to being able to parallelize effectively. If the algorithm is poor, then even the parallelized version will make inefficient use of CPUs. If data locality is poor, causing many cache misses, then memory-to-CPU bandwidth will be the bottleneck in the single-CPU version of the code, and this gets even worse in the N-CPU version, since at the limit the parallel algorithm may require N times the data bandwidth. In my experience this optimization is generally local, requiring changes to the local algorithm and the specific data structures it depends on. Sometimes it turns out to be a more global problem, and when that is the case it is definitely better to find out before trying to parallelize the program, and to fix it in the serial version.
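As a rough sketch of that first step, here is how a hotspot might be located before any parallel work begins. Python stands in for whatever language the legacy code is written in, and `hot_function` is a hypothetical stand-in for the real bottleneck:

```python
import cProfile
import io
import pstats

def hot_function(n):
    # Deliberately quadratic work standing in for the real bottleneck.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

def find_hotspots():
    # Profile a representative run and report the top cumulative-time entries.
    profiler = cProfile.Profile()
    profiler.enable()
    hot_function(200)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
    return buf.getvalue()

report = find_hotspots()
print(report)
```

In a real code base the profile report points at the routine worth optimizing serially before any parallelization is attempted.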

Once the code and data access are optimized, the next step is to find parallelism in the algorithm. Many time-consuming algorithms or program steps involve loops or other constructs which can be decomposed into parallel pieces. If the underlying code and structures are, or can be made, thread safe, then shared-memory threads can be a great way to go. Often this is not the case. When shared-memory threads are not feasible, one possible approach is to start separate processes, either on the local machine, if there are CPUs available, or on remote machines that have good connectivity to the master process.
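The loop-decomposition idea with separate processes can be sketched as follows. This is an illustrative Python snippet, not code from the application, and `reduce_net` is a hypothetical stand-in for one order-independent unit of work:

```python
from multiprocessing import Pool

def reduce_net(net_id):
    # Hypothetical stand-in for one compute-intensive, order-independent
    # piece of work (e.g. reducing one net of a circuit).
    return net_id * net_id

if __name__ == "__main__":
    nets = list(range(100))
    # Serial version of the loop.
    serial = [reduce_net(n) for n in nets]
    # Parallel version: separate worker processes, so no shared state
    # in the legacy code needs to be made thread safe.
    with Pool(processes=4) as pool:
        parallel = pool.map(reduce_net, nets)
    assert serial == parallel
```

Because each worker is a separate process with its own address space, the legacy data structures never have to be protected by locks; the cost is that the work and results must be marshalled between processes.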

The remote processes can then be set up as servers which receive messages with work to do and reply with a result. The client-server paradigm fits well here. It is similar to having running threads whose work is determined by populating a queue. The remote processes are servers, much like a web server, which take requests and return results. The main program is the client and, like an internet browser, orchestrates the work and is the main interface to the user.

The protocol between the master and slave processes should be set up so that the socket programming is the same whether the slave ends up running on the local machine or on a remote host. In this way the distributed program can run on a single multi-core machine or across a farm of machines. On a farm the master program will not know which machine a slave will wake up on, so a small handshaking protocol can be used to initialize the connection between the master and the slave. I have summarized one simple protocol below. This protocol is kept very simple to avoid any chance of deadlock and to make debugging easy. The slaves are set up to log incoming messages, so that if there is a bug in a slave it can be debugged independently of the master and the other slaves.
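To make the shape of such a protocol concrete, here is an illustrative Python sketch. It is not the actual protocol from the application; the `READY` greeting, the 4-byte length prefix, and the worker name are assumptions made for the example:

```python
import socket
import struct
import threading

# Length-prefixed framing: a 4-byte big-endian length, then the payload.
def send_msg(sock, payload):
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock, n):
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed")
        data += chunk
    return data

def recv_msg(sock):
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

def slave(sock):
    # Handshake: announce readiness so the master learns which slave woke up.
    send_msg(sock, b"READY worker-1")
    while True:
        cmd = recv_msg(sock)
        if cmd == b"QUIT":
            break
        # A real slave would execute the command here; this one just echoes.
        send_msg(sock, b"RESULT " + cmd.upper())
    sock.close()

master_sock, slave_sock = socket.socketpair()
threading.Thread(target=slave, args=(slave_sock,), daemon=True).start()

hello = recv_msg(master_sock)            # handshake reply from the slave
send_msg(master_sock, b"reduce net42")   # one unit of work
result = recv_msg(master_sock)
send_msg(master_sock, b"QUIT")
```

The strict request-reply pairing is what keeps deadlock out of the picture: neither side ever has more than one message in flight, so neither can block waiting while the other also waits.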

The application I am working on uses TCL as its command-line language. To keep the protocol even simpler, I found it helpful for the master and slave to be the same executable, and for the master to pass TCL commands whose parameters contained the work to be done and whose return values were the results of that work. Adding new TCL commands to perform the specialized work could then be done in a way that re-used code from the serial master and gave lower-level access to one of the sub-calculations inside the serial routine being parallelized.
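The command-dispatch idea can be sketched as below. The original used TCL commands; this Python version, with a hypothetical `reduce_net` command, just illustrates how new work commands can be registered and then executed against a message payload:

```python
# A tiny command table standing in for the TCL command registry.
COMMANDS = {}

def command(name):
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register

@command("reduce_net")
def reduce_net(args):
    # Hypothetical sub-calculation exposed as a new work command.
    return str(sum(int(a) for a in args))

def execute(line):
    # A slave would call this on each incoming message and send the
    # return value back as the result.
    name, *args = line.split()
    return COMMANDS[name](args)

result = execute("reduce_net 1 2 3")   # one message carrying the work
```

Because the command names and argument parsing are shared between master and slave builds of the same executable, the protocol stays in sync automatically as new commands are added.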

The master program was loaded with the full set of data structures and state of the original serial program. The slave invocations of the same executable ran in a stateless mode: no data was loaded, and the program simply listened on a socket for TCL commands to execute and replied with the results. These stateless servers have the advantage that they use very little memory and free whatever memory they use between work commands. They tend to have great data locality, since the incoming message is converted into a small, targeted set of data structures by the same processor that will perform the calculations. This also helps because a process tends to have affinity for a single processor, and a processor has affinity for the data it allocates.

I call this approach Fine Grained Distributed Processing since the TCL commands can be a very fine grain of work as long as the work is enough to offset the time spent to package up the message for the socket and return the result. Even when running on a single multi-core machine the programming paradigm is distributed.

The servers do not have to strictly be stateless, but the more state they have, the more complex the protocol and synchronization becomes between the processes.

Some observations I have made with this approach: on a modern local machine with 4 CPUs I could send about 30,000 messages of 30 KB each per second when the master sent messages and the slaves did nothing but echo them back. I also noticed in real use that if the task size dropped below about 20 ms, scalability really suffered. These benchmarks need to be taken into account when designing the messages and analyzing the suitability of this approach for a given problem. On our Cadence farm, which has machines with gigabit or better connectivity, I saw the benchmark degrade by about 10% to 30% when the slaves were on remote machines: I could send the same number of shorter messages, or fewer messages of the same size. I am sure this number will vary a lot from farm to farm; it also varied for me from run to run, which I assume was affected by network traffic at the time. In production usage we have used the local machine for fine grained tasks and the remote version for larger tasks of multiple minutes in duration. In this way the message-passing overhead is kept minimal.
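A rough way to reproduce this kind of echo benchmark on your own machine is sketched below, in Python over a local socket pair. The 30 KB payload mirrors the message size above, but the rate you measure will of course differ from mine:

```python
import socket
import threading
import time

def echo(sock):
    # The slave does nothing but send each chunk straight back.
    while True:
        data = sock.recv(65536)
        if not data or data == b"QUIT":
            break
        sock.sendall(data)

master, slave = socket.socketpair()
threading.Thread(target=echo, args=(slave,), daemon=True).start()

payload = b"x" * 30_000   # roughly the 30 KB message size from the benchmark
n = 200
start = time.perf_counter()
for _ in range(n):
    master.sendall(payload)
    got = b""
    while len(got) < len(payload):
        got += master.recv(65536)
elapsed = time.perf_counter() - start
master.sendall(b"QUIT")

rate = n / elapsed        # round-trip messages per second on this machine
```

Comparing the per-message round-trip time against the expected task duration tells you whether a given grain of work is large enough to pay for its own messaging overhead.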

One example application I used this technique on was parallelizing a parasitic reducer of electronic circuits. On a chip, the wires connecting the devices would ideally be perfect conductors, but there are parasitic resistances and capacitances that need to be analyzed before the speed of the circuit can be calculated. This analysis is called reduction, since its output is a reduced mathematical model known as a transfer function. The problem is easily parallelizable, since the wires connecting the devices can be computed in any order. However, the code in question depends on large data structures and a math library that are not thread safe. Rather than make the major changes needed for thread safety, I tried the fine grained distributed approach with sockets and TCL commands, and was able to get around a 3X speedup on 4 CPUs for this step. The example was a small circuit with around 20K nets connecting components.

The Fine Grained Distributed Processing approach can be an easy and effective way to get parallelism out of legacy code for certain problems where the messages are easy enough to compose and not too long compared to the work that needs to be done.

One thing to note is that TCL itself has a robust socket facility which is easy to use and works well across platforms. I recommend it for prototyping and quick work.

This is the first time I have tried to describe this approach in pure written form; usually I do it in person. If you have any questions, post a comment and we can discuss.


