Tag: Browser
Parallelizing Legacy code using Fine Grained Distributed Processing
by admin on Sep.02, 2009, under Storage
Parallelizing Legacy code using Fine Grained Distributed Processing
Adding parallel processing to legacy code is a desire of every software company that has an existing product which is significant in complexity and which needs to run faster. Processor clock rates are not increasing much and now multiple cores are being added to chips instead. The problem of speeding up software is moving from a hardware improvement problem to a software parallelization problem. This is a follow-on post to Why Parallel Processing? Why now? What about my legacy code?, please read this posting for more background.
Typically with multi-core processors, the first thought is to use multiple threads in a shared memory programming paradigm to parallelize a software algorithm. This approach can work very well, especially for software designed from the ground up to be thread safe and thread efficient. Thread safety means that the threads and data structures are written in such a way that there are no race conditions between the threads for shared data. Thread efficient is a term I use to discuss the efficiency of the scheme used to avoid race conditions as well as the codeˇ¦s ability to efficiently use the processor and its cache and memory bandwidth to keep the processors busy. Making legacy code thread safe and thread efficient is often a difficult task, especially for large pre-existing code bases and/or code bases that have been developed over a long period of time. In such code a top level understanding of the code call chains and architecture is often not complete. Simple locking can make the code thread safe but often leads to locks which have a long duration and make the code thread safe but not thread efficient. Re-coding the data structures and code can be prohibitively expensive and lengthy for the short term needs of the softwareˇ¦s user.
To scale serial programs through parallelization in the most thread efficient way typically involves a major re-architecture and re-implementation. This is especially true when scaling programs not only to multi-core systems but to many-core systems. Multi-core is usually defined as systems with 4, 8 maybe 16 cpus. Many-core is defined as systems having at least 16 processors and usually 64 or 128 processors. It is nearly impossible to get significant speedups on many-core systems without coding with such systems in mind. I do not think the approaches and shortcuts I have used on multi-core systems will allow legacy software to scale on many-core systems. However it is possible on multi-core systems to get a decent speedup with smaller changes to the code using clever software engineering approaches.
In this posting I want to discuss one such approach which is the use of Fine Grained Distributed Processing to speed up compute intensive pieces of serial programs.
Many programs which run for a long time have parts of the code which are a significant bottleneck or percentage of the runtime. These parts of the code can be found by profiling the code and looking to see where the time is spent. Once the code that needs to be sped up is found, the first step, before attempting any parallel coding, is to speed up the single CPU performance as much as possible. Optimizing the single CPU algorithm and data structure layout and access is key to being able to parallelize effectively. If the algorithm is poor then even when parallelized there will still be inefficient use of cpus. If the data locality is poor causing many cache misses then the memory to CPU data bandwidth will be the bottleneck in the single CPU version of the code and this will get even worse in the N-cpu version of the code since at the limit the parallel algorithm may require N times the amount of data bandwidth. In my experience this optimization is generally a local optimization and requires changing the local algorithm and the specific data structures upon which it depends. Sometimes it turns out to be a more global problem and when this is the case it is definately better to find out about it before trying to parallelize the program and fix it in the serial version.
Once the code and data access is optimized, then the next step is to find parallelism in the algorithm. Many times algorithms or steps of a program which are time consuming involve loops or other constructs which can be decomposed into parallel pieces. If the underlying code and structures are or can be made thread safe then shared memory threads can be a great way to go. Often this is not the case. When shared memory threads are not feasible one possible approach is to start separate processes, either on the local machine, if there are cpus available, or on remote machines that have good connectivity to the master process.
The remote processes can then be set up as servers which receive messages with work to do and which reply with a result. The client server paradigm fits well here. This paradigm is similiar to having threads running and whose work is determined by populating a queue. The remote processes are servers much like a web server which take requests and return results. The main program is the client and like an internet browser orchestrates the work and is the main interface to the user.
The protocol between the master and slave processes should be set up so that whether the slave ends up running on the local machine or a remote host that the socket programming will be the same. In this way the distributed programming can work on a single multi-core machine or across a farm of machines. On a farm the master program will not know what machine the slave will wake up on so a small handshaking protocol can be set up to initialize the connection between the master and the slave. I have summarized one simple protocol below. This protocol is kept very simple to avoid any chance of deadlock and to make debugging simple. The slaves are set up to log incoming messages so that if there is a bug in the slave it can be debugged independently of the master and other slaves.
The application which I am working on uses TCL as its command line language. To keep the protocol even simpler I found it helpful for the master and slave to be the same executable and for the master to pass TCL commands whose parameters contained the work to be done and whose return values were the results of the work. In this way the act of adding new TCL commands to perform the specialized work could be done in such a way as to re-use code from the serial master and allow a lower level access to one of the sub-calculations which was part of the serial master routine that is being parallelized.
The master program was loaded with the full set of data structures and state of the original serial program. The slave invocations of the same executable were done in a stateless mode. No data was loaded and the program was set up to listen to a socket for TCL commands to execute and to reply to with results. These stateless servers have the advantage that they use very little memory and free any memory used from work command to work command. They tend to have great data locality since the incoming message is converted to a small targeted set of data structures by the same processor which will perform the calculations. It is also helpful since a process tends to have affinity for a single processor and a processor has affinity for the data which it allocates.
I call this approach Fine Grained Distributed Processing since the TCL commands can be a very fine grain of work as long as the work is enough to offset the time spent to package up the message for the socket and return the result. Even when running on a single multi-core machine the programming paradigm is distributed.
The servers do not have to strictly be stateless, but the more state they have, the more complex the protocol and synchronization becomes between the processes.
Some observations that I have with this approach are that on a modern local machine with 4cpus I could send 30K 30KB size messages per second when the master sent messages and the slaves did nothing but echo the message back to the master. I also noticed in real use that if the task size became less that 20ms that scalability really would suffer. These benchmarks need to be taken into account when designing the messages and analyzing the suitability of this approach for a given problem. On our Cadence Farm which has machines with gigabit or better connectivity I saw the benchmark degrade by about 10% to 30% when the slaves were on remote machines. I could send the same number of shorter messages or reduce the number of messages. This number I am sure will vary a lot from farm to farm and also varied for me from run to run which I assume was affected by network traffic at the time. In production usage we have used the local machine for fine grained tasks and use the remote version for larger tasks of multiple minutes in duration. In this way the message passing overhead is kept minimal.
One example application I used this technique on was to parallelize a parasitic reducer of electronic circuits. Basically on a chip the wires connecting the chips would ideally be perfect conductors but there are parasitic resistances and capacitances that need to be analyzed before the speed of the circuit can be calculated. This analysis is called reduction since the output of the analysis is a reduced mathematical model known as a transfer function. This problem is easily parallelizable since the wires connecting the devices can be computed in any order. However the code in question depends on large data structures and a math library that are not thread safe. Rather than make the major changes needed for thread safety, I tried the fine grained distributed approach with sockets and tcl commands. I was able to get around a 3X speedup on 4cpus for this step. An example of a small circuit with around 20K nets connecting components is below.
The Fine Grained Distributed Processing approach can be an easy and effective way to get parallelism out of legacy code for certain problems where the messages are easy enough to compose and not too long compared to the work that needs to be done.
One thing to note is that TCL itself has a robust socket language which is easy to use and works well across platforms. I recommend it for prototyping and quick work. http://www.tcl.tk/about/netserver.html has a small example.
This is the first time I have tried to describe this approach in pure written form, usually I do it in person. If you have any questions post a comment and we can discuss.
URL: http://feedproxy.google.com/~r/IntelBlogs/~3/G0TvMnV1VLw/
Vizioncore Offers vConverter SC and vControl Multi-Hypervisor Management as Freeware
by admin on Sep.02, 2009, under Storage
Vizioncore Offers vConverter SC and vControl Multi-Hypervisor Management as Freeware
————————————————————————————————————-
————————————————————————————————————-
Vizioncore today announced that vControl’s multi-hypervisor management capability will now be available for free.
Customers who have VMware ESX infrastructures, Microsoft Hyper-V, and Citrix XenServer infrastructure can now manage these environments through a single console – helping to reduce administration costs and improve management consistency.
vControl Multi-Hypervisor Management provides a browser-based interface that enables control of multiple platforms, allowing administrators to leverage easy-to-trigger workflows for task-based administration of VMs. Furthermore, vControl Multi-Hypervisor Management provides a cost-effective solution for high availability in the data center.
In addition to vControl Multi-Hypervisor Management, Vizioncore offers vControl Self-Service Provisioning for $399/socket. With this product VM consumers can build and provision VMs for themselves through customizable VM templates.
Furthermore, the company has also announced that vConverter SC (Server Consolidation) will now be available for free as well. Customers may use the tool to accelerate towards a virtualized environment, making the most of maintenance windows and reducing downtime on current physical machines.
vConverter SC offers an intuitive GUI to enable rapid simultaneous conversions of single or multiple servers without disruption to the source system. By combining cutting-edge disk and networking technology with extensive automation of pre- and post-conversion tasks, it boasts the ability to successfully migrate more servers per conversion window, saving valuable maintenance time. Furthermore the product is compatible with hypervisors from leading vendors such as VMware, Microsoft and Citrix, making it the perfect fit for almost every environment.
URL: http://feedproxy.google.com/~r/Virtualizationdotcom/~3/V43f8CsmEFs/
and on getting heckled
by admin on Jul.24, 2009, under Storage
Moblin – the Web2.0 OS – and on getting heckled
I gave a chalk talk yesterday on Moblin at OSCON 2009, the Open Source Convention in San Jose. This was a no slides white-board chat about what&aposs cool in Moblin, where is the future going, and what are the ways people can collaborate and participate.
I touched on some of the high points of what makes Moblin great as an OS design *for* netbooks, rather than adapted to them as an afterthought. I had one heckler yell at me from a neighboring booth (more on this later) and shared my little epiphany about Web 2.0 and Moblin.
- Fast boot – Moblin in its current form has a goal to boot in 8 seconds from when the BIOS finishes its work to the moment the CPU and Disk are truly idle and able to do useful work. I talked about some of the really hard work and some tricks we use to achieve this.
- Fast shutdown – I commented on how quick Moblin is to shut down, just hit the power switch and it&aposs off within a few seconds. I mentioned that when you want to turn off your cell phone, TV, whatever, you hit the power switch, and it&aposs the same with netbooks on Moblin.
- User Experience – I refered to the great work that the guys at Opened Hand have done to make an amazing user experience.
- Web 2.0 – it&aposs built in, in fact it&aposs a fundamental part of the OS experience.
The heckling came when I mentioned fast shutdown. You see Moblin doesn&apost have a shutdown "button" or anything in the OS. We just use the power button from the netbook itself. I began kidding around and saying, "I remember one particular OS, which will remain nameless, in which you had to hit the &aposStart&apos button to &aposStop&apos the computer."
From the next booth, I hear a call, "Hey! We&aposre listening here!". Yes, the booth next door was the Microsoft booth. Hey, I didn&apost mention the OS, did I? Guess they&aposre just sensitive.
But the epiphany I shared on Web 2.0 was three or four years ago when I sent an email to one of my teen-age daughters, and then asked her a few days later if she had seen it.
"No, Dad; I don&apost look at my email every day!"
What? How can this be? A technologically plugged-in student who isn&apost addicted to mail? But then I realized that for the generation after mine that email is really dead as a communication tool. Blogs and IM have taken its place.
Fast forward to today, and blogs are just about dead too – Twitter, Flickr and Facebook have taken their place of prominence because of their immediacy.
And Moblin 2.0 is designed with social media as a core and foundational part of the experience of the OS. It&aposs been years since browsing alone was the core of the operating system. I&aposm amused to see that even today, there are OS&aposs announced which a little more than a browser on a simple OS, and this creates all kinds of excitement. In fact, having social media integrated at a fundamental level is what makes me so excited about Moblin.
URL: http://feedproxy.google.com/~r/IntelBlogs/~3/44AwfvXISjE/