Tag: Chips
Intel Proves Xeon 5500 Processors Are “Cash Machines for IT”
by admin on Oct.07, 2009, under Storage
Intel Proves Xeon 5500 Processors Are “Cash Machines for IT”
When Intel launched “Nehalem-EP,” more commonly known as the Intel Xeon 5500 Series processors back in March, we pointed out that their energy-efficient performance and other attributes lead to a very quick return on investment – as soon as eight months. Pat Gelsinger and Kirk Skaugen even referred to the new servers as literally becoming “Cash Machines for IT.”
Intel began its accelerated server refresh strategy in 2007 – we’re replacing aging servers on a regular basis to cut operating costs, avoiding incremental data center spending and increasing capacity to ensure ongoing innovation. This is especially important since our business computing demand is growing despite economic conditions. This is driven by the innovative design and manufacturing of our chips, which are smaller, faster, and contain an increasing number of features. Our server refresh strategy generated quick results, saving us $45 million in 2008, and we’re continuing to have enormous savings. Some might question the wisdom of a server refresh strategy in the midst of challenging economic times. We actually re-evaluated the strategy but further analysis showed that halting it would actually increase operating costs by $19 million.
An added bonus is the significant energy-efficiency benefits; the server refresh strategy is the No. 1 driver of Intel IT’s carbon footprint reduction. We are projected to reduce our carbon output by 4,000 metric tons in 2009 alone.
Speaking of Green IT, did you know that the newest ally for IT to help drive carbon-reduction and energy cost savings is the energy utilities? A prime example of this is the Energy Trust of Oregon, which offers cash incentives to motivate Oregon businesses to make energy saving investments. Intel gained access to a $250,000 incentive from them as a result of energy savings gained by replacing older servers with newer, more energy-efficient servers in our data centers.
“We applaud and support efforts to reduce datacenter energy consumption at companies from all industries in Oregon,” said Corban Lester, Program Development Manager for Energy Trust’s Data Center Initiative. “By replacing older servers with modern energy-efficient servers, Enterprise IT departments and small/medium business owners may be eligible for utility incentives for energy savings and can lower their operating costs and reduce the impact of their business on the environment.”
Customers such as DreamWorks, Oracle, Humana, and BMW, are just a few of the companies reaping sizeable benefits by refreshing their data centers with Xeon 5500-based servers. In fact, DreamWorks if positively giddy over the chip.
“Nehalem is—by a substantial margin—the fastest architecture we have ever used in the render farm,” said Derek Chan, head of digital operations, DreamWorks Animation. “It performs well on a per-core basis and the wide-open I/O bandwidth lets us keep all cores busy, much more efficiently than previous architectures.”
If you’re still not convinced, try using the Xeon Server Refresh Savings Estimator server to assess the value of replacing aging x86 hardware with Xeon 5500-based servers.
You can also check out this video of a press briefing our CIO Diane Bryant recently held where she and some Intel customers discussed the success of their server refresh strategies.
Comments (0)
URL: http://feedproxy.google.com/~r/IntelBlogs/~3/fqmhtpx2rI0/intel_proves_nehalem-ep_proces.php
Parallelizing Legacy code using Fine Grained Distributed Processing
by admin on Sep.02, 2009, under Storage
Parallelizing Legacy code using Fine Grained Distributed Processing
Adding parallel processing to legacy code is a desire of every software company that has an existing product which is significant in complexity and which needs to run faster. Processor clock rates are not increasing much and now multiple cores are being added to chips instead. The problem of speeding up software is moving from a hardware improvement problem to a software parallelization problem. This is a follow-on post to Why Parallel Processing? Why now? What about my legacy code?, please read this posting for more background.
Typically with multi-core processors, the first thought is to use multiple threads in a shared memory programming paradigm to parallelize a software algorithm. This approach can work very well, especially for software designed from the ground up to be thread safe and thread efficient. Thread safety means that the threads and data structures are written in such a way that there are no race conditions between the threads for shared data. Thread efficient is a term I use to discuss the efficiency of the scheme used to avoid race conditions as well as the codeˇ¦s ability to efficiently use the processor and its cache and memory bandwidth to keep the processors busy. Making legacy code thread safe and thread efficient is often a difficult task, especially for large pre-existing code bases and/or code bases that have been developed over a long period of time. In such code a top level understanding of the code call chains and architecture is often not complete. Simple locking can make the code thread safe but often leads to locks which have a long duration and make the code thread safe but not thread efficient. Re-coding the data structures and code can be prohibitively expensive and lengthy for the short term needs of the softwareˇ¦s user.
To scale serial programs through parallelization in the most thread efficient way typically involves a major re-architecture and re-implementation. This is especially true when scaling programs not only to multi-core systems but to many-core systems. Multi-core is usually defined as systems with 4, 8 maybe 16 cpus. Many-core is defined as systems having at least 16 processors and usually 64 or 128 processors. It is nearly impossible to get significant speedups on many-core systems without coding with such systems in mind. I do not think the approaches and shortcuts I have used on multi-core systems will allow legacy software to scale on many-core systems. However it is possible on multi-core systems to get a decent speedup with smaller changes to the code using clever software engineering approaches.
In this posting I want to discuss one such approach which is the use of Fine Grained Distributed Processing to speed up compute intensive pieces of serial programs.
Many programs which run for a long time have parts of the code which are a significant bottleneck or percentage of the runtime. These parts of the code can be found by profiling the code and looking to see where the time is spent. Once the code that needs to be sped up is found, the first step, before attempting any parallel coding, is to speed up the single CPU performance as much as possible. Optimizing the single CPU algorithm and data structure layout and access is key to being able to parallelize effectively. If the algorithm is poor then even when parallelized there will still be inefficient use of cpus. If the data locality is poor causing many cache misses then the memory to CPU data bandwidth will be the bottleneck in the single CPU version of the code and this will get even worse in the N-cpu version of the code since at the limit the parallel algorithm may require N times the amount of data bandwidth. In my experience this optimization is generally a local optimization and requires changing the local algorithm and the specific data structures upon which it depends. Sometimes it turns out to be a more global problem and when this is the case it is definately better to find out about it before trying to parallelize the program and fix it in the serial version.
Once the code and data access is optimized, then the next step is to find parallelism in the algorithm. Many times algorithms or steps of a program which are time consuming involve loops or other constructs which can be decomposed into parallel pieces. If the underlying code and structures are or can be made thread safe then shared memory threads can be a great way to go. Often this is not the case. When shared memory threads are not feasible one possible approach is to start separate processes, either on the local machine, if there are cpus available, or on remote machines that have good connectivity to the master process.
The remote processes can then be set up as servers which receive messages with work to do and which reply with a result. The client server paradigm fits well here. This paradigm is similiar to having threads running and whose work is determined by populating a queue. The remote processes are servers much like a web server which take requests and return results. The main program is the client and like an internet browser orchestrates the work and is the main interface to the user.
The protocol between the master and slave processes should be set up so that whether the slave ends up running on the local machine or a remote host that the socket programming will be the same. In this way the distributed programming can work on a single multi-core machine or across a farm of machines. On a farm the master program will not know what machine the slave will wake up on so a small handshaking protocol can be set up to initialize the connection between the master and the slave. I have summarized one simple protocol below. This protocol is kept very simple to avoid any chance of deadlock and to make debugging simple. The slaves are set up to log incoming messages so that if there is a bug in the slave it can be debugged independently of the master and other slaves.
The application which I am working on uses TCL as its command line language. To keep the protocol even simpler I found it helpful for the master and slave to be the same executable and for the master to pass TCL commands whose parameters contained the work to be done and whose return values were the results of the work. In this way the act of adding new TCL commands to perform the specialized work could be done in such a way as to re-use code from the serial master and allow a lower level access to one of the sub-calculations which was part of the serial master routine that is being parallelized.
The master program was loaded with the full set of data structures and state of the original serial program. The slave invocations of the same executable were done in a stateless mode. No data was loaded and the program was set up to listen to a socket for TCL commands to execute and to reply to with results. These stateless servers have the advantage that they use very little memory and free any memory used from work command to work command. They tend to have great data locality since the incoming message is converted to a small targeted set of data structures by the same processor which will perform the calculations. It is also helpful since a process tends to have affinity for a single processor and a processor has affinity for the data which it allocates.
I call this approach Fine Grained Distributed Processing since the TCL commands can be a very fine grain of work as long as the work is enough to offset the time spent to package up the message for the socket and return the result. Even when running on a single multi-core machine the programming paradigm is distributed.
The servers do not have to strictly be stateless, but the more state they have, the more complex the protocol and synchronization becomes between the processes.
Some observations that I have with this approach are that on a modern local machine with 4cpus I could send 30K 30KB size messages per second when the master sent messages and the slaves did nothing but echo the message back to the master. I also noticed in real use that if the task size became less that 20ms that scalability really would suffer. These benchmarks need to be taken into account when designing the messages and analyzing the suitability of this approach for a given problem. On our Cadence Farm which has machines with gigabit or better connectivity I saw the benchmark degrade by about 10% to 30% when the slaves were on remote machines. I could send the same number of shorter messages or reduce the number of messages. This number I am sure will vary a lot from farm to farm and also varied for me from run to run which I assume was affected by network traffic at the time. In production usage we have used the local machine for fine grained tasks and use the remote version for larger tasks of multiple minutes in duration. In this way the message passing overhead is kept minimal.
One example application I used this technique on was to parallelize a parasitic reducer of electronic circuits. Basically on a chip the wires connecting the chips would ideally be perfect conductors but there are parasitic resistances and capacitances that need to be analyzed before the speed of the circuit can be calculated. This analysis is called reduction since the output of the analysis is a reduced mathematical model known as a transfer function. This problem is easily parallelizable since the wires connecting the devices can be computed in any order. However the code in question depends on large data structures and a math library that are not thread safe. Rather than make the major changes needed for thread safety, I tried the fine grained distributed approach with sockets and tcl commands. I was able to get around a 3X speedup on 4cpus for this step. An example of a small circuit with around 20K nets connecting components is below.
The Fine Grained Distributed Processing approach can be an easy and effective way to get parallelism out of legacy code for certain problems where the messages are easy enough to compose and not too long compared to the work that needs to be done.
One thing to note is that TCL itself has a robust socket language which is easy to use and works well across platforms. I recommend it for prototyping and quick work. http://www.tcl.tk/about/netserver.html has a small example.
This is the first time I have tried to describe this approach in pure written form, usually I do it in person. If you have any questions post a comment and we can discuss.
URL: http://feedproxy.google.com/~r/IntelBlogs/~3/G0TvMnV1VLw/
Making of Core i7 — What’s Next?
by admin on Aug.19, 2009, under Storage
Making of Core i7 — What’s Next?
Nehalem played a big role during IDF 2008. Many new details were introduced during keynotes and in briefings, including features built into the microarchitecture like Turbo Mode, which allow the cores to dynamically scale up to handle demanding needs like video encoding or scale down to use just enough of a core to finish a simple task like typing.
Just after IDF 2008 in San Francisco, we visited some of the engineers who played a role in the making of the Intel Core i7 processor. This video was create just as the first Nehalem designed processors for consumers were about to hit the market. We visited design and testing labs in Hillsboro, Oregon and Santa Clara, California.
As we get closer to IDF 2009 in San Francsico, weˇ¦ll be visiting with manufacturing, design and test engineers working away on whatˇ¦s nextˇKcodename Westmere.
Where Nehalem was new chip archeticture design, Westmere is the next design being used to build processors that feature two 32nm cores with 4MB of cache that sit next to a memory controller and integrated graphics built on a separate, neighboring 45nm chip, all in one package. Westmeres will be the basis of upcoming all new Core chips (Core i3, i5, and 7) over the next few months.
Westmere processors will share some of the same features that were built into Nehalem, including Hyper-Threading and Turbo Boost (descirbed above).
Some Westmeres will feature HyperThreading will allow each core to handle two threads ˇX or process two jobs at once.
As we work on the ˇ§Making ofˇ¨ Westmere video prior to IDF, here is a video I shot with Intelˇ¦s Steve Smith, as he prepared for the first public demonstation of a Westmere-designed chip earlier this year.
If youˇ¦d like to follow the integration and innovation stories leading up to, during and after the Intel Developer Forum (September 22-24 in San Francisco), please follow:
- Facebook Fan Page
- @IDF Twitter
- @IntelNews on Twitter
- And during the event, news will be available in the Intel Pressroom
Comments (0)
URL: http://feedproxy.google.com/~r/IntelBlogs/~3/kkytO63c0RE/making_of_core_i7_–_whats_nex.php