Storage Informer

Tag: Dedupe

Storage Is Software

by on Jul.09, 2010, under Storage

The title of this post is a variant on the intriguing "infrastructure is code" meme from the #devops community. I think it’s a useful idea to remind ourselves of — especially as technology transitions.

Even though some of you reading this might think the statement is blindingly obvious, it’s clear that the vast majority of people think of boxes with blinking lights when you say the word "storage".

And I think this is going to change sooner rather than later.

Things Change

Having now been directly involved in storage for over 15 years, I feel I can safely make a reasonable judgment when things are changing.

So let’s go look at the current landscape …

For starters, most storage hardware today is built out of the same industry-standard parts bin used by the server guys.  Yes, there are a few storage stalwarts trying to claim differentiation through this bit or that bit of unique silicon, but the secular trend is pretty obvious — parts is parts.

Now, I think there’s still room for useful hardware differentiation in areas like innovative architecture, or clever packaging, or using the latest merchant silicon chips, or perhaps more reliable manufacturing processes. 

All that being said, I think the opportunities for sustained differentiation through hardware prowess alone are becoming more rare over time.

And we all play in a very competitive market indeed.  Just as customers won’t accept dated or over-priced server hardware designs, they won’t accept dated or over-priced storage hardware designs.

Thinking About Storage Software

At its most basic level, you expect to write information to a storage platform, and get it back again. 

You’d like to do so in a convenient format — more traditional block and file formats, perhaps something newer like objects, even maybe something like tables.  That’s a function of software, not hardware.

You’d like the integrity of the data protected from all sorts of bad things that can happen — hardware failures, software failures, human error, the list goes on.  That’s a function of software, not hardware.

You’d like to wring the maximum in performance and efficiency from the hardware you own: move the popular data to the high-performance media, the less-popular data to cost-effective stuff, and wring the excess capacity out with things like compression and deduplication.

More software.
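
To make the tiering idea concrete, here is a minimal sketch of a placement policy that ranks data by access count and puts the most popular items on the fast media. The function name, the file names, the access counts and the fixed number of fast slots are all made up for illustration; real auto-tiering software weighs far more than a simple counter.

```python
def assign_tiers(access_counts, fast_slots):
    """Place the most frequently accessed items on the fast tier.

    access_counts: dict mapping item name -> number of recent accesses
    fast_slots:    how many items fit on the high-performance media
                   (a hypothetical simplification of real capacity limits)
    """
    # Rank items from most to least popular.
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    fast = set(ranked[:fast_slots])
    # Everything that does not fit on the fast tier goes to the capacity tier.
    return {name: ("fast" if name in fast else "capacity")
            for name in access_counts}


# Hypothetical access counts, for illustration only.
counts = {"orders.db": 950, "logs-2009.tar": 3, "index.bin": 400, "archive.iso": 1}
placement = assign_tiers(counts, fast_slots=2)
for name, tier in placement.items():
    print(f"{name}: {tier}")
```

The point of the sketch is only that the placement decision lives entirely in software; the media underneath just holds what it is told to hold.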

If you tend to think geographically, you’d like the right information in the right location at the right time if possible.  Whether that’s to better protect, or improve user experience, or something else — that’s all software as well.

I could go on, but when you think about it, just about everything we talk about that’s new, interesting and useful tends to boil down to a software discussion.

Sure, there are new hardware bits like faster processors, and enterprise flash drives, and newer 10GbE interconnects — but it takes software to make all that stuff really useful.

The Impact Of Open Source

Much like industry-standard components and architectures set the floor for cost-effective hardware, I think open source software sets the floor for cost-effective software functionality.

There’s still room to innovate in software, but you have to do it in areas that haven’t been well-covered by the open source community.  And — make no mistake — it’s a safe bet that open source software will be an ever-increasing part of our enterprise environments.

Resistance to either trend appears futile :-)

Separating Software From Hardware

We’ve just come to assume that storage software is inevitably woven into storage hardware.  But as the industry moves to more standard components and architectures, that’s becoming more of a business model discussion, and less of a technology discussion.

Examples are starting to abound, especially within EMC’s portfolio.

Our Atmos cloud storage platform is now available as a VMware virtual machine.  Run it on just about any VMware-supported hardware platform, and you’ve got a fully featured, next-generation, distributed, object-oriented, metadata-rich, policy-driven cloud storage environment.

One could separately debate the merits of running Atmos storage software on a generic hardware platform vs. one that is specifically built for purpose, but that’s more of a discussion around implementation choices, and choice is good.

Many of you are aware that the Avamar client-side dedupe backup platform works basically the same way — your backup target can either be a dedicated hardware device running Avamar, or the same functionality running in a VM on generic hardware — it’s your choice.

Going further, there’s a much larger universe of EMC storage products just waiting to escape the confines of physical hardware: RecoverPoint, VPLEX, Celerra, Centera, and the list goes on.

There are even some interesting open-source choices if you go looking: for example, the EMC LifeLine stack, which powers the increasingly powerful Iomega unified storage devices.

So why aren’t all these great things being done today?  Lots of issues, but the big one is — it’s hard!

Making storage software work predictably and reliably in a virtual machine takes substantial engineering effort.  And that incremental effort has to be balanced against other investment opportunities: things like adding new features, or supporting new hardware, or perhaps deep integrations with other environments.

It’s happening — it’s just not an overnight process.  Sorry to say, the future isn’t quite here yet …

Fast Forward Several Years

Imagine you’re in charge of storage decisions at your company, and you’re trying to put together a solution for part of your operation.

You might start by assembling a set of services you’ll need to provide for applications and users.  You evaluate different software stack options for functionality, price, reliability, support, ease-of-use, integration, APIs, etc.

You do so by composing various storage software VMs, and putting the resulting stacks through their paces.  Basic presentation services (file, block, object, etc.).  Some replication stuff, maybe some auto-tiering and/or intelligent archival stuff.

You test features and functions, integration points and management interfaces.  All using virtual machines in whatever test bed you’ve got handy.

No need to consider storage as hardware just yet.

When you’re ready to implement, you’ve got more choices: you can stick with the storage-software-in-a-VM approach, or perhaps consider purpose-built hardware if your needs so dictate. 

Functionality first, implementation second.

Farther Down The Line

The migration of storage functionality from hardware to software will likely change how storage hardware itself is built.  At the low end of the market, all-in-one storage can learn new tricks simply by invoking new elements of a (presumably virtualized) software stack.

And at the high end of the market, it’s not hard to imagine larger, dynamic pools of virtualized storage capabilities that flex both resources and functionality much the way virtualized servers do today.  To be fair, though, that’s a reasonable description of what VMAX and VPLEX do today.

Indeed, we can easily see storage software functionality running flexibly where it makes the most sense: on a general purpose all-in-one storage hardware platform, or perhaps as a set of virtualized tasks in a server farm, or perhaps on an appliance dedicated to a task, or any combination as needs shift.

And that’s going to force some changes in thinking all around.

Final Thoughts

The runaway success of VMware has caused many of us to think of "servers" in terms of software images that are invoked as needed.  The hardware is still there, and it needs to do its job, but we think about it differently.

Will we learn to think of storage the same way?


How Much Is Too Much?

by on Jul.24, 2009, under Storage

You can never be too thin or too rich. (Although I don’t think whoever came up with that had Nicole Richie or the Sultan of Brunei in mind.)

And in the world of backup, you can never be too fast. It is just not possible.

Rich Colbert and Daniel Budiansky have posts regarding the importance of speed to backup. You can read those here and here.

The premise of their approach seems to me two-fold: one, the DD880 is fast enough that it obsoletes post-process deduplication technologies; and two, that this speed is high enough that for the first time in-line deduplication will no longer be a bottleneck in the backup process for the vast majority of customers.

Now I don’t entirely disagree with these arguments. For a long time, when asked about the performance of a DL4206 or DL4406, I have been saying: “They are so fast that I can virtually guarantee that they will not be the bottleneck in your backup process.” At up to 2,200 MB/s, we can see that there are many opportunities for backups to degrade before the EDL becomes the problem: data must be read from disk, through a client, across a network, sometimes mediated by a database or application interface, to a backup server; metadata must be processed; and then the data must traverse another network before it is written to a target. There are so many obstacles here (including the presence of an OS, a filesystem, different disk architectures, etc.) that there is very little chance that the EDL will be a bottleneck. Certainly not for a single backup server or storage node or media server.
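
A rough way to see this: the end-to-end rate of a backup stream cannot exceed the rate of its slowest stage. The sketch below models the path described above as a set of named stages and picks out the slowest one. Every per-stage number is a hypothetical placeholder chosen for illustration; only the 2,200 MB/s figure for the EDL comes from this post.

```python
def bottleneck(stages):
    """Return the (name, rate) pair of the slowest stage.

    stages: dict mapping stage name -> sustainable throughput in MB/s.
    The pipeline as a whole can run no faster than this stage.
    """
    return min(stages.items(), key=lambda item: item[1])


# Hypothetical per-stage rates in MB/s. Only the EDL figure is from the post.
stages = {
    "source disk read": 600,
    "client and application interface": 400,
    "network to backup server": 1000,
    "backup server metadata processing": 800,
    "network to target": 1000,
    "EDL target": 2200,
}

name, rate = bottleneck(stages)
print(f"Bottleneck: {name} at {rate} MB/s")
```

With numbers anything like these, the target is the last place you would look for a slowdown.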

Those customers that do find it to be a bottleneck almost certainly have many media servers, or storage nodes. One customer that I can think of that is asking for 4,500 MB/s of sustained backup bandwidth has over 40 TSM servers that are driving this workflow.

With the introduction of the DD880, what we see is a significant decrease in the number of backup organizations that will experience the backup target as a bottleneck. This stands in contrast to the DD690 and the DL3000 (both with a rough maximum speed of 400 MB/s for in-line deduplication), where many customers were forced to buy multiple systems in order to match their throughput requirements.

Or they considered tiering their backup infrastructure with a DL4000 with post-process deduplication.

But architecturally, those were really the only two choices: buy many slower systems to deduplicate in line, or buy fewer, tiered backup targets. (For argument’s sake I think we could stipulate that the ratio would be on the order of 6-8 to 1: if you wanted to be as fast as a tiered device, you would need 6-8 in-line systems.)
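
The arithmetic behind that ratio is simple enough to sketch. The snippet below uses only figures quoted earlier in this post (roughly 400 MB/s per older in-line system, up to 2,200 MB/s for a DL4206 or DL4406, and the customer asking for 4,500 MB/s), and it assumes throughput scales linearly as you add targets, which is an idealization. The function name is just for illustration.

```python
import math


def targets_needed(required_mb_s, per_target_mb_s):
    """Number of backup targets needed to sustain the required rate,
    assuming throughput adds up linearly across targets (an idealization)."""
    return math.ceil(required_mb_s / per_target_mb_s)


INLINE_MB_S = 400    # rough in-line maximum quoted for the DD690 / DL3000
TIERED_MB_S = 2200   # "up to 2,200 MB/s" quoted for the DL4206 / DL4406

# In-line systems needed to match one tiered device.
print(targets_needed(TIERED_MB_S, INLINE_MB_S))   # 6

# The customer asking for 4,500 MB/s of sustained backup bandwidth.
print(targets_needed(4500, INLINE_MB_S))          # 12
print(targets_needed(4500, TIERED_MB_S))          # 3
```

The first result lands at the low end of the 6-8 range; real deployments rarely scale perfectly, which is one reason to expect the higher end in practice.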

The DD880 changes this dynamic.

But it is important to note that it changes it quantitatively, rather than qualitatively. What I mean by that is this: by increasing the speed of the in-line deduplication target, we have narrowed the use case for tiered backup (the DL4000 with deduplication). The number of customers interested in that approach will be smaller, because the DD880 is fast enough that they can meet their requirements with a single backup target (not the multiple systems that would previously have been required).

Another dimension to this discussion: how many backup targets are you willing to manage? If a DL4406 is twice as fast as a DD880, you need to ask this question (assuming you need 2,000 MB/s plus of backup bandwidth): do you value a single target for management, with delayed deduplication, or do you value in-line deduplication more, even if that entails multiple targets?

Let me be clear: I don’t think there is any one right answer here! I think that different organizations are likely going to weight priorities differently, and will have different answers to that question. The important thing here is that it is a question you will likely have to answer for yourself if you have the requirement for a very large amount of backup bandwidth: a single tiered target? or multiple in-line targets?

So, I don’t agree 100% with Rich when he writes: “speeds and feeds are no longer an inline versus post-process argument … speeds and feeds are no longer a dedupe versus non-dedupe argument either.” I think (because I am a sucker for precision) that it is the case that the DD880 has dramatically reduced the number of cases in which this decision needs to be made. The number of customers that need to think about an in-line device like the DD880, and a tiered device, like the DL4000, is much smaller than it was before the introduction of the DD880.

However, no technology stands still. The DL4000 line will continue to get bigger, faster, and better. For the time frame into which I think I have a useful amount of insight, there will continue to be a real gap (2-3x performance?) between straight VTL technologies, with tiered deduplication, like the DL4000, and pure in-line technologies, like the DD880. So for the foreseeable future, there will always be a certain number of organizations at the very high end of the market that will have to weigh these considerations: in-line versus tiered, and one versus many.

For the rest of the world, the potential complexity of your backup environment was just reduced: one device will now suffice. And for the first time in a long time, we can truthfully say to many customers that it is very likely that not only will a VTL not be a bottleneck to your backup environment, but that an inline deduplication appliance will not be a bottleneck either.

It is about time.


PHD Virtual Extends esXpress Support To VMware vSphere 4

by on Jul.23, 2009, under Storage

PHD Virtual Technologies, provider of the esXpress data protection and recovery solution for virtual machines, today announced that esXpress has been extended to support VMware vSphere 4.

This new release of esXpress version 3.6 also includes significant enhancements for all versions of VMware’s ESX platform version 3.0.2 and above. An optimized deduplication engine dramatically increases backup speeds and fuels performance for file-level restores, as well as VMDK restores and data archival via a Windows Share.

esXpress, with new support for vSphere 4, performs backup and recovery using the virtual environment itself. By creating virtual backup appliances (VBAs) – small virtual machines – the solution can be deployed in minutes on VMware servers, and provides the most scalable environment for backing up virtual machines. New performance enhancements include:

  • File-level restores are now up to four times faster
  • Data restoration and archival via Windows shares are now up to four times faster
  • PHDD deduplication image-level restores are now up to twice as fast
  • An accelerated deduplication engine seeds initial backups at double the previous rate

esXpress continues to support up to 16 concurrent backup/restore streams per host and all backups can be self-restored without using esXpress or other proprietary virtual machine infrastructure. esXpress’ block-level backups are deduplicated at the source, ensuring data is compressed and deduped before it ever leaves the host. This ensures that network traffic is kept to a minimum even while backing up over a WAN link.
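
As a generic illustration of why source-side deduplication keeps network traffic down, here is a minimal sketch. It is not the esXpress implementation: the fixed 4 KB block size, the SHA-256 fingerprint, the zlib compression and the in-memory dictionary standing in for the backup target are all assumptions made for the example.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical fixed block size


def backup(data, store):
    """Split data into blocks and 'send' only blocks the target lacks.

    store: dict standing in for the backup target (fingerprint -> compressed block).
    Returns (recipe, bytes_sent), where recipe is the ordered list of
    fingerprints needed to rebuild the data.
    """
    recipe = []
    bytes_sent = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:
            # New block: compress it on the host before it crosses the wire.
            payload = zlib.compress(block)
            store[fingerprint] = payload
            bytes_sent += len(payload)
        recipe.append(fingerprint)
    return recipe, bytes_sent


def restore(recipe, store):
    """Rebuild the original data from its recipe."""
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)


store = {}
disk = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE  # four blocks, two of them unique
recipe, sent_first = backup(disk, store)
_, sent_second = backup(disk, store)  # unchanged data on the second run

assert restore(recipe, store) == disk
print(f"unique blocks stored: {len(store)}")
print(f"bytes sent on second backup: {sent_second}")
```

The second backup of unchanged data sends nothing at all, which is the property that matters most over a WAN link.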

