Home Connected

Part 3: Storage as a Service – Clouds of Data

Okay, I finally figured it out.  The problem with my sons’ go-kart, I mean.  Rather, “problems”, plural, I should say.  What does my boys’ go-kart have to do with data networking, cloud computing, or anything else written about on this site?  Well, the go-kart problem is a metaphor for the complexity that inevitably arises when two or more failures occur in any system.  The symptoms associated with multiple failures – or root causes – can be difficult to analyze, comprehend, and diagnose.  Never mind attempting a cure, each failure has to be isolated and identified before a fix can be applied.  Identifying root causes in data networks is never easy, in my experience, but the degree of difficulty increases exponentially if two or more failures occur.

So, what does this have to do with Ed Koehler’s third installment of “Storage as a Service – Cloud of Data”?  The promise of The Cloud is to reduce the complexity of IT operations by essentially “out-sourcing” much of the infrastructure.  The fundamental business premise is to devote more of an enterprise’s resources to its business requirements, sharing the IT infrastructure with others, and leaving the operational costs and expertise in the hands of the cloud provider.  Who is, presumably, better and more efficient in delivering IT to its customers rather than having each customer provide from himself.  Storage is a sufficiently complex operation that there appears to be sufficient demand in many vertical markets to warrant its being offered as a service to others.  And not just to the small and medium sized enterprises, but to the Fortune 1000 as well.  Leaving those thorny multiple root cause network, server, and storage problems in the hands of folks far better qualified to solve them than us mere mortals.

Oh, yeah.  The go-kart.  Root cause #1: two of its three batteries – yes, it’s an electric vehicle – had defective cells.  Root cause #2: the insulation of one of the high-current carrying conductors had a pin-hole opening, offering a high-resistance path to the kart’s frame.  The result?  Everything appeared to be normal until the driver depressed the accelerator pedal, which caused the kart to initially move forward, then lost power almost immediately and stopped.  I even borrowed a portable oscilloscope from a former colleague now at H-P to verify that the PWM output of the controller indeed diminished in amplitude when the pedal was actuated.  Hey, I can still do a few basic engineering tasks!

But Ed Koehler can do a lot more.  Let’s read on.



Technologies for SaaS
The above use models assumed the use of underlying technologies to move the data, reduce it and store it. These are then merged with supporting technologies such as web services, collaboration and perhaps content delivery to create a unified solution to the customer. Again, this could be as simple as a storage target where data storage is the primary function or it could be as complex as a full collaboration portal where data storage is more ancillary. In each instance, the same basic technologies come into play. It is obvious that from the point of view of the customer, only the best will do. While from the point of view of the provider, it is providing what will meet the level of service required. This results in a dichotomy as often occurs in any business model. The end result is an equitable compromise which uses the technologies below to arrive at an equitable solution that satisfies the interest of the user as well as that of the provider. The end result is a tenable set of values and benefits to all parties which is the sign of a good business model.

Disk Arrays
Spinning disks have been around almost as long a modern computing itself. We all have the familiar spinning and clicking (now oh so faint!) on our laptops as the machines chunks through data on its relentless task of providing the right bits at the right time. Disk technology has come a long ways as well. The MTBF rating for even lower end drives are exponentially higher than the original platter technologies. Still though, this is the Achilles Heel. This is the place where the mechanics occur. Where mechanics occur, particularly high speed mechanics ñ failure is just one of the realities that need to be dealt with.

I was surprised to learn just how common it is that just a bunch of disks are set up and used for cloud storage services. The reason is simple, cost. It is far more cost effective to place whole disk arrays out for leasing than it is to take that same array and sequester a portion of it for parity or mirroring. As a result, many cloud services offer best effort service and with smaller services that pretty much works, particularly if the IT staff is diligent with backups. As the data volume grows however, this approach will not work as the MTBF rate of potential failure will out weigh the ability to pump the data back into the primary. That exact number is related to the network speed available and since most organizations do not infinite bandwidth available, that limit is a finite number.
Now one could go through the math to figure the probability of data loss and gamble, or one could invest into RAID and be serious about the offering they are providing. As we shall see later on, there are technologies that assist in the economic feasibility. In my opinion, it would be the first question I asked someone who wanted to provide me a SaaS offering. That is first beyond backup and replication or anything else. Will my data be resident on a RAID array? If so what type? Another question to ask is the data replicated? If so, the next question is how many times and where?

Storage Virtualization
While a SaaS offering could be created with just a bunch of disk space. Allocation of resources would have very rough granularity and the end result would be an environment that would be drastically over provisioned. The reason for this is that as space is leased out the resource is ëusedí whether it has data or not. Additionally, as new customers are brought on line to the service additional disk space must be acquired and allocated in a discrete fashion. Storage virtualization overcomes this limitation by creating a virtual pool of storage resources that can consist of any number and variety of disks. There are several advantages that are brought about by the introduction of this type of technology. The most notable is that of thin provisioning. Which, from a service provider standpoint is some thing that is as old as service offerings itself. As an example, network service providers do not build their networks to be provisioned to 100% of the potential customer capacity 100% of the time. Instead they analyze and look at traffic patterns and engineer the network to handle the particular occurrences of peak traffic. The same might be said of a thinly provisioned environment. Instead of allocating the whole chunk of disk space at the time of the allocation, a smaller thinly provisioned chunk is setup but the larger chunk is represented back to the application. The system then monitors and audits the usage of the allocation and according to high water thresholds, allocate more space to the user based on some sort of established policy. This has obvious benefits in a SaaS environment as only very seldom will a customer purchase and use 100% of the space at the outset. The gamble is that the provider keeps enough storage resources within the virtual pool to accommodate any increases. Being that most providers are very familiar with type of practice in bandwidth provisioning, it is only a small jump to apply that logic in storage.

Not all approaches to virtualization are the same however. Some implementations are done at the disk array level. While this approach does offer pooling and thin provisioning, it only does so at the array level or within the array cluster. Additionally, the approach is closed in that it only works with that disk vendor’s implementation. Alternatively, virtualization can be performed above the disk array environment. This approach more appropriately matches a SaaS environment in that the open system approach allows any array to be encompassed into the resource pool which better leverages on the SaaS providersí purchasing power. Rather than getting locked into a particular vendors approach, the provider has the ability to commoditize the disk resources and hence allow better pricing points.
There are also situations called margin calls. These are scenarios that can occur in thinly provisioned environments where the data growth is beyond the capacity if the resource pool. In those instances, additional storage must physically be added to the system. With array based approaches, this can run into issues such as spanning beyond the capacity of the array or the cluster. In those instances, in order to accommodate for the growth, the provider needs to migrate the data to a new storage system. With the open system approach, the addition of storage is totally seamless and it can occur with any vendors hardware. Additionally, implementing storage virtualization at a level above the arrays allows for very easy data migration, which is useful in handling existing data sets.

Data Reduction Methods
This is a key technology for the providers return on investment. Here remember that storage is the commodity. In typical Cloud Storage SaaS offerings the commodity is sold by the Gigabyte. Obviously, if you can retain 100% of the customers data and only store ten or twenty percent of the bits, the delta is revenue back to you for return on investment. If you are then able to take that same technology and not only leverage it across all subscribers but across all content types as well then it becomes something that is of great value to the overall business model of Storage as a Service. The key to the technology is that the data reduction is performed at the disk level. Additionally, the size of the bit sequence is relatively small (512 bytes) rather than the typical block levels. As a result, the comparative is large (the whole SaaS data store) while the sample is small (512 bytes) The end result, is that as more data is added to the system the context of reference is widened correspondingly meaning that the probability that a particular bit sequence will match another in the repository is hence increased.

But beware, data reduction is not a panacea. Like all technologies it has its limitations and there is the simple fact that some data just does not de-duplicate well. There is also the fact that the data that is stored by the customer is in fact manipulated by an algorithm and abstracted in the repository. This means that some issues of regulatory legal compliance may come into play with some types of content. For the most part however, these issues can be dealt with and data reduction can play a very important role in SaaS architectures, particularly in the back end data store.

Replication of the data
If you are doing due diligence and implementing RAID rather than selling space on just a bunch of disks, then your most probably the type that will go further to create secondary copies of the primary data footprint. If you do this, you also probably want to do this on the back end so as not to impact the service offering. You also probably want to use as little network resource as possible to keep that replicated copy up to date. Here technologies like Continuous Data Protection and thin replication can assist in getting the data into the back end and performing the replication with minimal impact to network resources.

There are more and more concerns about placing content in the cloud. Typically these concerns are from business users who see it as a major compromise of security policy. Individual end users are also broaching concerns around confidentiality of content. Encryption can not solve the issue by itself but it can go a long way towards it. It should be noted though that with SaaS encryption needs to be considered in two aspects. First is the encryption of data in movement. That is protecting the data as it is posted into and pulled out of the cloud service. Second is the encryption of data at rest, which is protecting the content once it is resident in a repository. The first is addressed by methods such as SSL/TLS or IPSec. The second is addressed by encryption at the disk level or prior to disk placement.

Access Controls
Depending on the type and intention of the service, access controls can be relatively simple (i.e. user name & password) to complex (RSA type). In private cloud environments, normal user credentials for enterprise or organization access would be the minimum requirement. Likely, there will be additional passwords or perhaps even tokenization to access the service. For semi-private clouds the requirements are likely to not be as intense but again, can be if needed. Also, there may be a wide range in the level of access requirements. As an example, for a backup service there only needs to be an iSCSI initiator/target binding and a monthly report on usage that might be accessible over the web. In other services such as collaboration, a higher level portal environment will need to be provided, hence the need for a higher level access control or log on. Needless to say, some consideration will need to be made for access to the service, even if it is for the minimal task of data separation and accounting.

The technologies listed above are not required, as pointed out above just a bunch of disks on the network could be considered cloud storage. Nor is the list exhaustive. But if the provider is serious about the service offering and also serious about its prospect community, it will make investments into at least some if not all of them.

By Ed Koehler

William Seifert is the former Chief Technology Officer of Data Solutions at Avaya. He has served on numerous corporate boards, and holds a Bachelor and Master of Science, Electrical Engineering from Michigan State University. more

Leave a Reply