Getting data to the cloud

One of the problems facing cloud computing is the difficulty of getting data from your local servers to the cloud. My home Internet connection offers me maybe 768 Kbps upstream, on a good day, if I’m standing in the right place and nobody else in my neighborhood is home. Even at the office, we have a T1 connection, so we get something like 1.5 Mbps upstream. One (just one!) of the VM images I use for testing is 3.3 GB. Pushing that up to the cloud would take about five hours even under ideal conditions!
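The five-hour figure is easy to check with back-of-the-envelope arithmetic (taking "GB" as 2^30 bytes):

```python
# Transfer time for a 3.3 GB image over a 1.5 Mbps uplink, assuming
# the link runs at full rate with no protocol overhead (ideal conditions).
size_bits = 3.3 * 1024**3 * 8   # 3.3 GB expressed in bits
rate_bps = 1.5 * 10**6          # 1.5 Mbps in bits per second
hours = size_bits / rate_bps / 3600
print(round(hours, 1))          # roughly five hours
```

Real-world overhead and contention only push the number higher.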

I don’t know what the solution to this problem is yet, but it’s definitely something a lot of people are working on, so I thought I’d point out a couple of interesting ideas in this area. First is the Fast and Secure Protocol (FASP), a UDP-based replacement for TCP developed by Aspera and now integrated with Amazon Web Services. The basic idea is to improve transmission rates by eliminating some of TCP’s inefficiencies, particularly its poor throughput on high-latency or lossy links. In theory this will allow you to more reliably achieve those “ideal condition” transfer rates, and if their benchmarks are to be believed, they’ve done just that. But all this does is ensure that transferring my VM image really does take “only” five hours. That’s good, I suppose, but it doesn’t seem like a revolution.

From my perspective, a more interesting idea is LBFS, the low-bandwidth file system. This is a network filesystem, like NFS, but expressly designed for use over “skinny” network connections. It was developed several years ago at MIT, but I hadn’t heard of it until today, so I imagine many of you probably haven’t either. The most interesting idea in LBFS is that you can reduce the amount of data you transfer by exploiting commonalities between different files, or between different versions of the same file. Basically, the sender computes a hash for every block of every file being transferred, and only sends blocks that the receiver doesn’t already have. The receiver takes the list of hashes and uses it to reassemble the file. This can give you a dramatic reduction in bandwidth requirements. For example, consider PDB files, the debugging information generated by the Visual C++ compiler: every time you compile another object referencing the same PDB, new symbols are added to it and some indexes are updated, but most of the data remains unchanged, so only the changed blocks actually need to cross the wire.
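The hash-and-send idea can be sketched in a few lines of Python. This is only an illustration, not the LBFS protocol: fixed-size blocks and an in-memory dict stand in for LBFS's content-defined chunks and the server's on-disk chunk index.

```python
import hashlib

def blocks(data, size=8192):
    # Split data into fixed-size blocks (a simplification; LBFS picks
    # block boundaries from the content itself).
    return [data[i:i + size] for i in range(0, len(data), size)]

def sync(data, server_store):
    """Upload a file: send only blocks the server hasn't seen before.
    Returns the hash list (the 'recipe' for the file) and bytes sent."""
    recipe, sent = [], 0
    for block in blocks(data):
        h = hashlib.sha256(block).hexdigest()
        if h not in server_store:      # server lacks this block: transfer it
            server_store[h] = block
            sent += len(block)
        recipe.append(h)
    return recipe, sent

def reassemble(recipe, server_store):
    # The receiving side rebuilds the file from the hash list.
    return b"".join(server_store[h] for h in recipe)
```

Uploading a second version of a file that shares most of its blocks with the first costs only the changed blocks, which is exactly the PDB scenario above.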

Like I said, I don’t know what the solution to this problem is, but there are already some exciting ideas out there, and I’m sure we’ll see even more as cloud computing continues to evolve.

Eric Melski

Eric Melski was part of the team that founded Electric Cloud and is now Chief Architect. Before Electric Cloud, he was a Software Engineer at Scriptics, Inc. and Interwoven. He holds a BS in Computer Science from the University of Wisconsin. Eric also writes about software development at http://blog.melski.net/.

2 responses to “Getting data to the cloud”

    • Eric Melski says:

      That’s an excellent question. In fact LBFS and rsync definitely share some concepts, but you can think of LBFS as going further than rsync does. With rsync, we compute a checksum for each block of a file, then ask the target, “hey, is the checksum for block 1 of file A the same as what I have computed?” If so, then rsync saves by not transferring that block.

      With LBFS, it actually looks for identical blocks at different locations within the same file and even across files. So on the source host we compute a checksum for chunk A-1, and then say to the target host, “hey, do you have a chunk (from any location in any file) that matches this checksum?” If a matching chunk is found, it gets stitched into file A on the target machine. This gives you the potential for even more savings than rsync can provide.
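      That cross-file matching works because LBFS picks block boundaries from the content itself, so an insertion early in a file only disturbs the boundaries near it. A toy sketch of the idea, using a simple rolling-sum hash in place of the Rabin fingerprints LBFS actually uses:

```python
import hashlib
import random

def chunk_points(data, window=48, mask=0x3F):
    # Cut the stream wherever a rolling sum over the last `window` bytes
    # has its low six bits all set (average chunk ~64 bytes here; real
    # LBFS chunks average ~8 KB). Boundaries depend only on nearby
    # content, so later chunks still line up after an early insertion.
    cuts, h = [], 0
    for i in range(len(data)):
        h += data[i]                  # byte entering the window
        if i >= window:
            h -= data[i - window]     # byte leaving the window
            if (h & mask) == mask:
                cuts.append(i + 1)    # chunk boundary after byte i
    return cuts

def chunk_hashes(data):
    # Hash each content-defined chunk; a server indexes chunks by these
    # digests and can match them at any offset in any file.
    cuts = chunk_points(data) + [len(data)]
    prev, out = 0, []
    for c in cuts:
        if c > prev:
            out.append(hashlib.sha256(data[prev:c]).hexdigest())
            prev = c
    return out

random.seed(7)
base = bytes(random.randrange(256) for _ in range(20000))
edited = b"a few inserted bytes" + base   # insertion at the front
shared = set(chunk_hashes(base)) & set(chunk_hashes(edited))
# Nearly all chunk hashes survive the insertion, so almost nothing
# new would need to be transferred.
```

      With fixed-size blocks, that same insertion would shift every block boundary and defeat the matching entirely.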

      Thanks for the comment!
