
Why no alternate compression algorithms in rsync?

I was thinking the other day, gzip is all fine and good, but why doesn’t rsync support other compression methods? There are a few use cases where using LZO (a very low latency compression algorithm) would be a better choice.

One such case would be when operating with a relatively slow CPU, such as on an old system or an embedded device. LZO should still get the job done quickly.

Another case is where you’re transferring files over a fast network, but moving so much data that you still want some compression in effect. In this case, gzip -1, while fast, may still be slowing down the transfer too much to take full advantage of the available bandwidth. On anything but the slowest CPUs, LZO should give near-line-rate performance.
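In the meantime, if I really want LZO for a one-shot bulk copy, a rough workaround (a sketch only; it assumes lzop is installed on both ends, gives up rsync’s delta transfer, and /src, /dst, and user@host are placeholders) is to pipe tar through lzop over ssh:

tar -C /src -cf - . | lzop -1 -c | ssh user@host 'lzop -d -c | tar -C /dst -xf -'

The -1 keeps lzop in its fastest mode, and ssh compression is left off so we don’t stack zlib on top of it.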

rsync --skip-compress=LIST option

On Ubuntu 12.04, rsync has the --skip-compress=LIST option, which is fricken’ rad since it defaults to skipping files with the extensions 7z, avi, bz2, deb, gz, iso, jpeg, jpg, mov, mp3, mp4, ogg, rpm, tbz, tgz, z, and zip.

So before, when we used to agonize over whether to use the -z option if the transfer might include some already-compressed files (or at least I would have agonized over it), now we can rest easy knowing that rsync won’t try to compress at least some of them by default, and that we can override the defaults when needed with something like this syntax: --skip-compress=gz/jpg/mp[34]/7z/bz2.
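For example (the paths and host here are made up), combining -z with an overridden skip list looks something like this:

rsync -avz --skip-compress='gz/jpg/mp[34]/7z/bz2' /data/ user@backuphost:/backups/data/

The quotes just keep the shell from trying to glob the mp[34] bit.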

Cool Shell One-Liner of the Day

awk -F, '{print $1}' CSV | sort | uniq -c | grep -vw 1 | tee /dev/tty | wc -l

UPDATE: I went back and saw this post and thought to myself, “Self, why didn’t you annotate this garbage, you cheeky bastard?” OK, so the first part is pretty clear: get the first (or whichever) column you want from a simple (unquoted) csv file, then count the dupes. The grep is where we remove the non-dupes; it should probably be grep -Ev '^ *1 ' to avoid matching any of the csv data itself. Now here’s the magic. The pipe to tee /dev/tty prints everything straight to the terminal, while a copy of the output keeps going down the pipeline. So the wc -l is actually counting the number of entries which have duplicates (not the total number of all duplicates!), and displays that number at the bottom.
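So, belatedly, here it is spelled out (bash syntax; CSV is still whatever file you’re checking):

awk -F, '{print $1}' CSV |  # pull out the column we care about
  sort |                    # group identical values together
  uniq -c |                 # prefix each value with its count
  grep -Ev '^ *1 ' |        # toss anything that only appears once
  tee /dev/tty |            # show the dupe lines on the terminal...
  wc -l                     # ...while counting how many dupe entries there are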

Here’s the tail end of what I get from this on a sample csv:

gzip by default

In my last post on gzip, I discovered that gzip can compress data in a more sync-friendly way. This totally unrelated blog entry from nginx discusses a new gunzip filter that decompresses compressed data for clients that don’t support gzip.

I was thinking about this the other day. Why not store all your content compressed? Then you can use sendfile() or some other fast path to hand the data straight to clients that support gzip, and decompress on the fly only for the clients that don’t.

  • Decompressing is always faster than compressing (apples to apples).
  • You get to save storage space.
  • You could potentially reduce your IO by a large margin (over the network obviously, but also inside the box).
  • Since nearly every web browser in use today supports compression, you’d use it almost all the time. It’s the default case now, not the edge case.

There you have it. Compress to impress. Maybe we’ll see a return to the days of using compressed filesystems, but with multiple entry points depending on whether you want the data in compressed or uncompressed form: mount the same block device at /uncomp to read files decompressed, and at /comp to get them in their native compressed form.
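As it happens, nginx can already get you most of the way there. A minimal sketch, assuming the gzip_static and gunzip modules are compiled in, only the .gz copies are kept on disk, and the docroot is made up:

location / {
    root        /var/www/site;
    gzip_static always;   # serve foo.css.gz to everyone, even clients that don't send Accept-Encoding: gzip
    gunzip      on;       # ...and decompress it on the fly for the clients that can't handle it
}

Clients that speak gzip get the stored .gz bytes straight off disk; the stragglers get a decompressed copy, which is exactly the split described above.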

Gzip and Rsync

Gzip and Rsync were sitting in a tree, k-i-s-

Ok, I’ll stop. I just wanted to mention that I came across this little nugget in the gzip manpage the other night:

That, I think, is pretty cool.