GNU sort is normally crazy fast at what it does. However, recently I was trying to sort & unique several huge files and it seemed to be taking way too long. I did a little googling, and realized that it takes a lot longer to sort the full range of Unicode characters because it has to decode one or more bytes (UTF-8) before deciding where a character should be placed. There’s an easy way to increase the speed of the sort command, given a few caveats.
I’m not sure how I haven’t run into this already, but I love whenever I run into one of these little gems. The solution is pretty simple:
LC_ALL=C sort -uo uniqueoutput biginput1 biginput2
The C locale simply uses byte-ordering, so non-ASCII characters may end up in the wrong place. If you don’t need strict lexicographical sort, just a consistent sort, this seems to be the way to go.