2016/09/05

Speed of the sort command

GNU sort is normally crazy fast at what it does. Recently, though, I was trying to sort and de-duplicate several huge files and it was taking way too long. I did a little googling and learned why: in a UTF-8 locale, sort compares lines using the locale's collation rules, which means decoding multi-byte characters and applying language-aware ordering instead of simply comparing bytes. There’s an easy way to speed the sort command up, given a few caveats.

I’m not sure how I hadn’t run into this before, but I love stumbling on little gems like this one. The solution is pretty simple:

LC_ALL=C sort -uo uniqueoutput biginput1 biginput2

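The C locale simply compares raw byte values, so the output is not in dictionary order: uppercase ASCII letters sort before all lowercase ones, and non-ASCII characters sort after all of ASCII. If you don’t need a linguistically correct ordering, just a consistent one (which is all that removing duplicates requires), this seems to be the way to go.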
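To make the ordering caveat concrete, here is a minimal sketch with made-up sample words (they are not from my real files) showing where byte order puts mixed-case and non-ASCII lines:

```shell
# Hypothetical sample data: a duplicate, an uppercase word, and one
# non-ASCII word (\303\251 is the UTF-8 encoding of "é").
sample=$(printf 'apple\nBanana\n\303\251clair\nzebra\napple\n')

# Byte-order sort with duplicates removed, same flags as the command above.
c_sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort -u)
printf '%s\n' "$c_sorted"
# Banana
# apple
# zebra
# éclair
```

"Banana" comes first because "B" (0x42) is a smaller byte than "a" (0x61), and "éclair" lands last because its first byte (0xC3) is larger than any ASCII byte. In a typical UTF-8 locale such as en_US.UTF-8 (assuming it is installed on your system), the same input would usually come out as apple, Banana, éclair, zebra. To check the speedup on your own data, run each variant under time.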

Speed of the sort command is original content from devolve.

Tags: command line, data wrangling, performance

About The Author

Charlie Herron

Denizen of Portland, Maine; tech jack; lover / hater / whatever; philosophical dabbler. http://twitter.com/realgeek