sort -u versus sort | uniq

I just ran into an interesting situation with sort -u. I had generated a couple of files with md5sum, and they had a lot of identical lines in them, so I thought I would create a merged version.

“OK,” I thought, “I’ll just cat the two files into sort -u.” But the md5sum is in the first column and the file path in the second, so the output was sorted by md5sum and the file paths were all out of order. “No problem, I’ll just tell it which column to sort on with sort -k 2 -u.” This seemed perfectly natural to me, but it didn’t produce the expected results. There weren’t duplicate paths with different md5sums, at least not that I could see.

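I think what happened is that it sorted by the second column, but also only unique’d by that column. When a key is given, sort -u treats two lines as duplicates if their keys compare equal, even when the rest of the line differs. By splitting the procedure across a pipe (sort -k 2 | uniq), the sort still orders by the key, but uniq compares whole lines, so only fully identical lines are stripped out.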
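Here is a minimal sketch of the difference on a tiny data set. The hashes and paths are made up for illustration, and the temporary directory and variable names are mine, just to keep the example self-contained.

```shell
#!/bin/sh
# Two hypothetical md5sum-style listings (invented hashes and paths).
# They share one identical line, and ./a.txt appears with two different hashes.
dir=$(mktemp -d)
printf '%s\n' 'aaaa  ./b.txt' 'bbbb  ./a.txt' > "$dir/first.md5"
printf '%s\n' 'aaaa  ./b.txt' 'cccc  ./a.txt' > "$dir/second.md5"

# -u compares only the sort key (field 2 to end of line), so both
# ./a.txt lines count as duplicates and only one of them is kept.
by_key=$(cat "$dir/first.md5" "$dir/second.md5" | sort -k 2 -u)

# uniq compares whole lines, so only the exact duplicate
# ('aaaa  ./b.txt') is dropped and both ./a.txt lines survive.
whole_line=$(cat "$dir/first.md5" "$dir/second.md5" | sort -k 2 | uniq)

printf 'sort -k 2 -u:\n%s\n\n' "$by_key"
printf 'sort -k 2 | uniq:\n%s\n' "$whole_line"

rm -r "$dir"
```

The first pipeline should print two lines and the second three. Which of the two ./a.txt lines survives sort -u is not something I would rely on.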

sort -u versus sort | uniq is original content from devolve.