Tag: data wrangling

Machine-readable Dates

I had some directories named in the format of “Jul 18, 2012”. Thanks, iPhoto export, but no thanks. [code snippet omitted] Note: gdate is GNU date, available after running brew install coreutils.
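The original snippet is not preserved here, so what follows is only a minimal sketch of the kind of rename the post describes. It assumes GNU date is available as gdate (macOS with coreutils) or as date (Linux). The helper names to_iso and rename_dated_dirs, and the YYYY-MM-DD target format, are my own illustrative choices.

```shell
#!/bin/sh
# Use gdate if present (macOS + coreutils); otherwise assume date is GNU date.
DATE_CMD=$(command -v gdate || command -v date)

# to_iso is a hypothetical helper: "Jul 18, 2012" -> "2012-07-18"
to_iso() {
    "$DATE_CMD" -d "$1" +%Y-%m-%d
}

# Rename every directory in the current folder whose name GNU date can parse.
rename_dated_dirs() {
    for dir in */; do
        [ -d "$dir" ] || continue
        name=${dir%/}
        # Skip names that GNU date cannot parse as a date.
        iso=$(to_iso "$name" 2>/dev/null) || continue
        # Skip if already renamed or if the target name is taken.
        [ "$name" = "$iso" ] && continue
        [ -e "$iso" ] && continue
        mv -- "$name" "$iso"
    done
}
```

Run rename_dated_dirs from inside the export folder. GNU date accepts fairly loose input, so it is worth checking the directory list first if it contains names that are not dates.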

sort -u versus sort | uniq

I just ran into an interesting situation with sort -u. I had generated a couple of files with md5sum, and they had a lot of identical lines in them. So I thought I would create a merged version.
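The excerpt stops before saying what the surprise was, so this sketch only shows one difference I am sure of, which may or may not be the one the post ran into. When whole lines are compared, sort -u and sort | uniq give the same result. Once a key is given with -k, sort -u keeps one line per key, while uniq still compares entire lines. The file names and contents are made up.

```shell
#!/bin/sh
# Hypothetical checksum lists in md5sum's "hash  filename" layout.
workdir=$(mktemp -d)
printf '%s\n' 'aaa  one.jpg' 'bbb  two.jpg' > "$workdir/a.md5"
printf '%s\n' 'aaa  one.jpg' 'aaa  copy-of-one.jpg' > "$workdir/b.md5"

# Whole-line comparison: both forms give the same merged list.
merged_u=$(sort -u "$workdir/a.md5" "$workdir/b.md5")
merged_uniq=$(sort "$workdir/a.md5" "$workdir/b.md5" | uniq)

# Keyed comparison: sort -u keeps one line per hash,
# while uniq still compares entire lines.
by_hash_u=$(sort -u -k1,1 "$workdir/a.md5" "$workdir/b.md5")
by_hash_uniq=$(sort -k1,1 "$workdir/a.md5" "$workdir/b.md5" | uniq)
```

With this data the keyed sort -u drops copy-of-one.jpg or one.jpg because they share a hash, which is handy for finding unique content but loses file names.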

join: the command

From the manual: [code snippet omitted] I had two CSVs, baz01.csv and baz02.csv. They shared the same first column, which was a list of database table names. The second column contained the number of rows in each table. The row counts differed between the two files, and I wanted to compare them. The join command to …
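The excerpt ends before the actual command, so this is a sketch of one way to do the comparison rather than the post's own invocation. The file names come from the post; the sample rows and the .sorted helper files are invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample data: table name, row count.
workdir=$(mktemp -d)
printf '%s\n' 'users,100' 'orders,250' 'logs,9000' > "$workdir/baz01.csv"
printf '%s\n' 'orders,250' 'logs,9100' 'users,100' > "$workdir/baz02.csv"

# join needs both inputs sorted on the join field (here: field 1).
sort -t, -k1,1 "$workdir/baz01.csv" > "$workdir/baz01.sorted"
sort -t, -k1,1 "$workdir/baz02.csv" > "$workdir/baz02.sorted"

# -t, sets the field separator; output is "table,count_from_01,count_from_02".
joined=$(join -t, "$workdir/baz01.sorted" "$workdir/baz02.sorted")

# Keep only the tables whose counts differ.
differing=$(printf '%s\n' "$joined" | awk -F, '$2 != $3')
```

By default join prints only names that appear in both files, so a table missing from one side silently drops out of the comparison.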

AWK blows me away

How did I not know this about awk!? Don’t get me wrong, I’m no awk expert; I mostly use its simplest and most obvious features. But I almost always use the -F option to specify the field separator. Until today, I thought you could only give it either a single character …
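The excerpt cuts off here, but the discovery is presumably that -F is not limited to one character. In standard awk, a field separator longer than one character is treated as an extended regular expression. A small sketch with made-up input:

```shell
#!/bin/sh
# Split on a run of digits: the separator is a regex, not a single character.
second=$(printf '%s\n' 'alpha123beta45gamma' | awk -F'[0-9]+' '{print $2}')

# Split on either a comma or a semicolon, with optional surrounding spaces.
third=$(printf '%s\n' 'a , b;c' | awk -F' *[,;] *' '{print $3}')
```

Here second holds beta and third holds c.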

MySQL engines, constraints & keys

I wanted to see how I could improve the performance of a MySQL database with mixed table engines by converting all the MyISAM tables to InnoDB, and keep the huge DB responsive during backups by running mysqldump with the --single-transaction option. I used the following PHP script (I know, spare me): [code snippet omitted]
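The PHP script is not preserved here. As a rough shell sketch of the same idea (not the author's script), the conversion can be generated from a list of table names. DB_NAME and the alter_statements helper are hypothetical, and the commands that touch a real server are left commented out.

```shell
#!/bin/sh
# Hypothetical database name; replace with the real schema.
DB_NAME=mydb

# Turn a list of table names (one per line on stdin) into ALTER statements.
alter_statements() {
    while IFS= read -r table; do
        [ -n "$table" ] && printf 'ALTER TABLE `%s` ENGINE=InnoDB;\n' "$table"
    done
    return 0
}

# List the MyISAM tables via information_schema and convert them.
# Not run here; uncomment once credentials are set up.
# mysql -N -B -e "SELECT TABLE_NAME FROM information_schema.TABLES
#     WHERE TABLE_SCHEMA='$DB_NAME' AND ENGINE='MyISAM'" \
#   | alter_statements | mysql "$DB_NAME"

# Consistent dump without holding table locks for the whole run.
# mysqldump --single-transaction "$DB_NAME" > "$DB_NAME.sql"
```

The --single-transaction snapshot is only consistent for transactional tables such as InnoDB, which is one reason to convert the MyISAM tables first.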

climagic is magic

If you’re not following @climagic, you should be forced to listen to this for hours on end: [code snippet omitted] That’s just one of the many glorious bits from this timeline.

sed is great, but not that great

TIL (http://stackoverflow.com/questions/1103149/non-greedy-regex-matching-in-sed): it turns out sed has no concept of a non-greedy match. You have to use perl or some other tool with a richer regex engine to get that feature. The workaround given at Stack Overflow only works if the match ends at a single-character delimiter (in this case, it was [^/]+ to match until the …

Cool Shell One-Liner of the Day

awk -F, '{print $1}' CSV | sort | uniq -c | grep -vw 1 | tee /dev/tty | wc -l

UPDATE: I went back and saw this post and thought to myself, “Self, why didn’t you annotate this garbage, you cheeky bastard?” OK, so the first part is pretty clear: get the first (or whichever) …
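Since the annotation is cut off, here is my own stage-by-stage reading of the pipeline, run against a made-up sample file. It swaps /dev/tty for /dev/stderr so it also runs without a terminal; everything else matches the one-liner.

```shell
#!/bin/sh
# Hypothetical sample file standing in for CSV.
csv=$(mktemp)
printf '%s\n' 'a,1' 'b,2' 'a,3' 'c,4' 'b,5' 'a,6' > "$csv"

# awk -F, '{print $1}'  : take the first comma-separated field
# sort | uniq -c        : count how often each value occurs
# grep -vw 1            : drop lines containing the word "1" (a count of one)
# tee /dev/stderr       : show the surviving lines while passing them on
#                         (the original uses /dev/tty, which needs a terminal)
# wc -l                 : count how many distinct values are repeated
repeated=$(awk -F, '{print $1}' "$csv" | sort | uniq -c | grep -vw 1 | tee /dev/stderr | wc -l)
```

One caveat: grep -vw 1 also discards a repeated value that is itself the word 1. Filtering on the count column, for example with awk '$1 > 1', avoids that.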