Tag: data wrangling

Defend against fake Googlebots

I can think of some reasons why folks might use the Googlebot user agent on their non-Google bots, but I can’t think of any good, upstanding reasons to do it. Here’s how one might find some fine folks who would do such a thing. As of right now (May 2018), all valid Googlebot source

Finding how much time Apache requests take

When a request is logged in Apache’s common or combined format, it doesn’t actually show you how much time each request took to complete. To make reading logs a bit more confusing, each request is logged only once it’s completed. So a long-running request may have an earlier start time but appear later in the

scary rando stuff

You don’t see stuff like this every day (I hope).

Speed of the sort command

GNU sort is normally crazy fast at what it does. However, recently I was trying to sort & unique several huge files and it seemed to be taking way too long. I did a little googling, and realized that it takes a lot longer to sort the full range of Unicode characters because it has

Moving Evernote notes into WordPress

I’ve accumulated many notes (2000+) in Evernote over the years, and love that it can store binary attachments such as images or other media files. My favorite feature is the Evernote Web Clipper browser extension; it does a fantastic job at saving the parts of an article I want to save while keeping

distribution: histograms in the terminal

My new favorite tool is a Python program called distribution that can easily show histograms in your terminal. I used Homebrew to install it, but you can see some usage examples and a few other tools on this Stack Overflow page. I eagerly anticipate showing off some histograms to people.

GNU xargs is missing the -J option. WHY!?!

I find the xargs -J idiom so useful. It replaces the replstr (“%” in this example) with all the arguments at once, or as many as can fit without going over the system’s limit. I couldn’t believe it when I learned that the GNU version of xargs lacks this flag. Yes, it’s

Allow webapps to make outgoing requests

I was experiencing a pretty bad slowdown while trying to use the admin pages of a WordPress site recently. The load on the machine was quite low, so I began to suspect that it was trying to call out to external services (Facebook, Pinterest, etc.) that might have been blocked by CSF (ConfigServer Firewall). I

Discard first column without AWK

UPDATE: Major derp moment on my part, thinking that you needed a loop in AWK to print all but one field. Commandlinefu just caused a forehead-slapping moment when I saw an AWK one-liner in my feed. So, it seems AWK wins again. Carry on. If you’re trying to print one or more particular columns from some

Finding call-time pass by references in PHP.

While trying to move an older code base to a newer system and thus a newer version of PHP (5.3 -> 5.5), I knew that some of the code would need to be changed to avoid using some removed features. Specifically, I mean call-time pass by references. For those who don’t know, this is kind

Get items unique only to list1

…from two lists with some overlap. Spent some time working in Python on this problem. Afterwards, I realized it’s a shell one-liner:

    comm -23 <(sort f_most) <(sort f_some) | sort -n > f_uniq_to_1

I re-sort the output numerically since comm assumes its input is sorted lexicographically, and I happen to be comparing lists of numbers.
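To see the comm approach on something small, here is a minimal sketch with made-up data. The sample numbers and the temporary directory are assumptions for illustration only; the file names f_most, f_some and f_uniq_to_1 come from the one-liner. It writes sorted copies to temporary files instead of using process substitution, so it should also run under a plain POSIX sh, and it pins LC_ALL=C so sort and comm agree on the ordering.

```shell
# Work in a scratch directory so nothing real gets overwritten.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical sample data: one number per line, in no particular order.
printf '%s\n' 10 9 100 2 33 > f_most
printf '%s\n' 9 33 7 > f_some

# comm expects both inputs sorted lexicographically in the same collation.
LC_ALL=C sort f_most > f_most.sorted
LC_ALL=C sort f_some > f_some.sorted

# -2 hides lines found only in the second file, -3 hides lines common to both,
# so only lines unique to f_most remain. Then re-sort those numerically.
LC_ALL=C comm -23 f_most.sorted f_some.sorted | sort -n > f_uniq_to_1

cat f_uniq_to_1
```

With this sample data, f_uniq_to_1 ends up holding 2, 10 and 100, one per line. Without the final sort -n they would come out in lexicographic order (10, 100, 2), which is why the re-sort matters for lists of numbers.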