Defend against fake Google bots

I can think of some reasons why folks might use the Googlebot user agent on their non-Google bots, but I can’t think of any good, upstanding reasons to do it.

Here’s how one might find some fine folks who would do such a thing.

As of right now (May 2018), all valid Googlebot source IPs start with the same prefix, 66.249. This may change in the future, so if you’re having problems being crawled by Google, check that you’re not blocking a new range they may have started using. OK, here’s the nitty-gritty.
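
Something along these lines will dig the impostors out of the archived logs (a sketch; adjust the log path to suit your setup):

# count requests claiming to be Googlebot from outside 66.249.*, busiest first
zgrep -h Googlebot /var/log/apache2/access.log.*.gz | awk '$1 !~ /^66\.249\./ {print $1}' | sort | uniq -c | sort -rn | head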

Interestingly, a few of the IPs in my logs indicated that Facebook in Ireland is using the Googlebot user agent. Naughty! And of course you can modify the above to scan the current log files instead of the archived gzipped ones. If you want to make sure you’re not blocking a valid Google address, do a reverse lookup on some of the addresses in question:
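
host 66.249.66.1   # a real crawler address; it reverses to crawl-66-249-66-1.googlebot.com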

Here’s how I’m blocking the baddies (this isn’t original; I searched and found a version of this). This goes in your Apache config or .htaccess file:
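
# one common version of this trick, using mod_rewrite: refuse anything
# that claims to be Googlebot but comes from outside 66.249.*
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F,L]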

Quickly show which directories are serving host names on a multi-vhost Apache system

Does this look familiar? Maybe you need more fiber in your diet. Or maybe you need THIS:
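
# one way to get there: map each ServerName to its DocumentRoot
# (assumes a Debian-style layout; adjust the path for your distro)
grep -hE 'ServerName|DocumentRoot' /etc/apache2/sites-enabled/*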

You’re welcome.

Finding the most persistent, pernicious baddies by processing log files

Logwatch is a great utility that emails me a summary of system logs over the last 24 hours. One of the things it shows is unsuccessful login attempts and their source IP addresses. But the default unsorted output is hard to analyze and act on, since a single IP may appear many times, scattered at random through the output.

It looks kind of like this (I’ve obscured the full IP to protect the guilty).

So, here we go. Create a shell script or alias with the following:
pbpaste | ggrep -Po '\b((?:\d{1,3}\.){3}\d{1,3})\s' | distribution

Once you’ve got the sections from the logwatch email copied to the clipboard, run this to see which source IPs are the top offenders. Since I’m using pbpaste and ggrep, it should be clear I’m on a Mac. This works on Linux using xsel --clipboard --output and grep, respectively.
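
In other words, the Linux flavor would be:

xsel --clipboard --output | grep -Po '\b((?:\d{1,3}\.){3}\d{1,3})\s' | distribution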

And if you haven’t checked out distribution, you should. Super useful.

Finding how much time Apache requests take

When a request is logged in Apache’s common or combined format, it doesn’t actually show you how much time the request took to complete. To make reading logs a bit more confusing, each request is logged only once it has completed, so a long-running request may have an earlier start time but appear later in the log than quicker requests.

To help look at timing info without going deep enough to need a debugger, I decided that step one was a custom log format that saves the total request time. After adding usec:%D to the end of my Apache custom log format, I can now see how long various requests take to complete.
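
For reference, the full line might look something like this (a sketch based on the stock combined format):

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" usec:%D" combined_usec

Then something like this pulls the slowest recent requests out of the logs: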

tail -q -n 1000 *access.log | mawk -v 'FS=(:|GET |POST | HTTP/1.1|")' '{print $NF " " $6}' | sort -nr | head -n 100 > /tmp/heavy2

I’m using the “%D” format, which reports the response time in microseconds, for compatibility with older Apache releases. I’d prefer milliseconds, but when I tried “%{ms}T” on a server running 2.4.7 it didn’t work; that format only arrived in 2.4.13. The raw numbers are a bit hard to read, so we can add a little visual aid with commas as thousands separators.

cat /tmp/heavy2 | xargs -L 1 printf "%'d %s\n" | less

Note that because we are measuring the total request time, some of the numbers may be high due to remote network latency or a slow client. I recommend correlating several samples before blaming some piece of local application code.

Hope this helps you find your long-running requests!

Scary rando stuff

You don’t see stuff like this every day (I hope).

Speed of the sort command

GNU sort is normally crazy fast at what it does. Recently, though, I was trying to sort and unique several huge files and it seemed to be taking way too long. A little googling revealed why: sorting under a UTF-8 locale is much slower, since sort has to decode multi-byte characters and apply the locale’s collation rules before deciding where each line belongs. There’s an easy way to speed up the sort command, given a few caveats.

I’m not sure how I hadn’t run into this already, but I love it whenever I stumble onto one of these little gems. The solution is pretty simple:
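
# bytewise collation: no UTF-8 decoding, no locale rules
LC_ALL=C sort -u huge-file.txt > huge-file.sorted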

The C locale simply uses byte ordering, so non-ASCII characters may end up in surprising places. If you don’t need a locale-aware sort, just a consistent one, this seems to be the way to go.

Moving Evernote notes into WordPress

proprietary insecurity

I’ve accumulated many notes (2000+) in Evernote over the years, and love that it can store binary attachments such as images or other media files. My favorite feature is the Evernote Web Clipper browser extension; it does a fantastic job of saving the parts of an article I want while keeping the styling intact.

Evernote has a free plan which I’ve enjoyed for a long time, but recently the financial status of the company has come into question, and they restricted syncing to only two devices. The last thing I want is another Google Reader shutdown fiasco. I doubt a shutdown would make my existing notes disappear, but it’s better to be prepared ahead of time. To that end, I’ve been looking for a viable way to migrate my notes to another platform.

distribution: histograms in the terminal

My new favorite tool is a python program called distribution that can easily show histograms in your terminal:
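
Assuming it’s on your PATH, something like this gives a quick taste:

# histogram of login shells on this box
cut -d: -f7 /etc/passwd | distribution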

I used Homebrew to install it, but you can see some usage examples and a few other tools on this Stack Overflow page. I eagerly anticipate showing off some histograms to people.

GNU xargs is missing the -J option. WHY!?!

I find that using an idiom like
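
# BSD xargs: % is replaced with all the collected arguments at once
find . -type f -print0 | xargs -0 -J % mv % bar/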

is so useful. It replaces the replstr (“%” in this example) with all the arguments at once, or as many as can fit without going over the system’s limit. I couldn’t believe it when I learned that the GNU version of xargs lacks this flag. Yes, it’s only on the BSD xargs as far as I can tell.

Every time I’ve searched, someone suggests using the -I flag on GNU xargs instead, but they are not quite the same. The -I flag substitutes the replstr one argument at a time, so in the earlier example (say find matched three files), instead of executing
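
mv ./a.txt ./b.txt ./c.txt bar/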

only once, with the -I flag it will instead do
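
mv ./a.txt bar/
mv ./b.txt bar/
mv ./c.txt bar/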

I’ve also tried using the -n and -L flags, but they are mutually exclusive with each other and with -I. OK, so we need some kind of klugey workaround.
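
# append the destination dir (with no trailing NUL) after the NUL-delimited
# file list; works as long as everything fits in a single xargs batch
{ find . -type f -print0; printf 'bar/'; } | xargs -0 mv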

This adds the “bar/” suffix to the standard input before adding it to the end of the mv command. “But,” you say, “those strings are supposed to be null-terminated!” True, but we’re providing a suffix rather than an extra replacement argument, so the EOF signaled from the input stream is really all we need.

There’s another way that’s more intuitive but harder to get right: grab the argument list from a subshell command:
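
mv $(find . -type f) bar/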

But this suffers from not handling weird file names the right way. Instead one could do:
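
mv ./* bar/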

This actually works better for file names, but lacks the flexibility of find.

Is this stuff really what we ought to do? Just give us the -J, GNU. If you know a different way to deal with this, tweet me @realgeek and I’ll update this post.

Allow webapps to make outgoing requests

I was experiencing a pretty bad slowdown while using the admin pages of a WordPress site recently. The load on the machine was quite low, so I began to suspect it was trying to call out to external services (Facebook, Pinterest, etc.) that were being blocked by CSF (ConfigServer Firewall).

I started playing around with tcpdump and friends, then realized that the information I was looking for (blocked outgoing requests) was already being logged to /var/log/kern.log on our Ubuntu system (same on Debian).
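
A quick filter pulls out just the blocked outbound packets (assuming CSF’s usual log prefix):

grep -E '(TCP|UDP)_OUT Blocked' /var/log/kern.log | tail -n 20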

Discard first column without AWK

UPDATE: Major derp moment on my part, thinking that you needed a loop in AWK to print all but one field. Commandlinefu just caused a forehead-slapping moment when I saw this in my feed:
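
# the gist of it: blank out $1 and print the rest
awk '{$1=""; print $0}' input.txt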

So, it seems AWK wins again. Carry on.

If you’re trying to print one or more particular columns from some input, it’s quite straightforward with AWK. You’d simply specify the field variable(s) you know exist in the input (e.g.,
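
awk '{print $2, $5}' input.txt   # e.g., just the 2nd and 5th columns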

). However, it’s pretty AWKward (sorry) to omit one column of data and print the rest, particularly if you don’t know exactly how many columns of input to expect on each line. Then you’d need to actually program a loop in AWK. Ugh.

Finding call-time pass by references in PHP

While trying to move an older code base to a newer system, and thus to a newer version of PHP (5.3 -> 5.5), I knew some of the code would need changes to avoid using removed features; specifically, call-time pass by reference. For those who don’t know, this is a weird feature of earlier versions of PHP that let a caller pass any argument by reference, rather than the usual call by value, by prepending the argument variable with the “reference to” operator &. (It was deprecated in 5.3 and removed in 5.4.)

So, to illustrate, normally this code won’t have side effects because of call by value:
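
<?php
// a toy example: $s is passed by value, so the caller's variable is untouched
function addBang($s) {
    $s .= '!';
}
$msg = 'hello';
addBang($msg);
echo $msg;   // prints "hello"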

However, there will be side effects if the caller chooses pass by reference:
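
addBang(&$msg);   // call-time pass by reference; a fatal error as of PHP 5.4
echo $msg;        // on old PHP this prints "hello!"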

I thought a regex might be in order to find these guys and fix them, something along these lines:
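
# naive: flags a & in front of a $variable inside an argument list, but it
# also matches perfectly legal by-reference function definitions
grep -rEn '\w+\s*\([^)]*&\$' --include='*.php' .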

but it was a naive idea, and the regex kept devolving (heh) before I realized I could just use PHP’s built-in linter to find the problem spots:
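
# lint every file with the new PHP; call-time pass by reference is a
# compile-time error there, so only the problem files survive the grep
find . -name '*.php' -print0 | xargs -0 -n1 php -l 2>&1 | grep -v 'No syntax errors'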

HTH