Prolong the life of the SD card in your Raspberry Pi

The web is littered with stories of people who love their Raspberry Pis but are disappointed to learn that the Pi often eats the SD card. I’ve recovered a card once; otherwise, I’ve had a few that were destroyed beyond recovery. I’ll lay out how I use This One Weird Trick(tm), ahem, to try to prolong the life of the SD card.

First I should point out that my Pi storage layout is not typical. I basically followed this guide to boot from the SD card but run the root filesystem on a flash drive. While the stated purpose of the guide is to reduce activity on the SD card (and somewhat improve storage performance), I come at the SD card corruption issue from a different perspective.
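For context, the heart of that setup (paraphrasing the guide from memory; device names will vary) is pointing the kernel at the external root in /boot/cmdline.txt, while the SD card keeps supplying /boot:

# In /boot/cmdline.txt (a single line), change the root device, e.g.:
#   root=/dev/mmcblk0p2  ->  root=/dev/sda2
# In /etc/fstab on the flash drive, keep mounting the SD card at /boot:
/dev/mmcblk0p1  /boot  vfat  defaults  0  2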

In my view, the corruption is most likely caused by a timing bug, possibly quite low-level in the design or implementation of the hardware itself. Writing to the card less often probably reduces the chances of corruption, but my feeling is that once a Pi has been powered on for a while, you can’t really predict when the bug will manifest. I don’t believe most instances of SD card corruption happen in the first hours or days after a Pi boots up, so my goal was to confine writes to that initial window whenever possible.

After following the guide linked above, the SD card only hosts the /boot partition. Once init has started on / (the external storage), we really don’t need /boot any longer. In the middle of my /etc/rc.local file, I’ve added
mount -o ro,remount /boot
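Note that in rc.local the line has to come before the final exit 0. A minimal sketch of how the file might look (yours will have its own entries):

#!/bin/sh -e
# ... other boot-time tasks ...

# /boot lives on the SD card and is only needed while booting,
# so drop it to read-only once the system is up.
mount -o ro,remount /boot

exit 0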

In typical usage of a running system, /boot doesn’t really need to be mounted read-write. Of course, if you forget it’s mounted read-only, things like apt-get upgrade or rpi-update can fail when they try to write to it. Now, when I want to run those commands, I first reboot the Pi and remount the /boot partition with
sudo mount -o remount,rw /boot
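If you’re ever unsure which mode /boot is currently in, findmnt (part of util-linux) will show the active mount options:

findmnt -o TARGET,OPTIONS /boot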

Once the updating is done, I reboot again and leave /boot read-only.

kworker using CPU on an otherwise idle system

I have an old thin client that I upgraded into a home server by adding some additional RAM and storage. After a recent kernel upgrade, I noticed the system seemed sluggish at times despite doing nothing in particular. top showed a kworker process using CPU: not all of it, but perhaps 25 to 50% of the total.

I did a lot of searching to try to track down the offender. I used tools such as perf and iotop, and read about various power-management tunables under /proc. Finally, I ran Intel’s powertop command. It showed that “Audio codec alsa…” was hammering on some event loop.

I looked at the loaded kernel modules and, on a whim, ran sudo rmmod snd_hda_intel. That fixed the issue for me.
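rmmod only lasts until the next boot, so if, like mine, the machine doesn’t need its onboard audio, one way to make the fix stick is a modprobe blacklist entry (the filename here is my own choice):

echo 'blacklist snd_hda_intel' | sudo tee /etc/modprobe.d/blacklist-snd_hda_intel.conf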

Others may find that a kworker is running in a tight loop for some other reason. It could be some other misbehaving driver or an I/O problem.
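In that case, a system-wide profile can at least show where the kworker threads are spending their time. A sketch, assuming the perf package matching your kernel is installed:

sudo perf record -g -a -- sleep 10   # sample all CPUs for ten seconds
sudo perf report                     # look for kworker entries near the top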

Finding how much time Apache requests take

When a request is logged in Apache’s common or combined format, it doesn’t actually show how much time the request took to complete. To make reading the logs a bit more confusing, each request is logged only once it has completed, so a long-running request may have an earlier start time but appear later in the log than quicker requests.

To get some timing information without going deep enough to need a debugger, I decided that step one was a custom log format that records the total request time. After adding usec:%D to the end of my Apache custom log format, we can see how long various requests take to complete.
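For reference, the configuration might look something like this; the format name combined_usec is my own, and everything except the trailing usec:%D is the stock combined format (adjust the log path for your setup):

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" usec:%D" combined_usec
CustomLog ${APACHE_LOG_DIR}/access.log combined_usec

With that in place, we can pull the slowest recent requests out of the logs: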

tail -q -n 1000 *access.log | mawk -F'(:|GET |POST | HTTP/1.1|")' '{print $NF" "$6}' | sort -nr | head -n 100 > /tmp/heavy2

The first number on each output line is the request time and the second field is the request path, slowest first. I’m using the “%D” format, which reports the response time in microseconds, for compatibility with older Apache releases. I would prefer milliseconds, but when I tried “%{ms}T” on a server running 2.4.7, it didn’t work; too old. The raw numbers are a bit hard to read, so we can add a little visual aid: commas as thousands separators.

xargs -L 1 printf "%'d %s\n" < /tmp/heavy2 | less

Note that because we’re measuring the total request time, some numbers may be inflated by remote network latency or a slow client. I recommend correlating several samples before blaming a piece of local application code.

Hope this helps you find your long-running requests!