PROLOGUE
This is where the “meandering” part of this blog comes into play… I tend to veer off on tangents for some posts. The handyman and I.T. sides of me have always admired creative, simple solutions to everyday problems, so I thought I would share something technical that, in my opinion, falls into that arena. The following image has nothing to do with the content of this post, but somehow I do feel it’s complementary to the general theme.
WEB ROBOTS BACKGROUND
First, let’s learn what a “robots.txt” file is:
“Robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).
In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.” – moz.com
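To make that concrete, here’s a minimal, illustrative robots.txt — the paths and bot name are made up, and a real file would use your own site’s directories:

```
# Illustrative example only
User-agent: *            # rules for all robots
Disallow: /private/      # please don't crawl this folder
Allow: /                 # everything else is fine

User-agent: SomeBadBot   # a specific (hypothetical) robot
Disallow: /              # please stay out entirely
```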
Some who have web-based blogs may not realize their website is being searched or crawled daily by web robots. The Google search bot is one of the most well-known web robots, but every Internet search site or web index will have its own robot. Then there are the black-ops sites whose robots search for and harvest email addresses, hidden pages, links, and other data they deem of value within a site. Harvested email addresses can quickly turn up in colossal spam lists or be used as targets for break-in attempts.
Web-based blog owners may also discover that their website is suddenly maxing out or using more of their CPU or bandwidth allowances while their visitor traffic hasn’t increased. This consumption of resources may be due to web robots. I’ve experienced this high resource usage with Meandering Passage, and only after some research did I discover it was due to misbehaving web robots.
A PROBLEM AND COMMON COUNTERMEASURES
Good web robots can be easily controlled using a “robots.txt” file in the root directory of a website — that’s how it’s supposed to work. You can find multiple instructions on how to create and use a robots.txt file on the web. My robots.txt file is set to allow the major search engines access (at reasonable crawl rates) to the posts area of my blog while denying access to admin areas. All other robots are designated to be completely denied access to all areas. Interestingly, most bad web robots (those that don’t obey instructions) will read the robots.txt file looking for denied or protected links/folders/pages/files and then crawl them along with everything else.
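As a rough sketch (not my exact file), a robots.txt along those lines might look something like this — the bot names, paths, and crawl rate are only examples, and not every crawler honors the Crawl-delay directive:

```
# Illustrative example only
User-agent: Googlebot
User-agent: Bingbot
Disallow: /wp-admin/     # keep the big search engines out of admin areas
Crawl-delay: 10          # request a reasonable crawl rate (not universally honored)

User-agent: *            # everyone else
Disallow: /              # completely denied access to all areas
```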
As a solution to this unauthorized robot searching of your website, you can install one of the major WordPress security plugins, which along with many other features will stop many of these bad bots. However, I’ve found many of these packages to have a good deal of overhead (CPU and resources), and in most cases they will nag you about subscribing to their premium service. They may also require you to judge whether some web robot is bad or not. While many of these security plugins are very capable, for my needs they seemed like overkill, as I prefer more straightforward, focused solutions.
I’ve run the WordPress plugin “Bad Behavior” for years, and it has, without my intervention, effectively blocked many of the bad bots and other troublesome visitors. It’s lightweight, doesn’t use a lot of resources, and most importantly, I’ve never known it to block something I didn’t want. So, I’m keeping it activated. But when checking my website logs, I noticed there were a few questionable bots getting by “Bad Behavior.”
ONE SIMPLE SOLUTION
By chance, I came across a simple plugin solution for bad bots named Blackhole by Jeff Starr. This plugin uses a bait-and-catch method which I think is ingenious. Setting the plugin up, you define a fictitious folder/path in the Blackhole settings (“?blackhole” is the default value) and then add that same folder/path as a disallow line for all robots in your robots.txt file… this is the bait. The plugin provides easy instructions on how to do this. Once your robots.txt file is modified, activate the Blackhole plugin.
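For reference, the bait line added to robots.txt ends up looking something like this when using the default “?blackhole” value (follow the plugin’s own instructions for the exact rule it expects):

```
User-agent: *
Disallow: /?blackhole
```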
When active, the Blackhole plugin quietly and efficiently monitors all traffic to your website. Good web robots will follow the instructions and not try to visit “?blackhole,” and the only way any visitor can know of the nonexistent “?blackhole” path is to read the robots.txt file. Remember, almost all robots (good or bad) will read the robots.txt file. So, if a robot is bad (i.e., doesn’t follow instructions) and it tries to visit “?blackhole”…BAMM, the Blackhole plugin adds its IP address to its blacklist, blocks the robot from any further access, and tries to identify the robot and its location. It is then permanently barred unless you add it to the whitelist or remove it from the blacklist. A simple and foolproof solution.
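To make the bait-and-catch idea a little more concrete, here’s a rough conceptual sketch in WordPress-style PHP. This is not the Blackhole plugin’s actual code — the blacklist file name, messages, and logic are simplified stand-ins for illustration only:

```
<?php
/**
 * Conceptual sketch of a bait-and-catch trap (NOT the Blackhole plugin's code).
 * Any visitor that requests the bait path forbidden by robots.txt gets its IP
 * recorded, and is refused on every later visit.
 */
add_action( 'init', function () {
    $blacklist_file = __DIR__ . '/bot-blacklist.txt';   // hypothetical storage location
    $ip  = $_SERVER['REMOTE_ADDR'] ?? '';
    $uri = $_SERVER['REQUEST_URI'] ?? '';

    $blacklist = file_exists( $blacklist_file )
        ? file( $blacklist_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES )
        : array();

    // Already trapped on an earlier visit? Deny everything.
    if ( in_array( $ip, $blacklist, true ) ) {
        wp_die( 'Access denied.', 'Blocked', array( 'response' => 403 ) );
    }

    // Visitor ignored robots.txt and requested the bait path: trap it.
    if ( false !== strpos( $uri, '?blackhole' ) ) {
        file_put_contents( $blacklist_file, $ip . PHP_EOL, FILE_APPEND );
        wp_die( 'Access denied.', 'Blocked', array( 'response' => 403 ) );
    }
} );
```

The real plugin layers the whitelist and the bot/location identification described above on top of this basic idea.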
I’ve been running the Blackhole plugin for four months with no problems. At this point, it’s “captured” 10 bad robots that slipped by all other countermeasures, and I know without guessing that they are…BAD BOTS!
🎶 “Bad Bots, Bad Bots, What You Goin To Do? What You Goin To Do When They Come For You?” 🚓 :-)
Note: I didn’t include any links to the mentioned plugins as I thought they would be easy enough to search for via WordPress.
COMMENTS
Interesting, Earl. Although I admit that I wouldn’t know how to tell that any of those things were even happening. I sometimes worry about how much I don’t know!
Probably not interesting to many, and my I.T. background definitely gives me a leg up in this area. Perhaps it’s not about what we know at any one moment but just being open to learning when we need to.
That is ingenious! I’ll have to install that. Bad, Bot! Bad, Bot!!! No! :)
Yes, I thought it a brilliantly slick way to test and capture robots that don’t play by the rules.
I never knew that, and I’m still not sure I understand all of it. I’m going to check into it more.
Monte, when it all just works, it’s sometimes hard to be aware of or even imagine the many small fragments, protocols, and standards that make it all “click.” I’ve picked up a lot of it because of my career and because I find it fascinating, but some of it has come to light through managing this blog under four different hosting providers and the related issues encountered and solved. :-) I don’t know that this post’s info will be of benefit to everyone, or anyone for that matter, but I thought I’d share it in case it might save someone a little time. Take care!