How to Block Unwanted Bots from Your Website with .htaccess

by Christopher Heng, thesitewizard.com

I received requests from a few webmasters some time ago asking me if there was a way to block unwanted bots from their website. This article shows you how you can do this using .htaccess.

Preliminary Information

  1. What are bots?

    "Bots", for those not familiar with the term, are basically computer programs that "surf" multiple websites to perform a variety of automated tasks. It's short for "robots". Examples of bots include those used by the search engines. Those bots retrieve a copy of your web page so that they can include relevant terms from that page in their search index. Not all bots are benign however. Some bots go through your website looking for web forms and email addresses to send you spam. Other bots probe your website for security vulnerabilities.

  2. Who is this article for?

    Before you rush to implement the things suggested by this article, I should probably mention the following prerequisites.

    1. Your website must be hosted on an Apache web server, and your web host must have a facility known as ".htaccess overrides" enabled. If this is not the case, adding the directives described here will not work, and may even bring your site down with a server error. In practice, this usually means that your website is hosted with a commercial web host, since most free web hosts don't allow you to override server behaviour using .htaccess.

    2. You need to be able to check your site's raw web logs. Again, this probably means that you are using a commercial web host rather than a free one. If all you have is the web statistics provided by your web host or by a free web statistics and analytics service, you won't be able to get the information you need to block the bot.

    3. You need to have a specific bot that you wish to block. If you arrived at this page hoping to find a list of bots to block, you're at the wrong place. This article is a practical guide designed to help webmasters who already know what they want to block.

  3. Blocking unwanted bots is like trying to rid the world of pests.

    Don't think that you can really get rid of all unwanted bots from your website using the method described here. As I mentioned to one user who asked me for help, trying to block all undesirable bots from your site is like trying to rid the world of pests. Swat one, and another few will take its place. This doesn't mean that you can't try, of course. I'm just saying this so that you don't get your hopes too high about what you can actually achieve.

How to Identify the Bot You Want to Block

Before you can block a bot, you will need to know at least one of two things: the IP address where the bot is coming from or the "User Agent string" that the bot is using. The easiest way to find this is to look into your raw web log.

Download your web log from your web host, uncompress it using an archiver, and open it in an ASCII text editor. You'll probably need a more capable editor than Notepad if your logs are large. If you have a search and replace utility like those listed on the Free Text Search and Replace Utilities page, you can use one of those instead of the editor. Search through the file for the bot you want to block. It helps if you know either the page it tried to access or the time it hit your website, so that you can narrow your search down.
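If your computer has command-line tools available, a quick way to narrow the search is with grep. The log lines, bot name and addresses below are invented for illustration, but real logs in Apache's common "combined" format follow the same general layout:

```shell
# Create a tiny sample log in the "combined" format many Apache hosts use:
# IP, identity, user, timestamp, request, status, bytes, referrer, user agent.
cat > access.log <<'EOF'
203.0.113.5 - - [10/Oct/2008:13:55:36 -0700] "GET /contact.html HTTP/1.1" 200 2326 "-" "SpammerRobot/5.1 (+http://www.example.com/bot.html)"
198.51.100.7 - - [10/Oct/2008:14:02:01 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 5.1; rv:10.0)"
EOF

# Search by a unique part of the user agent string...
grep "SpammerRobot" access.log

# ...or by IP address (escape the dots, since "." is a wildcard in grep).
grep "203\.0\.113\.5" access.log
```

Each matching line shows you both the IP address (at the start of the line) and the user agent string (at the end), which are the two pieces of information you need for the sections that follow.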

Once you've located the entries that belong to the bot, look for the IP address and the user agent string.

The IP address is a series of four numbers separated by dots, such as "127.0.0.1". The "User Agent string" is just the name that the program accessing your site goes by. For example, version 9.51 of the Opera web browser has a user agent string of "Opera/9.51 (Windows NT 5.1; U; en)" (among others), while the Google search engine bot goes by "Googlebot/2.1 (+http://www.google.com/bot.html)" (among others). You won't need to know the entire user agent string. Just find some part of the user agent string that is unique to that particular bot, that is, a part that no other bot or web browser uses.

Note the IP addresses used by the bot and the user agent string.

Be careful though. Just because a bad bot has visited your website using a particular IP address does not mean that blocking that IP address will rid you of that bot forever. Some viruses and malware infect an ordinary computer user's machine and turn it into a machine that sends spam and probes sites for vulnerabilities. The IP address that you plan to block may well belong to such an ordinary person, assigned dynamically by their internet provider. When that user disconnects from the Internet, and another user logs in, the internet provider could assign the new user the same IP address. When you block by IP address, you may end up blocking an entire internet provider's customers, and thus a lot of real users and potential visitors.

Likewise, many bad bots intentionally use User Agent names that correspond to normal web browsers. As such you won't be able to tell from the user agent alone whether it's a bot or a real user. If you wantonly block user agents with the name of "Mozilla", for example, you could end up blocking nearly every human from your website.

In general, if you don't know what you're doing, it's best not to block anything, unless you don't mind inadvertently blocking users and perhaps even whole countries (if you're especially careless).

Download Your .htaccess File

Once you know the bot's IP address or user agent string, connect to your site using an FTP or SFTP client. Go to the top web directory of your site, where your home page is located. Look for a file named ".htaccess". If it exists, download it to your computer.

If it doesn't exist, make sure that it is not hidden from your view. Depending on the FTP program you use, you may need to log off, set a "Remote file mask" of "-a" (without the quotation marks) in the options for the program, and log in again to check.

(The "remote file mask" is the term used in the FTP client that I use. Your program may use a different term.)

Alternatively, log into your site using your web host's control panel. Most commercial web hosts allow you to access your web directories from your web browser and download files that way. If your host has a setting to "show hidden files" or the like, make sure you enable it to look for the .htaccess file.

If, after all your efforts to find it, you cannot locate any .htaccess file in the top web directory of your site, don't worry. It's quite normal not to have any in the default setup for most web hosts. You will simply have to create one yourself. The reason we went to all that trouble to locate it is that if one exists, you will need to get it so that you can add to the settings already present in that file. If you don't, and you create one from scratch to overwrite an existing one, you may inadvertently wipe out some other settings that you want for your site.

Open or Create the .htaccess File

If you've managed to get the .htaccess file, open it in an ASCII text editor (like Notepad). If one does not exist, use the editor to create a new blank document. The rest of this article will assume that you have already started the editor, either with the .htaccess file open or with a blank document if no .htaccess file previously existed.

WARNING: do not use a word processor like Microsoft Word or WordPad to create or edit your .htaccess file. Such programs embed formatting codes in the file, and your site will mysteriously fail when you upload the file to your web server.

How to Block by IP Addresses

To block a certain IP address, say, 127.0.0.1, add the following lines to your .htaccess file. If your file already has some content, just move your cursor to the end of the file, and add the following on a new line in the file. If you don't have an existing .htaccess file, just type it into your blank document. You should of course change the numbers "127.0.0.1" to point to the correct IP address you want to block.

Order Deny,Allow
Deny from 127.0.0.1

The first line has the effect that if the web server encounters a request that matches any Deny rule, it will deny the request. If the request does not match any Deny rule, it will be allowed. This is generally the behaviour that most people want for the normal web directories on their site.

The second line sets the rule that if a request comes from the IP address "127.0.0.1", the web server is to deny the request. The program making that request will receive the "Forbidden" error instead of the normal page at that address.
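Because "Order Deny,Allow" causes the server to evaluate Deny rules before Allow rules, you can also carve out exceptions: an "Allow from" line overrides a broader "Deny from" line for the addresses it names. A sketch with placeholder addresses, blocking a range while still admitting one specific machine:

```apache
Order Deny,Allow
Deny from 192.168.1
Allow from 192.168.1.100
```

Here "192.168.1" is a partial IP address, which Apache treats as matching every address beginning with those numbers, while "192.168.1.100" is exempted from the block.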

If you have more than one IP address to block, just add another "Deny from" line with that IP address underneath. For example, if you also want to block "192.168.1.1" in addition to "127.0.0.1", the code to use is as follows:

Order Deny,Allow
Deny from 127.0.0.1
Deny from 192.168.1.1

You may add as many IP addresses as you wish, although if your .htaccess file becomes very large, your site may become sluggish due to the number of rules the server has to process each time it has to deliver your site's pages.
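If a bot keeps returning from nearby addresses, Apache also lets a single Deny rule cover a whole range, either as a partial IP address or in CIDR notation. The addresses below are placeholders; only use range blocks if you are sure the whole range is unwanted, for the reasons given earlier:

```apache
Order Deny,Allow
# A partial IP address matches every address that begins with it:
Deny from 192.168.1
# CIDR notation is also accepted; this blocks 10.0.0.0 through 10.0.0.255:
Deny from 10.0.0.0/24
```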

How to Block by User Agent String

To block a bot by a user agent string, look for a part of the user agent string that is unique to that robot and that contains ordinary letters of the alphabet with no spaces, slashes or punctuation marks (unless you are familiar with regular expressions).

For example, if you are planning to block a robot that has this user agent string, "SpammerRobot/5.1 (+http://www.example.com/bot.html)", and you decide that the portion "SpammerRobot" is unique to this robot, add the following lines to your .htaccess file.

As in the case of blocking by IP address, add the lines to the end of the file if you already have an existing .htaccess file. Otherwise, type the lines into your blank document. You should of course change "SpammerRobot" to the actual user agent you want to block.

BrowserMatchNoCase SpammerRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot

The first line tells the web server to check the user agent string of the program making the request. If the user agent string contains the word "SpammerRobot", it will set an "environment variable" (a sort of internal flag used by the server) called bad_bot. Note that the word "SpammerRobot" can be in any mixture of capital (uppercase) or small (lowercase) letters. If you only want to match the exact case, use BrowserMatch instead of BrowserMatchNoCase. In addition, I simply made up the name "bad_bot" for the purpose of this article. You can call your environment variable some other name if you wish, although if you're not familiar with the rules for naming such variables, just accept "bad_bot".

The second line has already been explained above, in the section on blocking by IP address.

The third line tells the server to deny the request if it finds that an environment variable called "bad_bot" has been set.
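For reference, BrowserMatchNoCase is a convenient shorthand provided by Apache's mod_setenvif module for the more general SetEnvIfNoCase directive, which can match any request header, not just the user agent. The lines below do the same thing as the example above ("SpammerRobot" is still an invented name):

```apache
SetEnvIfNoCase User-Agent "SpammerRobot" bad_bot
Order Deny,Allow
Deny from env=bad_bot
```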

To add more user agent strings to your block list, just add another "BrowserMatchNoCase" line. For example, if you want to block "SecurityHoleRobot" in addition to "SpammerRobot", the lines to use are:

BrowserMatchNoCase SpammerRobot bad_bot
BrowserMatchNoCase SecurityHoleRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot

(Before you ask, "SpammerRobot" and "SecurityHoleRobot" are just names I invented for this article. As far as I know, these robots don't exist.)

Note that your .htaccess file can contain block rules for both user agents and IP addresses. Just put them all in the same file. An example of a .htaccess file with rules to block both by IP address and by user agent string is as follows:

BrowserMatchNoCase SpammerRobot bad_bot
BrowserMatchNoCase SecurityHoleRobot bad_bot
Order Deny,Allow
Deny from env=bad_bot
Deny from 127.0.0.1
Deny from 192.168.1.1

There's no need to repeat the "Order Deny,Allow" line when you combine the rules.
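One caveat: the "Order" and "Deny from" syntax used in this article is for Apache 1.3, 2.0 and 2.2. Apache 2.4 only accepts it if your host loads the mod_access_compat module; on an Apache 2.4 server without it, the equivalent of the combined example above uses the newer Require directives instead ("SpammerRobot" and "SecurityHoleRobot" remain invented names, and the IP addresses are placeholders):

```apache
BrowserMatchNoCase SpammerRobot bad_bot
BrowserMatchNoCase SecurityHoleRobot bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
    Require not ip 127.0.0.1
    Require not ip 192.168.1.1
</RequireAll>
```

If you are not sure which version your host runs, check its documentation or ask its support staff before choosing a syntax.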

Uploading the .htaccess File

Once you've finished blocking unwanted bots in your .htaccess file, save the file. If you are using Notepad to create a new document, remember to save the file as ".htaccess", including the quotation marks, otherwise Notepad will add a ".txt" extension to your filename.

Then upload the file to your web server using an FTP/SFTP program (or with your web host's control panel). If you want to use an FTP program, and don't know how to do so, check out my tutorial on How to Upload a File to Your Website Using the FileZilla FTP Client.

Conclusion

Properly implemented, the method described in this article will allow you to block specific bots from accessing your website by either their IP address or their User Agent string.

This article can be found at http://www.thesitewizard.com/apache/block-bots-with-htaccess.shtml

Copyright © 2008-2012 by Christopher Heng. All rights reserved.
Get more free tips and articles like this, on web design, promotion, revenue and scripting, from http://www.thesitewizard.com/.
