I have our production web servers set up to email me notifications when unhandled exceptions occur, and if a site is publicly accessible, crawlers, spiders, and other types of search bots can make this a pain. Most crawlers revisit the pages they have in their index and pull new content. If you have promo-type pages that are only up for a limited amount of time, crawlers will try to access those URLs after the pages are gone, and bam ... another email in my inbox.
There are CAPTCHA components out there, but they aren't really appropriate for this scenario ... I don't want site users to have to read some squiggly letters and type those out before the site sends me an error email. So I have written some pretty simple code that tells me whether a request came from a crawler. My 10 second Google search on the subject didn't turn up much, so maybe this will help someone else out there. If you figure out a way to improve it, please post a comment or contact me.
The solution has two parts. I have a regular expression pattern stored in the web.config, and a single "IsBot" property that returns true if the current request is from a crawler and false otherwise. I store the pattern in the web.config because it is still evolving, and it is probably far from "all encompassing" ... but when an error email from a bot slips through, I look at the user agent it presents itself as and add a new keyword to the pattern to detect that crawler in the future. If I were to guess, I would say the pattern posted here probably detects over 90% of hits from crawlers ... as of today.
Current pattern in the web.config's appSettings section:
<add key="botRegex" value="bot|crawler|spider|slurp|ask|teoma" />
C# IsBot property:
/// <summary>
/// Returns true if the current request is from a bot crawling the site, and false otherwise.
/// Requires the System.Configuration, System.Text.RegularExpressions, and System.Web namespaces.
/// </summary>
public static bool IsBot
{
    get
    {
        // If there is no current context, the executing thread has no access to the
        // current request's properties. Since we can't pull any agent information,
        // we have to assume this is not a bot.
        if (HttpContext.Current == null)
            return false;

        // A missing User-Agent header is treated as an empty string.
        string userAgent = HttpContext.Current.Request.ServerVariables["HTTP_USER_AGENT"] ?? "";

        // Without a configured pattern there is nothing to match against
        // (and passing a null pattern to Regex would throw).
        string expression = ConfigurationManager.AppSettings["botRegex"];
        if (string.IsNullOrEmpty(expression))
            return false;

        // Check whether the user agent contains any of the terms in the botRegex set in the web.config.
        return Regex.IsMatch(userAgent.ToLower(), expression);
    }
}
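To see how the pattern behaves without spinning up a web site, here is a minimal console sketch of the same check. It is only an illustration: BotPatternDemo and LooksLikeBot are hypothetical names, the pattern is hard-coded rather than read from the web.config, the user agent is passed in as a parameter instead of being pulled from HttpContext.Current, and the sample user agent strings are examples I typed in, not entries from real logs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BotPatternDemo
{
    // Copy of the botRegex value from the web.config, hard-coded here
    // only so the sketch runs without a configuration file.
    private const string BotPattern = "bot|crawler|spider|slurp|ask|teoma";

    // Hypothetical helper: same matching logic as IsBot, but it takes the
    // user agent as a parameter instead of reading HttpContext.Current.
    public static bool LooksLikeBot(string userAgent)
    {
        // Regex matching is case-sensitive by default, so lowercase first.
        return Regex.IsMatch((userAgent ?? "").ToLower(), BotPattern);
    }

    public static void Main()
    {
        // Illustrative user agent strings (assumed samples).
        string[] samples =
        {
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        };

        foreach (string agent in samples)
        {
            // Only send the error email when the request is not from a crawler.
            string action = LooksLikeBot(agent) ? "skip email" : "send email";
            Console.WriteLine(action + " <- " + agent);
        }
    }
}
```

With these samples, the first two strings are flagged as crawlers (they contain "bot" and "slurp") and the browser string is not. Keep in mind that the pattern is a plain substring match, so a short keyword such as "ask" will also match any user agent that merely contains those letters. Passing RegexOptions.IgnoreCase to Regex.IsMatch would be an alternative to lowercasing the user agent first.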