Until recently I was under the impression that search engine spiders conformed to “sane” standards. If a page returns an HTTP status code of 404, you would expect the page or link to be removed from the index.
Seemingly this is not the case…
I’ve now been advised that I should try to use a 410 status code instead.
There are some interesting differences between the two HTTP responses:
10.4.11 410 Gone
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server’s site. It is not necessary to mark all permanently unavailable resources as “gone” or to keep the mark for any length of time — that is left to the discretion of the server owner.
While the good old 404 response code is defined as:
10.4.5 404 Not Found
The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.
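Read together, the two definitions amount to a small decision rule. The following is only a sketch of that reading in Python; the function name and parameters are made up for illustration, and the 301 branch is an assumption based on the "no forwarding address" wording rather than something the quoted text spells out.

```python
from http import HTTPStatus
from typing import Optional


def status_for_missing(permanently_gone: bool,
                       forwarding_url: Optional[str] = None) -> HTTPStatus:
    """Pick a status for a resource the server can no longer serve."""
    if forwarding_url:
        # A forwarding address is known, so 410 does not apply (assumed: 301).
        return HTTPStatus.MOVED_PERMANENTLY
    if permanently_gone:
        # The server knows the removal is deliberate and permanent.
        return HTTPStatus.GONE
    # The server cannot tell whether the condition is permanent.
    return HTTPStatus.NOT_FOUND
```

On this reading, switching an entire site's 404 page to 410 would claim permanence for every mistyped URL as well, which is not what the definition of 410 describes.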
That’s a bit of a head wrecker.
I suppose if you know exactly which pages you want to remove, then a 410 response code is easy enough to generate, but in my case it isn’t. I’d either have to change the 404 page to a 410 for the entire site, or simply wait it out.
adam says
Generate a list of pages you want to return 410 for, push not found through mod_rewrite to PHP, consult the list from there and return 410 if there’s a match.
michele says
Sounds complicated 🙁
Rob says
Depending on the popularity of the page in question you may be waiting a VERY long time. One approach would be to use mod_perl or mod_python to catch the offending requests in Apache’s dispatch phase and return 410 then; probably a little simpler than mucking around with mod_rewrite.
In general, Google does stick to the letter of the standards; this has caused trouble before, with the Google Web Accelerator thing getting confused about sessions, for instance.
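The idea of catching the request in application code and answering 410 for a known list of pages could look roughly like this in Python. This is a minimal sketch that uses the standard library's WSGI helpers rather than mod_python itself; the function names are hypothetical, and the list of gone paths is assumed to come from a text file with one path per line.

```python
from wsgiref.util import setup_testing_defaults


def load_gone_paths(lines):
    """Return the set of non-empty, stripped paths from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def make_app(gone_paths):
    """Build a WSGI app answering 410 for listed paths and 404 otherwise."""
    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path in gone_paths:
            status, body = "410 Gone", b"Gone\n"
        else:
            status, body = "404 Not Found", b"Not Found\n"
        start_response(status, [("Content-Type", "text/plain"),
                                ("Content-Length", str(len(body)))])
        return [body]
    return app


def status_of(app, path):
    """Call the app once with a synthetic request and return its status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = []
    app(environ, lambda status, headers: captured.append(status))
    return captured[0]
```

Stripping each line while loading means a plain set lookup works despite the trailing line breaks. Note that this sketch matches on the path only, so a URL with a query string would need extra handling.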
adam says
Not really. The only problem is that you can’t use Apache’s ErrorDocument, as the status is already set by the time it gets there; so you need to push /everything/ through mod_rewrite. That isn’t as bad as it sounds though, the overhead isn’t really that heavy unless the site is /very/ large and busy.
So you’d use a mod_rewrite setup quite similar to WordPress, like this:
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /410handler.php [L]
Then you’d put the list of pages you want to return 410 status into a text file, one per line, and create 410handler.php like this:
That’s it, unless the missing pages are in WordPress; in which case you’d drop the mod_rewrite stuff and do the same thing as an extension.
adam says
WordPress killed the PHP, let’s try it without tags.
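$pages = file('410pages.txt'); // one request URI per line
foreach ($pages as $page) { // can't use in_array() because of the line break
    $page = trim($page);
    if ($page !== '' && $page == $_SERVER['REQUEST_URI']) {
        header("HTTP/1.0 410 Gone");
        exit; // stop here so the 404 below is not sent as well
    }
}
// Anything else that reaches this handler is an ordinary missing page.
header("HTTP/1.0 404 Not Found");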
Anthony says
What about IIS? We’ve recently overhauled dublin.ie and all links from Google are now returning 404. I’d rather return 410.
adam says
Migrate to LAMP. 🙂
Rob says
For Dublin.ie, it might make more sense to have sensible permanent redirects to (approximately) the right places. You can do that in IIS, right?
Alternatively, it’s always better to leave things where they are if possible. I’m currently replacing a PHP site with one written in Common Lisp; it may end up with .php extensions for this reason 🙂
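Sending old URLs to approximately the right place can be thought of as a prefix lookup. The sketch below is in Python rather than anything IIS-specific, and the function name and any paths passed to it are purely hypothetical; the server would then answer 301 with the returned value as the Location header.

```python
from typing import Dict, Optional


def redirect_target(path: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Return the new location for an old path, or None if nothing matches.

    The longest matching old prefix wins, and the rest of the path is kept.
    """
    for old in sorted(prefix_map, key=len, reverse=True):
        base = old.rstrip("/")
        # Match the prefix itself or anything below it, but not "/eventsfoo".
        if path == old or path.startswith(base + "/"):
            return prefix_map[old].rstrip("/") + path[len(base):]
    return None
```

A path with no matching prefix returns None, at which point the choice between 404 and 410 applies as before.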
Hugh says
You could just use Google’s URL removal tool.
adam says
Here you go Michele:
Which Google Webmaster Tools Do You Want?
Google’s Matt Cutts told me he’ll be reading along for inspiration, so who knows, your wish might trigger something…