


ocPortal directing visitors to 404.htm

#77961 (In Topic #16002)

Community saint

Since the beginning of December, I have had over 3600 hits with status=200 to the 404.htm page. This seems impossibly wrong, so I started reviewing the logfile. Most of the entries (over 2700) have a referrer of "-" and another 700 have a referrer of http://DOMAIN/404.htm.
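For anyone wanting to reproduce this kind of tally, here is a minimal sketch. It assumes the access log is in Apache's combined log format; the function name, the sample lines, the dates and the example.com domain are all invented for illustration and are not taken from the real logfile.

```python
import re
from collections import Counter

# Combined log format (assumed):
# host ident user [time] "request" status bytes "referrer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def tally_referrers(lines, path="/404.htm", status="200"):
    """Count referrers for requests to `path` that returned `status`."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that are not in the assumed format
        requested = m.group("path").split("?", 1)[0]  # ignore any query string
        if requested == path and m.group("status") == status:
            counts[m.group("referrer")] += 1
    return counts


# Invented sample lines, using documentation-only IP ranges.
SAMPLE = [
    '203.0.113.5 - - [10/Dec/2011:10:00:00 +0000] "GET /404.htm HTTP/1.1" 200 512 "-" "-"',
    '203.0.113.6 - - [10/Dec/2011:10:00:05 +0000] "GET /404.htm HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Dec/2011:10:01:00 +0000] "GET /404.htm HTTP/1.1" 200 512 "http://example.com/404.htm" "Mozilla/5.0"',
    '198.51.100.8 - - [10/Dec/2011:10:02:00 +0000] "GET /index.php HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for referrer, hits in tally_referrers(SAMPLE).most_common():
        print(hits, referrer)
```

Against a real log, the same function could be fed an open file object instead of the sample list.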

I started researching the entries with the referrer "-". Almost 750 of these are from the 209.85.128.0/17 block (209.85.128.1 to 209.85.255.254), which belongs to Google. Another 60 come from 109.242.171.60, and so on.
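Checking whether an address falls inside that block can be done with Python's standard ipaddress module. This is only a sketch: the function name and the list of sample addresses are made up for illustration, although the block itself and 109.242.171.60 are the ones mentioned above.

```python
import ipaddress
from collections import Counter

# The /17 block mentioned above.
GOOGLE_BLOCK = ipaddress.ip_network("209.85.128.0/17")


def count_hits_in_block(ips, network=GOOGLE_BLOCK):
    """Count hits per IP address, keeping only addresses inside `network`."""
    inside = Counter()
    for ip in ips:
        if ipaddress.ip_address(ip) in network:
            inside[ip] += 1
    return inside


# Illustrative addresses only; a real run would take them from the logfile.
SAMPLE_IPS = [
    "209.85.128.1",    # first usable address in the block
    "209.85.255.254",  # last usable address in the block
    "209.85.255.254",
    "209.85.127.255",  # just below the block
    "109.242.171.60",  # the other address mentioned above
]

if __name__ == "__main__":
    hits = count_hits_in_block(SAMPLE_IPS)
    print(sum(hits.values()), "hits from", GOOGLE_BLOCK)
```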

I checked my custom blocklist as well as the software-updated IP ban list, and neither the range nor the IP is blocked. As a precaution I have specifically allowed the 209.85.128.0/17 block to see if that makes a difference. My custom blocklist results in a hard 403 error, so I am pretty sure this is not the issue.

How can I determine what is triggering these? Many (most) occur on entry, but others are triggered after visitors have browsed about for a bit. The one common denominator is that they almost always cause an exit, which means I am losing potentially valuable traffic.

Any help would be appreciated.

Bob

#77989

Bear in mind our default htaccess has…

ErrorDocument 404 /index.php?page=404

To be honest, I don't know how Apache logs this. The manual page, "Custom Error Responses - Apache HTTP Server", says:
"Both the original URL and the URL being redirected to can be logged in the access log."
Note that it uses the term "redirect" in this sentence. I can't tell whether "both" means two separate log lines or both URLs on the same line, and I have no idea what the default Apache settings would have it do.

A common issue with the ErrorDocument directive is that it applies to any 404, including broken images. These often fail 'invisibly', at least in some sense: the user may not notice the broken image, and even if they do, they won't see the HTML page that was delivered in response to the image request. So something as simple as one broken image may actually be logged as a 404.htm call. It certainly will be in ocPortal, because that ErrorDocument rule just asks it to do a normal page delivery for the 404 page, regardless of what the request was for. As I say, though, I have no idea what Apache would log; I've never investigated.
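One way to test the broken-image theory is to look for missing static files in the access log. The sketch below assumes the log records the originally requested URL with a 404 status (which, as noted, has not been verified here) and that each line contains the quoted request followed by the status code. The function name, the extension list and the sample lines are illustrative assumptions.

```python
import re
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit

# Matches the quoted request and the status code that follows it.
REQUEST_RE = re.compile(r'"(?:\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) ')

# Extensions treated as "invisible" page assets; adjust to suit the site.
ASSET_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".ico", ".css", ".js"}


def missing_assets(lines):
    """Count 404 responses per path for requests that look like page assets."""
    missing = Counter()
    for line in lines:
        m = REQUEST_RE.search(line)
        if m is None or m.group("status") != "404":
            continue
        path = urlsplit(m.group("path")).path  # drop any query string
        if splitext(path)[1].lower() in ASSET_EXTENSIONS:
            missing[path] += 1
    return missing


# Invented sample lines for illustration.
SAMPLE = [
    '203.0.113.5 - - [10/Dec/2011:10:00:00 +0000] "GET /themes/logo.png?v=2 HTTP/1.1" 404 210 "http://example.com/" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Dec/2011:10:00:01 +0000] "GET /themes/logo.png HTTP/1.1" 404 210 "http://example.com/" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Dec/2011:10:00:02 +0000] "GET /old-page.htm HTTP/1.1" 404 210 "-" "-"',
    '203.0.113.9 - - [10/Dec/2011:10:00:03 +0000] "GET /themes/style.css HTTP/1.1" 200 900 "-" "-"',
]

if __name__ == "__main__":
    for path, hits in missing_assets(SAMPLE).most_common():
        print(hits, path)
```

Any path that shows up repeatedly here is a candidate for a broken image or stylesheet reference that triggers the 404 page on every page view.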

Separately, ocPortal logs 404.htm calls internally for any kind of missing-resource error, because it is programmed to treat the "missing resource" error screen as equivalent to the 404 page.


#78000

Community saint

After talking with the hosting company, it seems the clue is that the referrer is "-", which probably means unknown. Since many of these hits come from Google IPs at a Council Bluffs, Iowa datacenter and also show no user-agent, it seems that Google is hitting archived links that no longer exist to test for special handling for search engines. This is similar to what Microsoft does (blind UA) to check whether sites contain any code that directs a search engine to a special page.
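A rough way to double-check that an address really is Google's is a reverse DNS lookup followed by a forward lookup of the returned hostname. The sketch below makes several assumptions: the hostname suffixes are the ones commonly cited for verifying Googlebot, the function name is hypothetical, and a Google-owned address that fails this check is not necessarily fake, since not all Google addresses are crawlers. The two lookup functions are passed in as parameters so the logic can be tried without network access.

```python
import socket

# Assumed suffixes for Google crawler hostnames; treat this list as illustrative.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def looks_like_google(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to a Google hostname that
    forward-resolves back to the same address."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no reverse DNS entry
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The forward lookup must include the original address, otherwise the
    # reverse entry could have been set by anyone.
    return ip in addresses


if __name__ == "__main__":
    # Needs network access; the address is one from the block discussed above.
    print(looks_like_google("209.85.128.1"))
```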

In its enthusiasm to ferret out these evil-doers, Google seems to have neglected to drop inactive URLs from its "list", which may not be the index itself. I have been pretty good about removing dead URLs using Google's URL removal tool, so I can't imagine how else this is happening.

This is somewhat consistent with the "Crawler access errors" that occur, although at a higher than one-to-one frequency. Today I found another set of dead URLs reported for the non-www version of my site in Webmaster Tools.

Frankly, it's annoying that I have to deal with URLs long ago dead at the same time that Google refuses to drop "duplicate" URLs that I have identified in their URL parameters area as entries to be ignored. Their response is that they take that input as a suggestion, not a directive. I am hoping the canonicalized links in v8 will resolve this.

I had hoped that they had finally started to honor the requests when, last night, the number of indexed pages dropped from 3900+ down to 1260. Today I'm back up at over 4000 indexed pages, when the number should be around 500 (and that's a generous estimate).

Anyways, Happy Holidays to everyone. This one gets further attention after Christmas.

Bob