robots.txt

#71137 (In Topic #15008)

Looked and couldn't find anything in the forums.

I'm curious what others' robots.txt files look like. Which directories are you keeping the search engines out of besides adminzone?

Anyone care to post theirs as an example?

Bob

#71139

robots.txt won't be what's keeping the search engines out of the admin zone. It's your normal site privileges, so the robots will see only what guests can see.
 

Do you have a Samsung Galaxy S / Galaxy S II ? If so, why not check out my ScreenFree FM Radio .

#71142

That's true. But are there any areas with access that search engines should be dissuaded from (e.g., search)?

I'm always conflicted about robots.txt as it basically provides a partial roadmap to your site.

Bob

#71144

Robots shouldn't be filling in forms (which is what the search field/page is), so I don't think it will be an issue.

Search engines will make their own roadmaps even without a robots.txt because they will follow every link on your site, so that is generally a moot point.

I'm sure there are some beneficial things that could be put in robots.txt, like media related restrictions, but I haven't bothered looking into it (yet!).

#71147

It is a little more than hypothetical for me. I have my site password-protected while I'm in development. But my main contributor was having issues getting into the site while he was visiting with family promoting the site so I had to remove the password. Unfortunately, Mr. Googlebot decided to pay a visit during these few days and as a result I had over 290 pages indexed, much to my surprise. I just glanced over them quickly but I noticed that many of them started with "Search". I can't review them now because I had Google remove the URLs from its index.

I'd love to think that I have 290+ unique pages at this point but the truth is that there are probably between 50 and 70 while I am awaiting material.

I wish I had kept a list of the URLs so I could demonstrate what I saw. On the upside, it was pretty remarkable to me that Google indexed that many pages in so short a time. It is a tribute to ocPortal's design (plus good copy and keywords) that it indexes so well.

Bob

EDIT: It just dawned on me that those entries were probably indexed due to the tag cloud. I need to look at this more carefully as I don't want duplicate content issues.


Last edit: by BobS

#71150

BobS said

EDIT: It just dawned on me that those entries were probably indexed due to the tag cloud. I need to look at this more carefully as I don't want duplicate content issues.

Hmmm, this is confusing.

Looking at /site/pages/modules/search.php, it appears that a <meta name="robots" content="noindex" /> is added to the header when the output is being built for the search results page.
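
One rough way to confirm that a rendered page really carries that tag is to parse its HTML and look for a robots meta element. This is only a sketch using Python's standard html.parser; the sample markup is a hypothetical page modelled on the tag quoted above, not actual ocPortal output.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            # Split values such as "noindex, nofollow" into single directives.
            parts = attr_map.get("content", "").split(",")
            self.directives.extend(p.strip().lower() for p in parts if p.strip())


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Hypothetical page source for illustration only.
sample = '<html><head><meta name="robots" content="noindex" /></head><body></body></html>'
print(has_noindex(sample))
```

Feeding it the saved source of a search results page would show whether the noindex signal actually reaches the browser (and therefore the crawler).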

I really wish I had kept those site index results so I could track this down but I definitely remember seeing "Search" as part of the index entry.

I wonder how upset Google will be with me if I reinclude the items long enough to get a list and then turn around and ask for a URL removal again.

Bob

#71151

I'm sure it was only/mainly the tags. Google shows my site as having 627 pages. I wish!

I'll try adding the following to my robots.txt to see if it kills indexing of tags.

Code

User-agent:*
Disallow:/cms/site/pg/search/results/

Note that as my site is installed at mydomain.net/cms, I have an extra /cms in the disallow path.
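
A quick way to sanity-check a rule like this before waiting on Google is Python's standard urllib.robotparser. This is a sketch only: the "Googlebot" agent name and the test paths are illustrative assumptions, and it simply shows that the rule matches by path prefix, so the /cms part has to be present.

```python
from urllib.robotparser import RobotFileParser

# The two lines from the post above, unchanged.
rules = """\
User-agent:*
Disallow:/cms/site/pg/search/results/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path):
    # "Googlebot" is just an example agent; the rule applies to every agent.
    return parser.can_fetch("Googlebot", "http://mydomain.net" + path)


print(allowed("/cms/site/pg/search/results/"))  # path covered by the rule
print(allowed("/site/pg/search/results/"))      # same page path without /cms
print(allowed("/cms/site/pg/news/"))            # hypothetical unrelated page
```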

#71152
Let us know how it works out.

It's surprising that this was never mentioned before. I'd expect that Google would be having a conniption fit over duplicate content. Maybe it's nothing but we both had way more pages indexed than what we had. Something's not right.

Bob

#71153

I've also submitted "/cms/site/pg/search/results/" for removal, so if they get removed it should mean that robots.txt will work also.

The problem is that Google does not, and cannot, see the search results pages as duplicate content, because the search term in the URL also appears in the results page as "Your search for "bla" gave n results:". So it will see each one as unique.

Another one that I have noticed is "/cms/data/iframe.php" as in:

Code

"/cms/data/iframe.php?zone=site&wide_high=1&utheme=Green&page=search&type=results&content=bla&only_search_meta=1&all_defaults=1&search_news=1"
which, amongst other things, returns gallery search results. I'm reluctant to exclude it just at the moment because I'm not sure what it will do to the members page. Should be fine, but I need to experiment with it first.
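
To see what that URL is actually asking for, it can be split into its path and query parameters with Python's standard urllib.parse. This is only a sketch for reading the URL quoted above; the is_search_results helper is a hypothetical name introduced for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The URL quoted above, unchanged (split over two lines for readability).
url = ("/cms/data/iframe.php?zone=site&wide_high=1&utheme=Green&page=search"
       "&type=results&content=bla&only_search_meta=1&all_defaults=1&search_news=1")

parts = urlsplit(url)
# Each parameter appears once here, so keep only the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}


def is_search_results(p):
    """Hypothetical helper: does the query describe a search results page?"""
    return p.get("page") == "search" and p.get("type") == "results"


print(parts.path)
for key, value in params.items():
    print(f"{key} = {value}")
print(is_search_results(params))
```

The path is just iframe.php; it is the page=search and type=results parameters that make it a search results fragment, which is why a Disallow on the script path would also catch every other use of iframe.php.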

#71154

As usual, you are 20 steps ahead of me.

I do appreciate you looking into the issue.

Bob

#71155

I hadn't really thought about search too much other than getting the site indexed in the first place. I've got time at the moment to play around with it, so it's as good a time as any.

#71156

This is what I've come up with so far:

Code

User-agent:*
Disallow:/cms/site/pg/search/results/
Disallow:/cms/pg/feedback/
Disallow:/cms/data/snippet.php
Disallow:/cms/data/iframe.php
Disallow:/cms/site/catalogue_file.php
Disallow:/cms/*slideshow=1

Now it's just a waiting game to see what Google does with it.
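
The last rule uses a "*" wildcard, which Google and other major crawlers honour but which not every robot understands. A minimal sketch of how such patterns are usually interpreted ("*" matches any run of characters, a trailing "$" anchors the end, and everything else is a prefix match) is below. The matcher and the test URLs are illustrative assumptions, not how any particular crawler is implemented.

```python
import re

# The Disallow values from the list above, unchanged.
DISALLOW = [
    "/cms/site/pg/search/results/",
    "/cms/pg/feedback/",
    "/cms/data/snippet.php",
    "/cms/data/iframe.php",
    "/cms/site/catalogue_file.php",
    "/cms/*slideshow=1",
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each "*" match any characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


RULES = [pattern_to_regex(p) for p in DISALLOW]


def is_disallowed(path_and_query):
    # re.match anchors at the start, which mirrors prefix matching.
    return any(rule.match(path_and_query) for rule in RULES)


# Hypothetical URLs for illustration only.
print(is_disallowed("/cms/site/pg/galleries/?slideshow=1"))
print(is_disallowed("/cms/data/iframe.php?zone=site&page=search"))
print(is_disallowed("/cms/site/pg/news/"))
```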

#71176

Looks like a good start. I will be interested to hear your results on Google.

I'm several weeks away from launch and my site is locked up tight once again so I won't be able to check from my end.

Has anyone else crafted a robots.txt file they'd like to share?

Bob

#71213

At the moment, I've got the results down from 627 to 372, with my best being 117.

I'm still tweaking as I learn what works and what doesn't.

#71215

Great, temp. Thanks for the info.

I hope Chris pops into this thread. Some of these restrictions should probably be included in the default recommended.htaccess file. But it is likely very dependent on what add-ons are installed and what blocks are used.

Bob

#71217

These are indexing restrictions that should be in robots.txt and not in .htaccess. You wouldn't want to actually block any of these, just stop them from being indexed, and the best place for that is robots.txt.

All of the heavy hitters are from the base install, so the impact of most add-ons should be minimal.

#71228

As I was asked to comment ;)

You pretty much answered it yourself though, Bob: we are putting out signals from inside ocPortal so that we don't have to worry about robots.txt. Of course you may have good reasons for having one, and we don't say not to use it. It's just better if we can code sensible defaults into the product so that it's not needed to achieve the basic reasonable behaviours.

Internal search engine results should not come up on Google. If they do, please show me an example and I can investigate it as a bug.

There are a lot of rel="nofollow" attributes in ocPortal templates, especially in the calendar (as that has infinite screens). I just added one to the slide-show button based on reading one of temp1024's posts above.
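
For anyone who wants to see which links on a page carry that attribute, a small audit along these lines may help. It is a sketch using Python's standard html.parser, and the markup and URLs in it are hypothetical, not taken from an ocPortal template.

```python
from html.parser import HTMLParser


class NofollowAudit(HTMLParser):
    """Sorts the links on a page by whether they carry rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        href = attr_map.get("href")
        if href is None:
            return  # anchors without an href are not links
        # rel may hold several space-separated tokens.
        rel_tokens = attr_map.get("rel", "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


# Hypothetical markup for illustration only.
page = """
<a href="/cms/site/pg/calendar/?next=1" rel="nofollow">Next month</a>
<a href="/cms/site/pg/news/">News</a>
"""

audit = NofollowAudit()
audit.feed(page)
print("nofollow:", audit.nofollow)
print("followed:", audit.followed)
```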


#71232

Thanks for chiming in Chris.

Here are some search bugs for your consideration (I'll discuss them in terms of the disallow list from one of my previous posts above):

1) Disallow:/cms/site/pg/search/results/

These look like they are primarily from tags (and I assume the same would apply to tag clouds, which I don't use).

I can't think of a single reason why any search link would need to be indexed as all the content should be available via regular navigation links.

2) Disallow:/cms/pg/feedback/

Every feedback/contact us link is being picked up as a unique page rather than as just one feedback page for the entire site.

3) Disallow:/cms/data/snippet.php

While on my site there is currently only one item returned in the search results (a link to "lastgamer.net/cms/data/snippet.php?snippet=captcha_wrong&name=&ei=xesJTvL…" that goes to a blank page), I've included it because "snippet.php" looks to me like it could pop up again in other situations.

4) Disallow:/cms/data/iframe.php

These are links to search results fragments (which don't even completely honour the theme). There should never be links to these.

5) Disallow:/cms/site/catalogue_file.php

I haven't actually seen these in search results. I included it as a preventive measure, as I saw a potential for the robots either to provide direct links to download content or to try to download the catalogue files themselves. I'm assuming that there will be something similar for galleries as well.

6) Disallow:/cms/*slideshow=1

Looks like you are on top of this one :) .

To see examples of the above, you can try googling "site:lastgamer.net" and paging through the results. That is assuming that Google has not removed them from the index by the time you get a chance to look.

#71249

Chris Graham said

As I was asked to comment ;)

I am always glad when you comment, even when I disagree. You, at least, explain your position, which is much better than many other software products.

Too bad we can't clone you because I realize that being on the forums takes away from the time available for actually working on ocPortal.

Bob

#71250

temp1024 said

Thanks for chiming in Chris.

Here are some search bugs for your consideration (I'll discuss them in terms of the disallow list from one of my previous posts above):

As usual, thanks for taking the time to research this so thoroughly. Your efforts are most definitely helping ocPortal to become an even better product.

Bob