robot.txt file importance

Started by princess38, August 29, 2007, 08:57:42 PM

princess38

OK, I have been reading for a while and found a few things that helped me; for example, I didn't know to change the meta tags to suit my theme. I also just stumbled on this. I think it has something to do with how the spiders crawl your site, or whether they crawl it at all. This is a validator:


http://www.designerwiz.com/test/robots.txt_validator.htm


From my understanding, the robots.txt file has to be in the root directory in order to be seen. This tool will check whether you have it there. Hope this helps, because I am lost too. It seems I don't have a robots.txt in my root, so now I am wondering where to get one. Any advice on how to handle this will speed things up for me, thanks.
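
For anyone who would rather check this without the validator, here is a minimal Python sketch. It assumes nothing beyond the standard library; `robots_url` and `has_robots_txt` are made-up helper names, and the forum URL in the usage note is only an illustration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Crawlers only ask for the file at the root of the host,
    # never inside a subdirectory such as /forum/.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def has_robots_txt(page_url: str, timeout: float = 10.0) -> bool:
    """Return True if the site answers a request for /robots.txt with 200."""
    try:
        with urlopen(robots_url(page_url), timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Missing file (HTTP error), malformed URL, or a network problem.
        return False
```

Calling `robots_url("http://www.midessa.net/forum/index.php")` gives the root location that crawlers request, and `has_robots_txt(...)` then tells you whether anything is served there.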

Niteblade

That link is tossing up an error.

Anyhow, robots.txt is important for keeping search engines from indexing duplicate content on your website and penalizing you for it. Here's mine, btw. It's still a work in progress.


User-agent: Fasterfox
Disallow: /

# Keep every rule for "*" in one group; some parsers ignore a second "User-agent: *" block.
User-agent: *
Crawl-Delay: 7
Allow: /texas-forum.php
Disallow: /arcade/
Disallow: /attachments/
Disallow: /attachments
Disallow: /avatars/
Disallow: /avatars
Disallow: /blog/
Disallow: /blog
Disallow: /cgi-bin/
Disallow: /cgi-bin
Disallow: /chat/
Disallow: /chat
Disallow: /FCKeditor/
Disallow: /FCKeditor
Allow: *archive.php*
Allow: *index.php?action=forum*
Allow: *index.php?action=tags*
Disallow: /gallery/
Disallow: /gallery
Disallow: /images/
Disallow: /images
Disallow: /Packages/
Disallow: /Packages
Disallow: /permian-mall/
Disallow: /permian-mall
Disallow: /Smileys/
Disallow: /Smileys
Disallow: /Sources/
Disallow: /Sources
Allow: /shopping/
Allow: /shopping
Allow: /shopping/texas-mall.php
Disallow: /Themes/
Disallow: /Themes
Disallow: /tp-downloads/
Disallow: /tp-downloads
Disallow: /tp-images/
Disallow: /tp-images
Disallow: /trap/
Disallow: /trap
Disallow: /wysiwyg/
Disallow: /wysiwyg
Disallow: *apc.php*
Disallow: *ssi_examples.php*
Disallow: *ssi_examples.shtml*
Disallow: *status.php*
Disallow: *Settings.php*
Disallow: *Settings_bak.php*
Disallow: *action=admin*
Disallow: *action=activate*
Disallow: *action=arcade*
Disallow: *action=calendar*
Disallow: *action=collapse*
Disallow: *action=deletemsg*
Disallow: *action=editpoll*
Disallow: *action=gallery*
Disallow: *action=help*
Disallow: *action=helpadmin*
Disallow: *action=lock*
Disallow: *action=login*
Disallow: *action=logout*
Disallow: *action=markasread*
Disallow: *action=mergetopics*
Allow: *action=mall*
Disallow: *action=mlist*
Disallow: *action=modifykarma*
Disallow: *action=movetopic*
Disallow: *action=notify*
Disallow: *action=notifyboard*
Disallow: *action=pm*
Disallow: *action=post*
Disallow: *action=printpage*
Disallow: *action=profile*
Disallow: *action=register*
Disallow: *action=removetopic2*
Disallow: *action=reporttm*
Disallow: *action=search*
Disallow: *action=sendtopic*
Allow: *action=sitemap*
Disallow: *action=splittopics*
Disallow: *action=stats*
Disallow: *action=sticky*
Disallow: *action=tpadmin*
Disallow: *action=tpmod*
Disallow: *action=trackip*
Disallow: *action=unread*
Disallow: *action=unreadreplies*
Disallow: *action=who*
Allow: *index.php/topic*
Allow: *index.php/board*


And yes, put that in the root of your domain -- for example, my site is at http://www.midessa.net, and the file goes at http://www.midessa.net/robots.txt.
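
One thing worth checking in any robots.txt of this size is that all rules for a given user agent sit in a single group. The sketch below uses Python's standard `urllib.robotparser` to illustrate why; it assumes a recent CPython, where that parser keeps only the first `User-agent: *` group. Other crawlers may merge the groups instead, so treat this as one parser's behavior, not a rule for all of them. `can_crawl`, `ExampleBot` and `game.php` are hypothetical names.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_lines, url, agent="ExampleBot"):
    """Parse robots.txt lines and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Two separate groups for the same wildcard user agent.
split_groups = [
    "User-agent: *",
    "Crawl-delay: 7",
    "",
    "User-agent: *",
    "Disallow: /arcade/",
]

# The same directives kept together in one group.
merged_group = [
    "User-agent: *",
    "Crawl-delay: 7",
    "Disallow: /arcade/",
]

url = "http://www.midessa.net/arcade/game.php"  # hypothetical page
print("split groups, may crawl:", can_crawl(split_groups, url))
print("merged group, may crawl:", can_crawl(merged_group, url))
```

With the split version this parser never applies the `Disallow` line, so the arcade page counts as crawlable; with the merged version it is blocked.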

princess38


Just wondering why most people don't use this? This is supposed to cover all the action commands.



User-agent: *
Disallow: /index.php?action=
located in root /


Anyway, I added this and changed the head section of my index template to this:
<head>

<meta http-equiv="Content-Type" content="text/html; charset=', $context['character_set'], '" />
   
   <meta name="keywords" content="PHP, MySQL, bulletin, board, community, open, source, smf, simple, machines, dating,webmasters,cams,payper view,chat,friend finder,ebony, Mature" />
   <meta name="Description" CONTENT="Forum on sexy black, ebony, men and BBW, women to chat or post about adult, mature, topics, or watch pay per view, pay sites, movies or videos, and cam 2 cam with women with big booties.">
   <meta name="Author" CONTENT="mysite.com">
<META name="ROBOTS" content="NO INDEX"/>

   <script language="JavaScript" type="text/javascript" src="', $settings['default_theme_url'], '/script.js?rc2p"></script>

Three questions:

1. Did I do it correctly?

2. Are there any security settings I should add to the robots.txt file?


3. Do I now need to submit a sitemap to Google?

RustyBarnacle

User-agent: *
Disallow: /index.php?
located in root /

Would that work to disallow the whole forum?

tumbleweed

Using this code will disallow any crawler from any directory in your webroot:
User-agent: *
Disallow: /

Using this code you can disallow any crawler from a specified directory and/or a specific page, and still allow it to crawl other portions of your site:

User-agent: *
Disallow: /cgi-bin/
Disallow: /your directory name/
Disallow: /your directory name/blank.htm
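
To see the difference between the two files without waiting for a crawler, a rough check with Python's standard `urllib.robotparser` could look like this. It is only a sketch: `allowed` and `ExampleBot` are invented names, and `/docs/` stands in for "your directory name".

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Keeps every compliant crawler out of the whole site.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""

# Keeps crawlers out of one directory and one page only.
BLOCK_SOME = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /docs/blank.htm
"""

for path in ("/index.php", "/cgi-bin/test.cgi", "/docs/blank.htm", "/docs/other.htm"):
    print(path, allowed(BLOCK_EVERYTHING, path), allowed(BLOCK_SOME, path))
```

Rules are matched as path prefixes, which is why `/cgi-bin/` covers everything underneath it while `/docs/blank.htm` affects only that one page.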

Heterman

Is there a good reference guide on this pertaining to any default ROBOTS.txt installed just by installing the board?  I seem to recall something along those lines, or was that just for keeping them out of any "closed doors"?

Dannii

Quote: Just wondering why most people don't use this? This is supposed to cover all the action commands.
Because we don't want to block all actions, only the ones that wouldn't be useful for search engines to index.

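The effect of the blanket rule can be sketched with Python's standard `urllib.robotparser`. This assumes that parser's prefix matching is representative of how a compliant crawler reads the rule; `blocked` and `ExampleBot` are hypothetical names, and the URLs follow the action names used earlier in the thread.

```python
from urllib.robotparser import RobotFileParser

# The single rule proposed for covering every action.
BLANKET_RULE = [
    "User-agent: *",
    "Disallow: /index.php?action=",
]

parser = RobotFileParser()
parser.parse(BLANKET_RULE)


def blocked(path: str) -> bool:
    """Return True if the blanket rule keeps a crawler away from path."""
    return not parser.can_fetch("ExampleBot", path)


for path in (
    "/index.php?action=login",    # fine to block
    "/index.php?action=sitemap",  # blocked too, although it is useful to crawlers
    "/index.php?topic=1.0",       # not an action URL, so still crawlable
):
    print(path, "blocked" if blocked(path) else "allowed")
```

The rule catches the sitemap action along with the login page, which is the trade-off behind listing the unwanted actions one by one.
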
metallica48423

Quote: Is there a good reference guide on this pertaining to any default ROBOTS.txt installed just by installing the board? I seem to recall something along those lines, or was that just for keeping them out of any "closed doors"?

SMF doesn't come with a robots.txt. It makes use of a noindex tag only on pages containing what would be duplicate content.

karlbenson

Indeed.
IMO it's better to use meta robots noindex rather than robots.txt for most things,
especially to prevent spidering of
- duplicated material
- sensitive/security areas.

Why (duplicate content)?
Because with the meta robots noindex tag, search engines will still follow ALL links on the page and then spider them.
If you use robots.txt, they won't spider the page at all.
Say a spider visits user 189's profile and you have chosen to block profiles. The results are:
With meta robots: not indexed, 8 new links obtained.
With robots.txt: not indexed, not followed, 0 links obtained.

IMO it is generally NOT advisable, for security reasons, to try to block spidering of 'sensitive' areas (e.g. admin areas) with a robots.txt.
Why (for sensitive areas)?
Because 'hackers' can and do write scripts to 'scan' millions of sites' robots.txt files looking for sensitive areas like 'admin.php', and what you've done is lead them straight to your admin area. Should they know of a vulnerability or wish to test for one, they can then do so.
For those areas, always use the meta tag <meta name="robots" content="noindex,nofollow" /> instead.
The same thing that makes robots.txt useful to search engines makes it useful to hackers.
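
To get a feel for what such a scan sees, here is a small Python sketch that lists the Disallow entries in a robots.txt that advertise a sensitive-looking location. The hint words and the name `revealing_rules` are assumptions chosen for illustration, not a complete or authoritative list.

```python
SENSITIVE_HINTS = ("admin", "login", "settings", "backup")  # illustrative only


def revealing_rules(robots_txt: str, hints=SENSITIVE_HINTS):
    """Return Disallow paths that point at a sensitive-looking location."""
    found = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue  # only Disallow lines name places to stay away from
        path = value.strip()
        if any(hint in path.lower() for hint in hints):
            found.append(path)
    return found


sample = """\
User-agent: *
Disallow: /images/
Disallow: /admin.php
Disallow: *action=admin*
"""
print(revealing_rules(sample))
```

Anything this prints is readable by anyone who requests the file, so those entries are candidates for the meta tag approach instead.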

The only things you really should disallow with robots.txt are
- folders that have no visual functionality other than as folders of files eg images/ sources/ etc
- old folders/areas left over from a site redesign, e.g. /oldsite/ (to prevent lots of 404 errors in your server error log)
- any bots you want to control/disallow

Even if you don't have anything to put in a robots.txt, you should still have a blank one.
Why? Because you will get errors in your logs as search engines constantly try to fetch one. It will also eat unnecessary bandwidth serving them your error pages or landing pages as opposed to a 0 KB file.

robots.txt is easy, but it isn't necessarily the best thing for security or SEO purposes.

So for SMF it would be better to find where <meta name="robots" content="noindex" /> is added, and make it say <meta name="robots" content="noindex,follow" />
Also edit WHEN it is shown, so that it appears in more of the areas you don't want indexed but from which you still want the SEO benefit.
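
How a well-behaved spider might act on that tag can be sketched in Python with the standard `html.parser` module. This is a simplified model, not how any particular search engine is implemented; `RobotsMetaParser`, `crawl_decision` and the sample page are made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the robots meta directives and the links found on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def crawl_decision(html: str):
    """Return (index_page, links_to_follow) for one fetched page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    index_page = "noindex" not in parser.directives
    # "nofollow" means the links on this page are not queued for crawling.
    links = [] if "nofollow" in parser.directives else parser.links
    return index_page, links


profile_page = """
<html><head>
<meta name="robots" content="noindex,follow" />
</head><body>
<a href="/index.php?topic=1.0">Topic one</a>
<a href="/index.php?topic=2.0">Topic two</a>
</body></html>
"""
print(crawl_decision(profile_page))
```

With `noindex,follow` the page itself stays out of the index while both topic links are still handed on for crawling; swapping in `noindex,nofollow` drops the links as well.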
