Simple Machines Community Forum

Archived Boards and Threads... => Archived Boards => SMF Feedback and Discussion => Topic started by: princess38 - August 29, 2007, 08:57:42 PM

Title: robot.txt file importance
Posted by: princess38 - August 29, 2007, 08:57:42 PM
OK, I have been reading for a while and found a few things that helped me. For example, I didn't know I should change the meta tags to suit my theme. I also just stumbled on this. I think it has something to do with how the spiders crawl your site, or whether they crawl it at all. This is a validator:


http://www.designerwiz.com/test/robots.txt_validator.htm


From my understanding, the robots.txt file has to be in the root directory in order to be seen. This tool will check whether you have it there. I hope this helps, because I am lost too. It seems I don't have a robots.txt in my root, so now I am wondering where to get one. Any advice on how to handle this will speed things up for me, thanks.
Title: Re: robot.txt file importance
Posted by: Niteblade - August 30, 2007, 08:49:39 AM
Quote from: princess38 - August 29, 2007, 08:57:42 PM
OK, I have been reading for a while and found a few things that helped me. For example, I didn't know I should change the meta tags to suit my theme. I also just stumbled on this. I think it has something to do with how the spiders crawl your site, or whether they crawl it at all. This is a validator:


http://www.designerwiz.com/test/robots.txt_validator.htm


From my understanding, the robots.txt file has to be in the root directory in order to be seen. This tool will check whether you have it there. I hope this helps, because I am lost too. It seems I don't have a robots.txt in my root, so now I am wondering where to get one. Any advice on how to handle this will speed things up for me, thanks.

That link is tossing up an error.

Anyhow, robots.txt is important for keeping search engines from indexing duplicate content and penalizing your website for it. Here's mine, btw. It's still a work in progress.


User-agent: Fasterfox
Disallow: /

User-agent: *
Crawl-Delay: 7
Allow: /texas-forum.php
Disallow: /arcade/
Disallow: /attachments/
Disallow: /attachments
Disallow: /avatars/
Disallow: /avatars
Disallow: /blog/
Disallow: /blog
Disallow: /cgi-bin/
Disallow: /cgi-bin
Disallow: /chat/
Disallow: /chat
Disallow: /FCKeditor/
Disallow: /FCKeditor
Allow: *archive.php*
Allow: *index.php?action=forum*
Allow: *index.php?action=tags*
Disallow: /gallery/
Disallow: /gallery
Disallow: /images/
Disallow: /images
Disallow: /Packages/
Disallow: /Packages
Disallow: /permian-mall/
Disallow: /permian-mall
Disallow: /Smileys/
Disallow: /Smileys
Disallow: /Sources/
Disallow: /Sources
Allow: /shopping/
Allow: /shopping
Allow: /shopping/texas-mall.php
Disallow: /Themes/
Disallow: /Themes
Disallow: /tp-downloads/
Disallow: /tp-downloads
Disallow: /tp-images/
Disallow: /tp-images
Disallow: /trap/
Disallow: /trap
Disallow: /wysiwyg/
Disallow: /wysiwyg
Disallow: *apc.php*
Disallow: *ssi_examples.php*
Disallow: *ssi_examples.shtml*
Disallow: *status.php*
Disallow: *Settings.php*
Disallow: *Settings_bak.php*
Disallow: *action=admin*
Disallow: *action=activate*
Disallow: *action=arcade*
Disallow: *action=calendar*
Disallow: *action=collapse*
Disallow: *action=deletemsg*
Disallow: *action=editpoll*
Disallow: *action=gallery*
Disallow: *action=help*
Disallow: *action=helpadmin*
Disallow: *action=lock*
Disallow: *action=login*
Disallow: *action=logout*
Disallow: *action=markasread*
Disallow: *action=mergetopics*
Allow: *action=mall*
Disallow: *action=mlist*
Disallow: *action=modifykarma*
Disallow: *action=movetopic*
Disallow: *action=notify*
Disallow: *action=notifyboard*
Disallow: *action=pm*
Disallow: *action=post*
Disallow: *action=printpage*
Disallow: *action=profile*
Disallow: *action=register*
Disallow: *action=removetopic2*
Disallow: *action=reporttm*
Disallow: *action=search*
Disallow: *action=sendtopic*
Allow: *action=sitemap*
Disallow: *action=splittopics*
Disallow: *action=stats*
Disallow: *action=sticky*
Disallow: *action=tpadmin*
Disallow: *action=tpmod*
Disallow: *action=trackip*
Disallow: *action=unread*
Disallow: *action=unreadreplies*
Disallow: *action=who*
Allow: *index.php/topic*
Allow: *index.php/board*
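
Many of the rules above rely on `*` wildcards. These were not part of the original robots.txt convention; the major search engines honor them, but simpler parsers may treat them literally. Below is a minimal Python sketch of how a wildcard-aware crawler might match a single rule, assuming Google-style semantics (`*` matches any run of characters, a trailing `$` anchors the end, and a rule otherwise matches as a prefix). The helper name `rule_matches` is hypothetical, and the sample URLs are illustrative.

```python
import re


def rule_matches(rule: str, url_path: str) -> bool:
    """Return True if a robots.txt path rule matches url_path.

    Assumes Google-style wildcards: '*' matches any characters and a
    trailing '$' anchors the end; otherwise the rule is a prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal chunk, then join the chunks with '.*' for each '*'.
    pattern = ".*".join(re.escape(chunk) for chunk in rule.split("*"))
    if anchored:
        pattern += "$"
    # re.match anchors at the start, which gives the prefix behaviour.
    return re.match(pattern, url_path) is not None


print(rule_matches("*action=admin*", "/index.php?action=admin;area=index"))
print(rule_matches("*action=admin*", "/index.php?topic=12.0"))
print(rule_matches("/Sources", "/Sources/Load.php"))
```

Under these assumptions a leading or trailing `*` adds little beyond "match anywhere in the URL", and a rule such as `/Sources` already covers `/Sources/`.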


And yes, put that in the root of your domain. For example, my site is at http://www.midessa.net, and the file goes at http://www.midessa.net/robots.txt.
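
To make the location rule concrete, here is a small Python sketch that derives the robots.txt address from any page URL on a site. The function name `robots_url` and the `/forum/index.php` path are illustrative assumptions; only the host comes from the post above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; crawlers request /robots.txt at the host root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.midessa.net/forum/index.php?action=forum"))
```

A robots.txt placed in a subdirectory such as `/forum/` would not be found by this lookup, which is why the file belongs in the web root.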
Title: Re: robot.txt file importance
Posted by: princess38 - August 30, 2007, 04:02:58 PM

Just wondering why most people don't use this? It is supposed to cover all the action commands.



User-agent: *
Disallow: /index.php?action=
located in root /


Anyway, I added this and changed the head section of my index template to this:
<head>

<meta http-equiv="Content-Type" content="text/html; charset=', $context['character_set'], '" />
   
   <meta name="keywords" content="PHP, MySQL, bulletin, board, community, open, source, smf, simple, machines, dating,webmasters,cams,payper view,chat,friend finder,ebony, Mature" />
   <meta name="Description" CONTENT="Forum on sexy black, ebony, men and BBW, women to chat or post about adult, mature, topics, or watch pay per view, pay sites, movies or videos, and cam 2 cam with women with big booties.">
   <meta name="Author" CONTENT="mysite.com">
<META name="ROBOTS" content="NO INDEX"/>

   <script language="JavaScript" type="text/javascript" src="', $settings['default_theme_url'], '/script.js?rc2p"></script>

Three questions:

1. Did I do it correctly?

2. Are there any security settings I should add to the robots.txt file?


3. Do I now need to submit a sitemap to Google?
Title: Re: robot.txt file importance
Posted by: RustyBarnacle - October 16, 2007, 04:06:28 PM
User-agent: *
Disallow: /index.php?
located in root /

Would that work to disallow the whole forum?
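
A rough way to reason about this question is plain prefix matching, which is how the original robots.txt convention treats a Disallow value. The Python sketch below assumes the forum sits at the site root; `blocked_by_prefix` is a hypothetical helper and the sample paths are illustrative. Real crawlers may differ in details.

```python
def blocked_by_prefix(rule: str, url_path: str) -> bool:
    """Plain prefix match: the rule blocks any path-plus-query starting with it."""
    return url_path.startswith(rule)


rule = "/index.php?"
samples = [
    "/index.php?topic=12.0",
    "/index.php?action=profile",
    "/index.php",
    "/index.php/topic,12.0.html",
]
for path in samples:
    state = "blocked" if blocked_by_prefix(rule, path) else "allowed"
    print(path, "->", state)
```

Under that assumption the rule covers every query-string URL, but not the bare `/index.php` front page or search-engine-friendly URLs of the `/index.php/topic` form.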
Title: Re: robot.txt file importance
Posted by: tumbleweed - October 17, 2007, 01:37:50 AM
This code will disallow every crawler from every directory in your webroot:
User-agent: *
Disallow: /

With this code you can disallow every crawler from a specified directory and/or a specific page, and still allow them to crawl other portions of your site:

User-agent: *
Disallow: /cgi-bin/
Disallow: /your directory name/
Disallow: /your directory name/blank.htm
Title: Re: robot.txt file importance
Posted by: Heterman - October 17, 2007, 02:43:11 AM
Is there a good reference guide on any default robots.txt that gets installed along with the board? I seem to recall something along those lines, or was that just for keeping them out of any "closed doors"?
Title: Re: robot.txt file importance
Posted by: Dannii - October 17, 2007, 04:35:04 AM
Quote: Just wondering why most people don't use this? It is supposed to cover all the action commands.
Because we don't want to block all actions, only the ones that wouldn't be useful for search engines to index.
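
The effect of the blanket rule can be checked offline with Python's standard `urllib.robotparser`. This is only a sketch: `example.com` is a placeholder host, the sample actions are taken from the rules posted earlier in the thread, and real crawlers may interpret rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /index.php?action=",
])

base = "http://example.com"  # placeholder host
for query in ["action=profile", "action=sitemap", "topic=12.0"]:
    url = f"{base}/index.php?{query}"
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(url, "->", verdict)
```

The blanket rule shuts out `action=sitemap` along with `action=profile`, so an action you would like crawled gets blocked together with the ones you would not.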
Title: Re: robot.txt file importance
Posted by: metallica48423 - October 17, 2007, 03:43:56 PM
Quote: Is there a good reference guide on this pertaining to any default ROBOTS.txt installed just by installing the board?  I seem to recall something along those lines, or was that just for keeping them out of any "closed doors"?

SMF doesn't come with a robots.txt. It uses a noindex meta tag, and only on pages containing what would be duplicate content.
Title: Re: robot.txt file importance
Posted by: karlbenson - October 17, 2007, 06:16:59 PM
Indeed.
IMO it's better to use meta robots noindex rather than robots.txt for most things,
especially to prevent spidering of
- duplicated material
- sensitive/security areas.

Why (duplicate content)?
Because with the meta robots noindex tag, search engines will still follow ALL links on the page and then spider them.
If you use robots.txt, they won't spider the page at all.
Say a spider visits user 189's profile and you have chosen to block profiles. The results are:
With meta robots: not indexed, 8 new links obtained.
With robots.txt: not indexed, not followed, 0 links obtained.
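
The two outcomes above hinge on which directives a page's meta robots tag carries. Here is a minimal Python sketch, using only the standard `html.parser` module, that reads those directives from a page. The class name `RobotsMetaParser` and the sample markup are illustrative assumptions, not SMF code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content value is a comma-separated list of directives.
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


page = '<html><head><meta name="robots" content="noindex,follow" /></head></html>'
meta_parser = RobotsMetaParser()
meta_parser.feed(page)
print(sorted(meta_parser.directives))
```

A crawler has to fetch the page to see this tag at all, which is why the meta approach still lets it discover the links on the page.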

IMO it is generally NOT advisable, for security reasons, to try to block spidering of 'sensitive' areas (e.g. admin areas) with a robots.txt.
Why (for sensitive areas)?
Because 'hackers' can and do write scripts to 'scan' millions of sites' robots.txt files looking for sensitive areas like 'admin.php', and what you've done is lead them straight to your admin area. Should they know of a vulnerability or wish to test for one, they can then do so.
For those pages, always use <meta name="robots" content="noindex,nofollow" /> instead.
The same thing that makes robots.txt useful to search engines makes it useful to hackers.
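
Since anyone can fetch a robots.txt, it can help to review your own file for entries that advertise sensitive locations. The Python sketch below is a rough self-audit: the keyword list is an assumption to adapt to your site, `find_revealing_rules` is a hypothetical helper, and the sample rules are taken from the file posted earlier in the thread.

```python
SENSITIVE_HINTS = ("admin", "settings", "backup", "login")  # assumed keywords


def find_revealing_rules(robots_txt: str) -> list:
    """Return Disallow paths that contain a sensitive-looking keyword."""
    flagged = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        path = value.strip()
        if any(hint in path.lower() for hint in SENSITIVE_HINTS):
            flagged.append(path)
    return flagged


sample = """\
User-agent: *
Disallow: /Sources/
Disallow: *Settings_bak.php*
Disallow: *action=admin*
"""
print(find_revealing_rules(sample))
```

Anything this flags is a candidate for the meta tag approach or for server-side access control rather than a robots.txt entry.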

The only things you really should disallow with robots.txt are
- folders that have no visual functionality other than as folders of files, e.g. images/, sources/, etc.
- old folders/areas left over from a site redesign, e.g. /oldsite/ (to prevent lots of 404 errors in your server error log)
- any bots you want to control/disallow

Even if you don't have anything to put in a robots.txt, you should still have a blank one.
Why? Because you will get errors in your logs as search engines constantly try to fetch one. It will also eat unnecessary bandwidth serving them your error pages or landing pages as opposed to a 0 KB file.

robots.txt is easy, but it isn't necessarily the best thing for security or SEO purposes.

So for SMF it would be better to find where <meta name="robots" content="noindex" /> is added, and make it say <meta name="robots" content="noindex,follow" />
Also edit WHEN it is shown, so that it appears in more areas you don't want indexed but where you still want the SEO benefit.