SJSB and Canonical

Started by jdi_knght, February 17, 2009, 11:26:22 PM

jdi_knght

I just thought I'd throw this in here, since I'm new to SMF, and have been lucky in getting it to work extremely well with my Joomla site.

One thing that concerned me was that Google would obviously see duplicate content, since my forum was available from 2 places:
hxxp:forums.eyeglassretailerreviews.com [nonactive] (the basic forum)
hxxp:www.eyeglassretailerreviews.com/forums.html [nonactive] (the forum shown on the website through the SJSB wrapper)

It took hours of searching through PHP stuff and trying to learn exactly how to do what I wanted to (I know almost nothing about PHP except what I can figure out by reading templates), but I finally got enough pieces of information to put something together.

Anyway, to get to the point: if you want to use the canonical tag, this is the method I came up with in the end, and it works:

// Always point the canonical link at the wrapped (Joomla) URL.
$canonical = 'http://www.eyeglassretailerreviews.com/forums.html';
if (!empty($_SERVER['QUERY_STRING'])) {
	// Carry the current parameters over, escaped so they can't break out of the href attribute.
	$canonical .= '?' . htmlspecialchars($_SERVER['QUERY_STRING'], ENT_QUOTES);
}
echo '
	<link rel="canonical" href="', $canonical, '" />';


BACK EVERYTHING UP FIRST! Put the code in your index.template.php file under whichever theme directory you use (for example /Themes/default/index.template.php). I put mine in the function template_html_above() section, between a couple of the other <link...> entries, which is where I figure it should go.

It takes whatever parameters are in the current URL and appends them to the URL in the canonical tag. Since the main page has no parameters, the tag is written there without a query string. Obviously use your own URLs. The canonical tag WILL show up in the HTML on both sites, but it should point to your preferred site.
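For anyone who wants to see what the tag ends up looking like before touching the template, here is a minimal standalone sketch of the same idea. The function name canonical_link_tag is made up for this example (it is not part of SMF or SJSB), and the topic=12.0 query string is just a sample value.

```php
<?php
// Hypothetical helper, not part of SMF: builds the canonical <link> tag
// from a base URL and the current query string.
function canonical_link_tag(string $baseUrl, string $queryString): string
{
    $href = $baseUrl;
    if ($queryString !== '') {
        // Keep the page's parameters so each forum page gets its own canonical URL.
        $href .= '?' . $queryString;
    }
    // Escape the whole URL so it is safe inside the href attribute.
    return '<link rel="canonical" href="' . htmlspecialchars($href, ENT_QUOTES) . '" />';
}

$base = 'http://www.eyeglassretailerreviews.com/forums.html';
echo canonical_link_tag($base, 'topic=12.0'), "\n"; // a page with parameters
echo canonical_link_tag($base, ''), "\n";           // the main page, no parameters
```

In the real template you would pass $_SERVER['QUERY_STRING'] as the second argument.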

If there's an easier way to do this (perhaps somebody's made a mod), please let me know.

Anyway, I hope it helps somebody else out. I just switched to SMF yesterday (from phpBB3) and am loving it so far :)

Edit: I should note that this was SMF 2 RC1 and Joomla is using the built-in SEF urls in case it makes a difference.

jdi_knght

#1
Just thought I'd throw in an update since the next part may matter a fair bit. :p

For whatever reason, the forum as shown through SJSB gets a "noindex" tag on the wrapped site (the Joomla version) regardless of the page. If you did the first part without doing this next one, don't panic: from what I understand, if Google couldn't index the wrapped site, it will simply have picked your regular non-wrapped forum and ignored the canonical tag.

In any case, to "cure" this, you need the following in your index.template.php file as well:

// Only emit noindex when the page is served from the standalone forum host.
if ($_SERVER['SERVER_NAME'] === 'forums.eyeglassretailerreviews.com') {
	// Please don't index these Mr Robot.
	if (!empty($context['robot_no_index']))
		echo '
	<meta name="robots" content="noindex" />';
}


The middle part is already in the file! Don't re-add it! You're just wrapping it: add the if line that checks the server name at the top, and the closing curly bracket at the bottom. This makes the noindex tag appear only when the page is served from whatever host you put in there (in this case, hxxp:forums.eyeglassretailerreviews.com [nonactive]). That means the noindex tag will not appear on your SJSB-wrapped version (www.eyeglassretailerreviews.com/forums.html in my case).
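To spell out the logic, the check boils down to two conditions that must both be true. This is only a sketch: should_emit_noindex is a made-up name for illustration, and in the actual template the two inputs come from $_SERVER['SERVER_NAME'] and $context['robot_no_index'].

```php
<?php
// Hypothetical helper, not part of SMF: decides whether the noindex meta tag
// should be printed for the current request.
function should_emit_noindex(string $serverName, bool $robotNoIndex, string $directHost): bool
{
    // Print noindex only on the standalone forum host, and only on pages
    // that SMF itself has flagged as "do not index".
    return $serverName === $directHost && $robotNoIndex;
}

$directHost = 'forums.eyeglassretailerreviews.com';

// Standalone forum, page flagged by SMF: the tag is printed.
var_dump(should_emit_noindex('forums.eyeglassretailerreviews.com', true, $directHost));

// Wrapped site (served through Joomla): the tag is never printed.
var_dump(should_emit_noindex('www.eyeglassretailerreviews.com', true, $directHost));
```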

----------

This creates a minor issue - now it's indexing EVERYTHING on the wrapped site! Now obviously that's better than indexing NOTHING on the wrapped site, but there's one more change you can make to reduce the chance of duplicate content being indexed.

Edit the robots.txt for the Joomla site. It's probably already there, since Joomla includes one with the install - you just have to add to it.

User-agent: Googlebot
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
Disallow: /forums.html*%3B*
Disallow: /forums.html*.msg*
Allow: /forums.html*wap2$
Allow: /forums.html*wap$
Allow: /forums.html*imode$

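You probably already have the 14 Disallow lines for the Joomla directories (although they will be under User-agent: *). Below all that, add the Googlebot group. I repeated everything in it: Googlebot follows only the most specific group that matches it, so once a Googlebot group exists it no longer reads the rules under User-agent: *.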

This uses wildcards to keep certain things from being indexed. It's not as perfect as the built-in noindex, but it's close. The %3B means a semi-colon ( ; ), since most stuff that would end up as a duplicate in SMF has a semi-colon in the URL. This includes search results, viewing "new posts", etc. The .msg part is to keep single messages from being indexed. Don't get mixed up: when viewing a thread, you don't get the semi-colon. But if you view a single post (for example, by clicking on the title of this post), you get a .msg in the URL, and that's duplicate content. Yuck!

The final 3 lines let Google go through the WAP, WAP2, and IMODE versions of the site. These are necessary because just about every URL in those versions has a semi-colon in it. The lines explicitly allow Googlebot to index those pages (if you leave out the last 3 lines, the WAP versions of your site won't be indexed).
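If you want a rough feel for how the five forum rules interact before testing them in Webmaster Tools, the sketch below approximates the matching. It is not Google's actual parser. It assumes the precedence Google documents for its own crawler (the longest matching pattern wins, and a tie goes to Allow), and the function names and sample URLs are made up for this example.

```php
<?php
// Turn a robots.txt path pattern into a regex: "*" matches any run of
// characters, and a trailing "$" anchors the pattern to the end of the URL.
function robots_pattern_to_regex(string $pattern): string
{
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    $parts = array_map(function ($part) {
        return preg_quote($part, '#');
    }, explode('*', $pattern));
    return '#^' . implode('.*', $parts) . ($anchored ? '$' : '') . '#';
}

// Assumed precedence: the longest matching pattern wins, ties go to Allow,
// and a URL that matches no rule at all is allowed.
function robots_is_allowed(string $path, array $rules): bool
{
    $bestLength = -1;
    $allowed = true;
    foreach ($rules as [$directive, $pattern]) {
        if (!preg_match(robots_pattern_to_regex($pattern), $path)) {
            continue;
        }
        $length = strlen($pattern);
        $isAllow = ($directive === 'Allow');
        if ($length > $bestLength || ($length === $bestLength && $isAllow)) {
            $bestLength = $length;
            $allowed = $isAllow;
        }
    }
    return $allowed;
}

$rules = [
    ['Disallow', '/forums.html*%3B*'],
    ['Disallow', '/forums.html*.msg*'],
    ['Allow', '/forums.html*wap2$'],
    ['Allow', '/forums.html*wap$'],
    ['Allow', '/forums.html*imode$'],
];

// Sample URLs, invented for illustration.
var_dump(robots_is_allowed('/forums.html?topic=12.0', $rules));               // plain thread view
var_dump(robots_is_allowed('/forums.html?action=search%3Badvanced', $rules)); // has %3B
var_dump(robots_is_allowed('/forums.html?topic=12.msg34', $rules));           // single message
var_dump(robots_is_allowed('/forums.html?topic=12.0%3Bwap2', $rules));        // WAP2 version
```

Under those assumptions the thread view and the WAP2 URL come out allowed, while the search and single-message URLs are blocked. Still check your real URLs in Webmaster Tools, since this only mimics the matching.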

A final thing to note is that if you use AdSense, you'll also want to create a User-agent group similar to the above, but with the following changes:
User-agent: Mediapartners-Google
----same middle stuff as above (or already in your robots.txt)------
Allow: /forums.html*%3B*
Allow: /forums.html*.msg*
Allow: /forums.html*wap2$
Allow: /forums.html*wap$
Allow: /forums.html*imode$

It would also be a good idea to do the same for:
User-agent: Adsbot-Google
---middle stuff again---
Allow: /forums.html*%3B*
Allow: /forums.html*.msg*
Allow: /forums.html*wap2$
Allow: /forums.html*wap$
Allow: /forums.html*imode$


The reason for these is that you want AdSense to be able to serve ads without those limitations. If you don't do this, any duplicate pages that are blocked from indexing also won't have ads showing (or will have default ads showing). AdSense doesn't care about duplicate content, and you WANT ads to display on all content, duplicate or not, so there's no reason not to do this.
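Since the ad crawler groups repeat the same "middle stuff", one way to avoid copy and paste mistakes is to generate them. This is just a sketch of that idea: render_robots_group is a made-up helper (nothing in Joomla or SMF provides it), and the directory list is shortened here, so substitute the full list from your own robots.txt.

```php
<?php
// Hypothetical helper: renders one robots.txt group for a single user-agent.
function render_robots_group(string $agent, array $sharedDisallows, array $forumRules): string
{
    $lines = ['User-agent: ' . $agent];
    foreach ($sharedDisallows as $path) {
        $lines[] = 'Disallow: ' . $path;
    }
    foreach ($forumRules as [$directive, $pattern]) {
        $lines[] = $directive . ': ' . $pattern;
    }
    return implode("\n", $lines) . "\n";
}

// Shortened for the example; use every directory from your own robots.txt.
$shared = ['/administrator/', '/cache/', '/components/'];

// The ad crawlers get Allow for everything the Googlebot group blocks.
$adRules = [
    ['Allow', '/forums.html*%3B*'],
    ['Allow', '/forums.html*.msg*'],
    ['Allow', '/forums.html*wap2$'],
    ['Allow', '/forums.html*wap$'],
    ['Allow', '/forums.html*imode$'],
];

foreach (['Mediapartners-Google', 'Adsbot-Google'] as $agent) {
    echo render_robots_group($agent, $shared, $adRules), "\n";
}
```

Paste the output below the Googlebot group in robots.txt.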


--------------------

A few notes:
1) Yes, I know it's a lot of work. I haven't found a simpler way to do it though.
2) I checked the robots.txt rules in Google's Webmaster Tools, and yes, they should work (test a few URLs with it in Webmaster Tools and see for yourself).
3) Do NOT use wildcards in the default User-agent ( * ). Not all search engines follow wildcards and it may mess stuff up. Google does accept those wildcards - you can test in Webmaster Tools. I've heard that Yahoo allows wildcards, and I'm not sure about MSN, but if you choose to set up robots.txt rules for those (or other search engines), MANUALLY set each user-agent - don't just use *, and make sure you read up beforehand to ensure they'll accept wildcards in that fashion.
4) Make sure you back up all your files first, particularly index.template.php. Test everything too (yes, use Webmaster Tools even though I did it already, to make sure nothing funky's going on).
5) I ended up changing my own mind about which side I wanted it to index on my live site, so before anyone checks it and says "hey all your canonicals are pointing to the other site!", don't worry, it's intended ;)  The above write-ups still apply to pointing everything to your SJSB-wrapped site.
6) Originally I tried to find where in the template files it was causing noindex to show on ALL the wrapped pages. If someone knows specifically where, it will probably be a MUCH easier fix. I looked for hours and couldn't find it though, hence all these steps.
7) Just as a note (before someone mentions it), Joomla usually has a "follow, index" tag on all the pages. Problem is that you *also* get the "noindex" tag in there, and search engines are free to follow whatever they want, hence why the above fixes/changes were put in.
8) If you do NOT tell SJSB to carry the META tags over from SMF, none of this will be necessary (although there are a buttload of other issues if you don't carry the METAs over, in particular a broken theme).

If anyone has comments or recommendations (or a way to do things better!) please let me know :)
