Getting a lot of 403 errors for "prev_next,next.html" page.

Started by medicMe, July 17, 2013, 05:21:21 PM


medicMe

403 error on this type of file from Apache has been popping up in my logs a lot lately. (Logwatch summary below.)

     /forum/index.php/topic,62.0/prev_next,next.html: 48 Time(s)
     /forum/index.php/topic,55.0/prev_next,next.html: 11 Time(s)
     /forum/index.php/topic,21.0/prev_next,next.html: 7 Time(s)
     /forum/index.php/topic,9.0/prev_next,next.html: 4 Time(s)
     ...


This continues on for a bit since it hits most pages.
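For anyone who wants to reproduce that tally without Logwatch, here is a minimal sketch in Python. It assumes an Apache access log in the usual "combined" format; the function name count_403_paths and the sample lines are made up for illustration and are not taken from the real logs.

```python
import re
from collections import Counter

# Apache "combined" format (assumed):
# ip ident user [time] "METHOD path PROTO" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+'
)

def count_403_paths(lines):
    """Count how many times each requested path was answered with a 403."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "403":
            counts[match.group("path")] += 1
    return counts

# Illustrative sample lines only (hypothetical IPs and timestamps).
SAMPLE = [
    '192.0.2.10 - - [17/Jul/2013:10:00:00 +0000] "GET /forum/index.php/topic,62.0/prev_next,next.html HTTP/1.1" 403 345 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [17/Jul/2013:10:00:05 +0000] "GET /forum/index.php/topic,62.0/prev_next,next.html HTTP/1.1" 403 345 "-" "ExampleBot/1.0"',
    '192.0.2.11 - - [17/Jul/2013:10:01:00 +0000] "GET /forum/index.php/topic,55.0/prev_next,next.html HTTP/1.1" 403 345 "-" "Mozilla/5.0"',
    '192.0.2.12 - - [17/Jul/2013:10:02:00 +0000] "GET /forum/index.php/topic,62.0.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for path, hits in count_403_paths(SAMPLE).most_common():
        print(f"{path}: {hits} Time(s)")
```

On a real server you would pass the open log file instead of SAMPLE; the path to that file depends on the Apache configuration.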

I dug around in the source to see what's linking there, and it appears to be a <link> tag with rel="prev" and rel="next".

Now, I've opted to not show a link to the next topic in a board for users on a topic page (in the body and visible to users that is), so really this <link> tag is either useless, or meant just for spiders/bots to follow.

Would it be safe to just remove the reference? It's not 403 erroring on some threads (presumably because there's a 'next' or 'previous' thread to view.) Would this affect how bots crawl the site?

And furthermore, why is SMF even using rel="next" and "previous" for going to the next topic entirely... shouldn't we be following accepted web practices and use these rel link tags to paginate a single article/topic, not from topic to topic on a board? 

(at least, that's what google suggests: http://googlewebmastercentral.blogspot.ca/2011/09/pagination-with-relnext-and-relprev.html)


Anyway, what am I breaking by removing it from the template?

Or is there a really good reason to keep them and fix the errors another way I'm just not seeing?

MrPhil

Tell your host to shut off the "mod security" in the server. It's not needed by SMF and causes lots of problems. Yours could well be caused by it. If your host swears on his firstborn's head that mod_security is not running, we'll have to look at other things.

medicMe

Quote from: MrPhil on July 17, 2013, 06:25:57 PM
Tell your host to shut off the "mod security" in the server. It's not needed by SMF and causes lots of problems. Yours could well be caused by it. If your host swears on his firstborn's head that mod_security is not running, we'll have to look at other things.

I swear by my first born that mod_security is not running on the domain (I manage the host myself, it's an Amazon EC2 instance I configured from the ground up.)

I did have mod_security installed at one point, and I totally got bombarded with 403 errors for nearly anything that wasn't a direct URL request (and thus, realized that configuring mod_security to work with URL syntax used by SMF would require losing the rest of the hair on my head, so I turned it off entirely for the domain that SMF is hosted on.)

Time passed, and I stopped getting the 403 errors, all except for this one specific prev_next,next.html / prev_next,prev.html type of request.

As well, I know it's not old errors caused by mod_security since logwatch and the logs are referencing this within the last day or two, while I took mod_security off nearly a month ago.

That aside, I've located where to delete the <link> lines generated in the source entirely from the site, and I can't see a reason to keep them, but as per my post I wanted to know if this is a good idea or if I'll just be breaking something and/or making the site harder for search bots to crawl.

EDIT: my first post also suggested that SMF might be using <link rel="prev"> and rel="next" incorrectly since it's generally used to paginate a single article meant to be read together between multiple pages, not to connect different subjects/articles to each other as SMF is currently doing by making rel="prev" and rel="next" connect entirely different topics to each other within the same board.

medicMe

ok, well I removed the <link rel="prev" / "next"> lines.

I found multiple threads and even a few bug reports that talk about the incorrect use of this link tag by SMF... so I feel nothing bad could happen by removing it, and search index bots will follow regular links on their own and probably will traverse the site better without the prev / next <link> tags (since they make the bot assume each 'next' topic is the next page of one article.)

If this blows up in my face, I'll post back here.

I'll wait a day or two before marking the thread solved since I want to monitor the error log and make sure nothing new pops up from my change.


Kindred

Turn off the silly "sef" urls... They don't help and can be troublesome in many cases.
Слaва
Украинi

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

medicMe

Quote from: Kindred on July 18, 2013, 01:20:51 AM
Turn off the silly "sef" urls... They don't help and can be troublesome in many cases.

Perhaps the kind people at SMF would fix the "sef" feature or remove it if that's the case.

Til then, I'm more interested in solutions to the options I've picked provided to me by SMF out of the box.

Kindred

The SEF option is left over from a time when "dynamic" URLs were not parsed nicely by search engines (hence the fake .html on the end).
It no longer matters...

And SEF URLs are not turned on by default... which means that you turned them on.

MrPhil

If mod_security is completely off, I wonder what it is about those <link> tags that could be triggering a 403. And furthermore, why do (apparently) so few people have trouble with them? Could it be some odd combination of settings (SEF, etc.)? I take it that you've gone over all your SMF-related files and checked their ownership and permissions, just in case these links are sending you off to some particular script that you don't have permission to run? That's the only other thing besides mod_security that I know would trigger a 403.

medicMe

Quote from: MrPhil on July 18, 2013, 09:24:40 AM
If mod_security is completely off, I wonder what it is about those <link> tags that could be triggering a 403. And furthermore, why do (apparently) so few people have trouble with them? Could it be some odd combination of settings (SEF, etc.)? I take it that you've gone over all your SMF-related files and checked their ownership and permissions, just in case these links are sending you off to some particular script that you don't have permission to run? That's the only other thing besides mod_security that I know would trigger a 403.

Is there any chance that the user agent could matter?

When I manually use the URLs (copying and pasting the href url from the link tag into my browser) it works as intended and returns a page (and Apache never tosses a fuss as I watch the logs with tail -f ) ... so I'm not sure how people are triggering the error myself, just that the errors were appearing frequently. (I say 'were' since I removed the <link> tags in question from the index.template file.)

Like, could it be a scraper or indexing bot that's hitting them in a way Apache doesn't like? It wouldn't make sense to me that one HTTP GET request would be different from another... but I can't explain why I'm unable to generate the errors myself by hitting the same links.

MrPhil

I can't imagine how the user agent (which browser, spider, etc.) could matter. A 403 means that you've requested an address (URI) from the server that it refuses to satisfy, because either 1) you don't have permission to run that script, or 2) there are certain terms or words in GET or POST data that mod_security doesn't like.

A scraper or poorly written spider might well be doing something wrong with <link rel="xxxx" href="...." />. Those tags belong only in the <head> (have you seen them elsewhere?) and are hints to a spider that this page is part of a sequence of pages that might be better treated as a whole. It looks like SMF is using these tags correctly, but perhaps some spider is misreading them in some way, and generating bad URLs as a result.

SMF has its own "prev_next" URL query string entry with "next" or "prev", but it looks legal to me. Your subject says you're seeing "prev_next,next" rather than "prev_next=next"? Is this only with SEF enabled? It sounds like something is going wrong either in producing the revised URI or in .htaccess restoring the URI. What is the history of the .htaccess file? Could the converter have failed to pick up some terms, and left an incomplete .htaccess? If you shut off SEF, leaving it as ;prev_next=next, does that make a difference? With SEF enabled, does SMF appear to be picking up all other URI terms, e.g.,
/forum/index.php/topic,62.0.html
expands to
/forum/index.php?topic=62.0
but
/forum/index.php/topic,62.0/prev_next,next.html
is expanding to something else? It would be expected to be
/forum/index.php?topic=62.0;prev_next=next

It almost sounds like .htaccess is failing to convert the SEF form back to regular dynamic form. Is this only when prev_next is in the SEF form?
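To make the expected expansion concrete, here is a rough sketch of the translation in Python. It is only an illustration of the mapping shown above, not SMF's actual code; the function name sef_to_dynamic and the assumption that every path segment is a "key,value" pair are mine.

```python
from urllib.parse import urlsplit

def sef_to_dynamic(url, script="index.php"):
    """Turn /forum/index.php/topic,62.0/prev_next,next.html
    into /forum/index.php?topic=62.0;prev_next=next.

    Assumes each segment after the script name is 'key,value'
    and that the trailing '.html' is decoration only.
    """
    path = urlsplit(url).path
    head, sep, tail = path.partition(script + "/")
    if not sep or not tail:
        return url  # nothing after the script name: not a SEF-style URL

    if tail.endswith(".html"):
        tail = tail[: -len(".html")]

    pairs = []
    for segment in tail.split("/"):
        if not segment:
            continue
        key, _, value = segment.partition(",")
        pairs.append(f"{key}={value}" if value else key)

    return f"{head}{script}?{';'.join(pairs)}"

if __name__ == "__main__":
    print(sef_to_dynamic("/forum/index.php/topic,62.0.html"))
    print(sef_to_dynamic("/forum/index.php/topic,62.0/prev_next,next.html"))
```

If the real translation drops or mangles the second segment, that would show up as a difference between this output and what the forum actually receives.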

I'd have to study the subject some more to see if I agree that SMF is misusing <link rel="prev|next">. It should only be a hint to spiders that it's part of a sequence of related pages that should be treated as one. I would also have to see how SMF is using this prev_next term.

MrPhil

After looking at it a bit more, I think SMF is misusing this in a couple of ways. Why do the "previous" and "next" links have the same topic number as this topic? And why does a topic that fits on one page still get prev/next links? Between those two, it's possible that some spiders are getting confused trying to follow links. Is 403 ever used to report an infinite loop? Is there any correlation between the listed 403 errors and topics with just one page (or only with multiple pages)?

I haven't looked into what SMF is using prev_next for, or what exactly the previous and next links are intended to be used for. I wouldn't mind links for the previous and next topic, and if applicable, the previous and next pages in this topic, and maybe even up a level to the board index:
<< previous topic    < previous page       ^ board ^     next page >    next topic >>
The page number selection
1  <  [  3][V]  >  7     Show as single page
could be on the adjacent line.

medicMe

Firstly, MrPhil, thank you for your thoughtful and detailed response. :) I really appreciate it.

Quote from: MrPhil on July 18, 2013, 03:09:57 PM
After looking at it a bit more, I think SMF is misusing this in a couple of ways. Why do the "previous" and "next" links have the same topic number as this topic? And why does a topic that fits on one page still get prev/next links?

To clarify: None of this is about the page number or next / previous links in the body of the forum (the << < 1 2 3 > >> stuff.) I don't want to confuse the two different "prev next" links in this case since they have the same name.

The only time the errors appeared was for the <link rel="prev" href="...  and <link rel="next" href="... links contained within the header of the page (and yes, the link tags were within the header, standard as per the default SMF install in that respect.) Usually these are placed to tell web crawlers that the next page is part of the same subject/article, also known as pagination.

I could be wrong, but when I manually hit up the URL links for prev and next, it would bring me to the next topic entirely. The topics in this case didn't have multiple pages, so it could have just defaulted to the next topic since I was on the only page, but.. this is why I suggested SMF was using the links in the header incorrectly:   Pagination wise, a new topic entirely is not part of the previous post, and since topics change order based on when the last post was made, I thought this might confuse the hell out of indexing bots.

I mean, yes, they are all part of the same "board" so the high level subject matter is roughly the same. But I consider each topic to be its own "article" of sorts and only multiple pages within an article should be linked with prev/next references for the purposes of pagination.

I'm not the only one to have noticed this, as searching this board after showed that a few people have brought it up before, and I think one even made a bug report about it.

--

All that aside: Since removing the <link> tags the error has stopped showing up (as expected), and since the link tags were being used incorrectly by SMF (in my opinion) there's no harm in removing them.

It's been a few days, and nothing has proven broken from this change (not that I expected it..) So effectively this is 'solved' from a 'no longer bothers me' point of view.

But to carry on figuring out why I was getting a 403 error that caused all this:

Yes, I'm using SEF URLs... I enabled this when I started, so I don't really want to turn them off now that google has already indexed plenty of pages based off the current URL scheme.

And yes, all other SMF URI terms are being handled properly under SEF URLs from what I can see.

I can't manually reproduce the 403 errors since when I follow the link from my browser it works fine (this is done by viewing the source, grabbing the href from the <link> tag, and pasting it into a new tab.) That's why I asked if the user agent would matter (I too thought it wouldn't matter, but I didn't know why I wasn't generating the same errors myself when I hit the same links.)

Because I can't manually reproduce it, I'd have to turn SEF off and just wait. I'd rather not do this since the URL structure has already been indexed by google and other search engines and that might muddle things up a bit.

I'm going to pull up the IPs that generated the errors and see if they are bots.. If the only IPs generating them are bots, then maybe the user agent or request type does matter (which would surprise me.) Be back later with this.
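A quick way to do that lookup, sketched in Python: group the 403 hits on prev_next URLs by client IP and user agent. This again assumes the Apache "combined" access log format; the function name offenders and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Apache "combined" format (assumed), capturing ip, path, status and user agent.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def offenders(lines, needle="prev_next"):
    """Count 403 responses for paths containing `needle`, keyed by (ip, user agent)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        if match.group("status") == "403" and needle in match.group("path"):
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts

# Illustrative sample lines only (hypothetical IPs and agents).
SAMPLE = [
    '192.0.2.10 - - [20/Jul/2013:09:00:00 +0000] "GET /forum/index.php/topic,62.0/prev_next,next.html HTTP/1.1" 403 345 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [20/Jul/2013:09:00:09 +0000] "GET /forum/index.php/topic,55.0/prev_next,next.html HTTP/1.1" 403 345 "-" "ExampleBot/1.0"',
    '192.0.2.20 - - [20/Jul/2013:09:05:00 +0000] "GET /forum/index.php/topic,9.0/prev_next,next.html HTTP/1.1" 403 345 "http://example.test/forum/" "Mozilla/5.0 (X11; Linux) Firefox/22.0"',
    '192.0.2.20 - - [20/Jul/2013:09:05:01 +0000] "GET /forum/index.php/topic,9.0.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux) Firefox/22.0"',
]

if __name__ == "__main__":
    for (ip, agent), hits in offenders(SAMPLE).most_common():
        print(f"{hits:4d}  {ip}  {agent}")
```

If ordinary desktop browser agents show up alongside the bots, that points away from a crawler-only explanation.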




medicMe

I just discovered why I couldn't manually reproduce the errors. Ordinary HTTP GET requests (pasting the URL into the address bar) don't trigger it.... modern browsers pre-fetching the next part of the paginated article from the link in the header for "next" and "prev" trigger the error:



Found with firebug after I moved a copy of the install to a development box to continue plugging away at the issue (since I deleted the offending links on the production server as mentioned before.)
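For anyone wanting to poke at this outside the browser, here is a hedged Python sketch of a request that imitates a prefetch. As far as I know, Firefox marks its prefetch requests with an "X-moz: prefetch" header, but treat that header, the user agent string, and the dev host name below as assumptions; I have not confirmed that the header alone is what triggers the 403.

```python
import urllib.error
import urllib.request

def build_prefetch_request(url, user_agent="Mozilla/5.0 (X11; Linux) Firefox/22.0"):
    """Build a GET request that carries the header Firefox is believed to add to prefetches."""
    request = urllib.request.Request(url, method="GET")
    request.add_header("X-moz", "prefetch")       # assumed prefetch marker
    request.add_header("User-Agent", user_agent)  # hypothetical browser string
    return request

def fetch_status(request, timeout=10):
    """Send the request and return the HTTP status code, including error codes like 403."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code

if __name__ == "__main__":
    # Hypothetical development host; replace with the real dev box.
    url = "http://dev.example.test/forum/index.php/topic,62.0/prev_next,next.html"
    print(fetch_status(build_prefetch_request(url)))
```

Comparing the status code with and without the extra header would show whether the server treats prefetches differently from a normal page view.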

This is even more reason for SMF to stop using rel="next" and "prev" link tags in the header as they are currently in use. Not only are they incorrectly implemented now, but some browsers assume they link to the next and previous pages in a single document/article/subject and attempt to pre-load the link so that it may load faster when the user progresses through the article.

SMF should seriously change the code so these reference links are only used to go to the next page in a multiple page thread/topic, and thus, properly do pagination which is what the links exist for in the first place (at least, as per Google.)
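What I mean, as a rough sketch in Python (not SMF's PHP, and not a patch): only emit rel="prev" / rel="next" when the topic really has a previous or next page. The function name pagination_links is made up, and I am assuming the number after the dot in topic,62.0 is the message offset of the page.

```python
from html import escape

def pagination_links(base, topic_id, start, per_page, total_posts):
    """Return <link> tags for the previous/next page of the SAME topic only.

    base        -- script URL, e.g. "/forum/index.php"
    start       -- message offset of the current page (assumed meaning of the '.0')
    per_page    -- messages shown per page
    total_posts -- total messages in the topic
    """
    links = []
    if start - per_page >= 0:
        href = f"{base}/topic,{topic_id}.{start - per_page}.html"
        links.append(f'<link rel="prev" href="{escape(href)}" />')
    if start + per_page < total_posts:
        href = f"{base}/topic,{topic_id}.{start + per_page}.html"
        links.append(f'<link rel="next" href="{escape(href)}" />')
    return links

if __name__ == "__main__":
    # A single-page topic gets no links at all.
    print(pagination_links("/forum/index.php", 62, 0, 15, 10))
    # The middle page of a three-page topic gets both.
    for tag in pagination_links("/forum/index.php", 62, 15, 15, 40):
        print(tag)
```

With that rule a one-page topic emits nothing, so there is nothing for a browser to prefetch and nothing to 403.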

MrPhil

So, browsers are abusing the function of <link rel='prev|next'> by using them to prefetch the next page? (have it cached and ready to go should you ask for it) Is the URL any different than what they would send if you click the "next" link on the page? Are they somehow bypassing .htaccess and trying to directly use that SEF URL? I can't imagine why or even how a browser could cause .htaccess to be bypassed, but maybe there's a way. That would be bad news if an HTTP GET or whatever could tell the server to use the URL directly.

The fact that SMF is incorrectly using <link> for going between topics is just icing on the cake. If I understand the concept, <link> should only be used when a previous or next page exists in this topic. According to Themes/Sources/Display.php, prev_next is supposed to find the previous or next topic in this board. I will report this in the Bugs board.

medicMe

Quote from: MrPhil on July 21, 2013, 11:19:09 AM
The fact that SMF is incorrectly using <link> for going between topics is just icing on the cake. If I understand the concept, <link> should only be used when a previous or next page exists in this topic. According to Themes/Sources/Display.php, prev_next is supposed to find the previous or next topic in this board. I will report this in the Bugs board.

Exactly! That's what I've been trying to communicate since the start of this thread.

If you look in index.template.php, even the PHP comment above the code that generates the links says it's going to be linking to the next topic on the board. That means even if there were multiple pages for a topic, the <link> tags in the header would still skip them and go to the next topic.

medicMe

Quote from: MrPhil on July 21, 2013, 11:19:09 AM
Are they somehow bypassing .htaccess and trying to directly use that SEF URL?

Arguably, they aren't bypassing .htaccess at all, since .htaccess is read by the server upon getting a request, not by the client.. and .htaccess could be used to 403 a request if it was set up to do so.

Though, honestly, I'm confused as to why you brought .htaccess up, since I don't have such a file in my forum's root directory (which would govern the index.php file.)

MrPhil

You don't have an .htaccess file anywhere? What is translating your SEF form .html file into dynamic PHP? Check your hosting file manager and/or FTP to make sure it's showing you files starting with a period (like .htaccess). Note that .htaccess only applies to an Apache ("Linux") server. It will be ignored on a Windows (IIS) server. Could that be why you were getting the 403 error -- that .htaccess (if any) was being ignored, leaving URIs of the form
/forum/index.php/topic,62.0/prev_next,next.html
rather than translated into the dynamic PHP form
/forum/index.php?topic=62.0;prev_next=next
?

I think there is some sort of "mod_rewrite" module available for IIS, but I'm not familiar with it.

medicMe

Quote from: MrPhil on July 21, 2013, 04:48:07 PM
You don't have an .htaccess file anywhere? What is translating your SEF form .html file into dynamic PHP? Check your hosting file manager and/or FTP to make sure it's showing you files starting with a period (like .htaccess). Note that .htaccess only applies to an Apache ("Linux") server. It will be ignored on a Windows (IIS) server. Could that be why you were getting the 403 error -- that .htaccess (if any) was being ignored, leaving URIs of the form
/forum/index.php/topic,62.0/prev_next,next.html
rather than translated into the dynamic PHP form
/forum/index.php?topic=62.0;prev_next=next
?

I think there is some sort of "mod_rewrite" module available for IIS, but I'm not familiar with it.



*shrugs*

It has just worked ever since I turned on SEF.

Never had a .htaccess file there.

And it is SEF URLs, such as index.php/topic... etc..

I could be wrong, but I think the handlers are PHP based, internal to SMF 2.0.4... Besides, ideally you wouldn't want .htaccess to have a large list of URL sequences to look up and translate. Apache isn't the fastest when managing this through .htaccess; PHP is far more efficient for it. (This is the same reason why banning a huge list of IPs through .htaccess, or doing mass redirects through .htaccess, on a popular/busy site, can really slow Apache down.)


MrPhil

*shrug*

If it works, it works, but that would be the first SEF I ever saw that could take /forum/index.php/topic,62.0/prev_next,next.html and have the server know what to do with it (without being first preprocessed by something into a regular URL).

I do see an "error_log" file. Have you looked in there to see what it's reporting as a problem?

Banning IP addresses in .htaccess is still much faster than having SMF ban them in the program.

medicMe

Quote from: MrPhil on July 21, 2013, 06:39:15 PM
*shrug*

If it works, it works, but that would be the first SEF I ever saw that could take /forum/index.php/topic,62.0/prev_next,next.html and have the server know what to do with it (without being first preprocessed by something into a regular URL).

I do see an "error_log" file. Have you looked in there to see what it's reporting as a problem?

Banning IP addresses in .htaccess is still much faster than having SMF ban them in the program.

The error_log file is old, from months ago when I was messing around with the forum before I took it live. It only has 3 lines and they aren't related.

Errors I reported were from Apache's main error log, where all problems are reported now.

Yes, banning IPs would be faster via .htaccess than SMF directly, but only because of the way SMF manages IP bans and that it still serves a full (200 response) page to the SMF-level banned IP.

For IP level bans, I put an entry directly into IPTables to drop the request if the offending party is hitting up the site frequently.

Not to get off track...
