Soft hyphen paste in editor is not correctly handled

Started by Kantis, December 30, 2023, 07:39:26 PM

Kantis

If I copy and paste text from a website that uses soft hyphens, it renders incorrectly.

Example web site: https://www.hs.fi/kotimaa/art-2000009953864.html

If I copy and paste a word containing a soft hyphen, it looks like this in the editor:

parveke�turma

A closer look reveals that this is a UTF-8 encoded soft hyphen:

00000000  70 61 72 76 65 6b 65 c2  ad 74 75 72 6d 61 0d 0a  |parveke..turma..|

Is this a bug or a misconfiguration on my site?
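
The dump can be double-checked with a few lines of Python. This is only a sketch for inspecting the bytes shown above (minus the trailing CR LF); it has nothing to do with SMF's own code, and the variable names are made up for illustration.

```python
import unicodedata

# Bytes copied from the hex dump above, without the trailing 0d 0a.
raw = bytes.fromhex("70 61 72 76 65 6b 65 c2 ad 74 75 72 6d 61")

# c2 ad is the two-byte UTF-8 encoding of U+00AD.
text = raw.decode("utf-8")
soft_hyphen_positions = [i for i, ch in enumerate(text) if ch == "\u00ad"]
visible = text.replace("\u00ad", "")

print(len(raw), len(text))             # 14 13
print(soft_hyphen_positions)           # [7]
print(unicodedata.name(text[7]))       # SOFT HYPHEN
print(unicodedata.category(text[7]))   # Cf
print(visible)                         # parveketurma
```

So the pasted word is 13 code points long, with one invisible format character sitting between "parveke" and "turma".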

Arantor

I know what this is, but I want to know what you're using.

I'm also very curious how you got that soft hyphen there because I couldn't from the page you linked...
Holder of controversial views, all of which my own.


Kantis

Quote from: Arantor on December 30, 2023, 07:58:28 PM
I know what this is, but I want to know what you're using.

Very funny ;D. I mean my forum, it is also 2.1.4.

Quote from: Arantor
I'm also very curious how you got that soft hyphen there because I couldn't from the page you linked...

I just copied and pasted two words from the top, "parveke-turman":

[Attachment not available]

I suppose the author didn't write this word with a hyphen; the publishing software just broke the word with a soft hyphen, which the browser renders. When this is copied and pasted, it is up to the target program how to interpret it. For example, Notepad shows it as in my attachment. But the SMF 2.1.4 editor doesn't like it.


Kantis

Quote from: Arantor on December 30, 2023, 07:58:28 PM
I couldn't from the page you linked

It looks like different browsers render this page differently, BTW. Firefox doesn't show the hyphen, but Chrome does.

Kantis

Also, IMHO copying soft hyphens is something browsers should never do. But for some reason they come through with my browser (Edge on Windows), and then the program being pasted into needs to deal with them somehow.

Kantis

This is probably a better document to test with: https://jkorpela.fi/shytest.html

So just copy and paste text from it into the SMF 2.1.4 editor, and the result should look like this:

[Attachment not available]

According to this PR, this may be a feature, not a bug: https://github.com/SimpleMachines/SMF/pull/7102


Aleksi "Lex" Kilpinen

https://www.compart.com/en/unicode/U+00AD
https://en.wikipedia.org/wiki/Soft_hyphen

I'm not sure what to think of this, but my first thought is that making that character visible is actually the best approach. The other option I could think of is simply stripping it out completely.
It's not a character we use for formatting, it makes searches near impossible, and it is a character that could even be used maliciously if actually left hidden.
It seems like that's also the reasoning behind https://github.com/SimpleMachines/SMF/pull/7102
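
To make the search point concrete, here is a small Python sketch. It is only an illustration of plain substring matching, not how SMF's search actually works, and the variable names are invented for the example.

```python
SOFT_HYPHEN = "\u00ad"

# What ends up stored if the pasted text is kept as-is.
stored = f"parveke{SOFT_HYPHEN}turma"

# What someone would realistically type into a search box.
query = "parveketurma"

# A plain substring match fails because of the invisible character.
found_raw = query in stored

# It only matches once the soft hyphen is taken out of the haystack.
found_without_shy = query in stored.replace(SOFT_HYPHEN, "")

print(found_raw)          # False
print(found_without_shy)  # True
```

The two strings look identical on screen when the soft hyphen is hidden, which is exactly why a hidden one makes searching so unreliable.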

Quote from: Arantor on December 30, 2023, 07:58:28 PM
I know what this is, but I want to know what you're using.

I'm also very curious how you got that soft hyphen there because I couldn't from the page you linked...

Testing the appearance:
- the Ascii hyphen
‐ the Unicode hyphen
dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary dis�cretion�ary.

Un-Googled Chromium Version 120.0.6099.129, SMF source view editor, copy + paste.
Slava
Ukraini!
"Before you allow people access to your forum, especially in an administrative position, you must be aware that that person can seriously damage your forum. Therefore, you should only allow people that you trust, implicitly, to have such access." -Douglas

How you can help SMF

Kantis

Quote from: Aleksi on December 31, 2023, 02:10:01 AM
I'm not sure what to think of this, but my first thought is that turning that character visible is actually the best approach to this. The other one I could think of is simply stripping it away completely.

Yes, after thinking about this more, I also think the current approach is the best. The only confusing part is that the conversion takes place only when writing to the database (or in preview). In the editor the character is not visible initially, which confuses users.

I guess in the older versions (2.0) the soft hyphen just went through like any other character?
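
A rough sketch of the "make it visible" idea, in Python purely for illustration. SMF is written in PHP and this is not its code: the helper name `reveal_soft_hyphens` and the choice of U+FFFD as the visible marker are assumptions made for the example.

```python
SOFT_HYPHEN = "\u00ad"
PLACEHOLDER = "\ufffd"  # assumption: any clearly visible marker would do


def reveal_soft_hyphens(text: str) -> str:
    """Hypothetical helper: swap each soft hyphen for a visible marker."""
    return text.replace(SOFT_HYPHEN, PLACEHOLDER)


# What the editor holds right after the paste (soft hyphens still hidden).
pasted = f"dis{SOFT_HYPHEN}cretion{SOFT_HYPHEN}ary"

# What a save-time or preview-time conversion would produce.
shown = reveal_soft_hyphens(pasted)

print(shown)
print(SOFT_HYPHEN in shown)  # False
```

Because a step like this would run only on save or preview, the text in the editor and the text in the posted message can look different, which matches the confusion described above.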

Aleksi "Lex" Kilpinen

Quote from: Kantis on December 31, 2023, 03:26:55 AM
I guess in the older versions (2.0) soft hyphen just went through as any other character?

Possible, I'm not sure right now, but 2.0 did handle things quite differently.
I'm sure @Sesquipedalian or Arantor could answer in a lot more detail than I can.

EDIT:
Turns out, this is basically a Chromium bug and shouldn't happen when handled correctly https://bugs.chromium.org/p/chromium/issues/detail?id=767950
And it's been around in one way or another for a long time https://bugs.chromium.org/p/chromium/issues/detail?id=40378

Kantis

Quote from: Aleksi on December 31, 2023, 03:41:03 AM
Turns out, this is basically a Chromium bug and shouldn't happen when handled correctly

Firefox behaves the same way. If you load Korpela's test page in Firefox, then copy and paste into Notepad, the hidden hyphens are copied and become visible.

I don't see why browsers copy these to the clipboard. OTOH, I don't know if it's their decision (when using Ctrl+C). But they copy the hyphens even when using right-click copy. Maybe the reason is that in many cases someone really did type these characters, so it is nice to get that information into the clipboard too.

Aleksi "Lex" Kilpinen

There seem to be open reports about the same for FF too, such as https://bugzilla.mozilla.org/show_bug.cgi?id=1295912

I do agree with these reports. It would make sense not to include these characters in the copy operation to begin with, since they are usually designed to be removed or hidden in text.

Arantor

Yeah 2.0 didn't care and just let it through.

I'm personally on the fence, having read that PR, as to the value of this. I'm much more tempted to strip soft hyphens because they should only be there to hint to a browser that "here is a place you can word break safely" and I'm inclined to think browsers shouldn't be copy pasting it in the first place, even if visible.

The reason I think that - despite others taking the view that it should be preserved - is that a soft hyphen is inevitably added to fit the rendering of the text to a certain area on the page, and cross-posting that text to the forum is going to change that render size.

But it's not my call to make, and Sesq clearly had an intent behind his PR to make it function as described (so it's not a bug, because it's working as designed; the question is whether that design is correct).

Kantis

I think the ultimate behavior of the editor is correct (detect the hidden character and make it visible). The only minor issue is that when the character is first pasted into the editor, the hyphen is not visible, so when the user posts, it is confusing to see the "buggy"-looking question mark. Ideally the question mark would be visible in the editor from the first paste. I'm sure there's a good technical explanation for this.

But in any case, this case is closed on my side. I'll just tell my users to be more careful with what they are pasting. That's a good thing to do anyways.

Arantor

Mostly that the sanitisation happens on the server, and doing it at paste-time is fraught with all kinds of peril because browsers don't really want you tampering with copy paste operations.

Sesquipedalian

Stripping out invisible characters is a security problem. Doing so makes it possible to inject malicious code under the right circumstances. The Unicode Standard's security annexes explicitly warn implementers not to do so. SMF follows the recommended practice on this matter.
I promise you nothing.

Sesqu... Sesqui... what?
Sesquipedalian, the best word in the English language.

Arantor

I have very many controversial views on things, this is no different ;)

That's not a statement that I think it should be changed in SMF, though. Merely that I can personally see a case for changing it and there are circumstances I definitely would (in spite of the Unicode recommendations)

Sesquipedalian

<a href="java-script:alert('XSS')">click me</a>
Suppose that hyphen were a soft hyphen and SMF stripped it out.
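
The scenario can be sketched in a few lines of Python. This is a deliberately naive, hypothetical filter and not SMF's actual sanitisation; the function names and the order of operations (check first, strip afterwards) are assumptions chosen to show how the problem could arise.

```python
SOFT_HYPHEN = "\u00ad"


def looks_dangerous(url: str) -> bool:
    """Hypothetical, deliberately naive check for a javascript: URL."""
    return url.strip().lower().startswith("javascript:")


def strip_soft_hyphens(text: str) -> str:
    """The 'just strip it' approach under discussion."""
    return text.replace(SOFT_HYPHEN, "")


# The href from the example above, with a soft hyphen where the hyphen is.
pasted = f"java{SOFT_HYPHEN}script:alert('XSS')"

# Step 1: the filter sees "java<SHY>script:" and lets it through.
passed_filter = not looks_dangerous(pasted)

# Step 2: a later clean-up strips the invisible character...
after_strip = strip_soft_hyphens(pasted)

# ...and the result is now exactly what the filter was meant to block.
now_dangerous = looks_dangerous(after_strip)

print(passed_filter)  # True
print(after_strip)    # javascript:alert('XSS')
print(now_dangerous)  # True
```

Replacing the soft hyphen with a visible marker instead of deleting it leaves the string broken as a URL scheme, so this particular trick would not work.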

Arantor

This is why you need to strip the inline scripting markup and write a sensible CSP into the system...