Author Topic: Error Convert HTML-entities to UTF-8 characters  (Read 1566 times)

Offline pabloalcorta

  • Semi-Newbie
  • *
  • Posts: 48
Error Convert HTML-entities to UTF-8 characters
« on: June 15, 2019, 01:10:45 PM »
Hello, I'm following the tutorial at https://wiki.simplemachines.org/smf/UTF-8_Readme. When I run "Convert HTML-entities to UTF-8 characters" I get this error:

Incorrect string value: '\xF0\x9F\x98\xB3
File: /storage/ssd3/905/9716905/public_html/prueba12/Sources/ManageMaintenance.php
Line: 950


I'm running this on a separate server that I use for tests.

thanks
Theme:Boru
PHP: Version 7.0.33
SMF:2.0.15

Offline GigaWatt

  • The Smiley Guy
  • Support Specialist
  • SMF Hero
  • *
  • Posts: 2,187
  • Gender: Male
    • Macedonian electronics forum
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #1 on: June 17, 2019, 08:13:30 AM »
Could you paste what's around line 950 in ManageMaintenance.php (20 lines above and below line 950)?
"This is really a generic concept about human thinking - when faced with large tasks we're naturally inclined to try to break them down into a bunch of smaller tasks that together make up the whole."

"A 500 error loosely translates to the webserver saying, "WTF?"..."

Offline pabloalcorta

  • Semi-Newbie
  • *
  • Posts: 48
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #2 on: June 17, 2019, 01:07:01 PM »
OK, here it is:
Code:
        if (empty($max_value))
            continue;

        while ($context['start'] <= $max_value)
        {
            // Retrieve a list of rows that has at least one entity to convert.
            $request = $smcFunc['db_query']('', '
                SELECT {raw:primary_keys}, {raw:columns}
                FROM {db_prefix}{raw:cur_table}
                WHERE {raw:primary_key} BETWEEN {int:start} AND {int:start} + 499
                    AND {raw:like_compare}
                LIMIT 500',
                array(
                    'primary_keys' => implode(', ', $primary_keys),
                    'columns' => implode(', ', $columns),
                    'cur_table' => $cur_table,
                    'primary_key' => $primary_key,
                    'start' => $context['start'],
                    'like_compare' => '(' . implode(' LIKE \'%&#%\' OR ', $columns) . ' LIKE \'%&#%\')',
                )
            );
            while ($row = $smcFunc['db_fetch_assoc']($request))
            {
                $insertion_variables = array();
                $changes = array();
                foreach ($row as $column_name => $column_value)
                    if ($column_name !== $primary_key && strpos($column_value, '&#') !== false)
                    {
                        $changes[] = $column_name . ' = {string:changes_' . $column_name . '}';
                        $insertion_variables['changes_' . $column_name] = preg_replace_callback('~&#(\d{1,7}|x[0-9a-fA-F]{1,6});~', 'fixchar__callback', $column_value);
                    }

                $where = array();
                foreach ($primary_keys as $key)
                {
                    $where[] = $key . ' = {string:where_' . $key . '}';
                    $insertion_variables['where_' . $key] = $row[$key];
                }

                // Update the row.
                if (!empty($changes))
                    $smcFunc['db_query']('', '
                        UPDATE {db_prefix}' . $cur_table . '
                        SET
                            ' . implode(',
                            ', $changes) . '
                        WHERE ' . implode(' AND ', $where),
                        $insertion_variables
                    );
            }
            $smcFunc['db_free_result']($request);
            $context['start'] += 500;

            // After ten seconds interrupt.
            if (time() - $context['start_time'] > 10)
            {
                // Calculate an approximation of the percentage done.
                $context['continue_percent'] = round(100 * ($context['table'] + ($context['start'] / $max_value)) / $context['num_tables'], 1);
                $context['continue_get_data'] = '?action=admin;area=maintain;sa=database;activity=convertentities;table=' . $context['table'] . ';start=' . $context['start'] . ';' . $context['session_var'] . '=' . $context['session_id'];
                return;
            }
        }
        $context['start'] = 0;
    }

    // Make sure all serialized strings are all right.
    require_once($sourcedir . '/Subs-Charset.php');
    fix_serialized_columns();

    // If we're here, we must be done.
    $context['continue_percent'] = 100;
    $context['continue_get_data'] = '?action=admin;area=maintain;sa=database;done=convertentities';
    $context['last_step'] = true;
    $context['continue_countdown'] = -1;
}

thanks

Offline GigaWatt

  • The Smiley Guy
  • Support Specialist
  • SMF Hero
  • *
  • Posts: 2,187
  • Gender: Male
    • Macedonian electronics forum
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #3 on: June 19, 2019, 04:21:45 AM »
Just looked it up. \xF0 \x9F \x98 \xB3 are the UTF-8 bytes of an emoji character, which a text editor shows as an empty space or an unknown symbol if its font can't display emojis. If you haven't made any manual edits to SMF's files, a mod probably made those edits and added those extra characters. It's best to replace that whole section by copying it from a fresh set of files. Overwrite the code you have around line 950 with this code.

Code:
    // Update the row.
    if (!empty($changes))
        $smcFunc['db_query']('', '
            UPDATE {db_prefix}' . $cur_table . '
            SET
                ' . implode(',
                ', $changes) . '
            WHERE ' . implode(' AND ', $where),
            $insertion_variables
        );
}
$smcFunc['db_free_result']($request);
$context['start'] += 500;

Offline pabloalcorta

  • Semi-Newbie
  • *
  • Posts: 48
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #4 on: June 19, 2019, 10:03:50 PM »
I replaced it with that code and the same error keeps coming up.

Offline albertlast

  • Development Contributor
  • Full Member
  • *
  • Posts: 589
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #5 on: June 26, 2019, 12:50:25 AM »
You could change the column type to utf8mb4;
this should fix the issue.
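
As a rough sketch of what that change could look like, the snippet below only builds the MySQL statement; it does not run it. The helper name, the `smf_messages` table name (which assumes the default `smf_` prefix) and the collation are assumptions for illustration, not something taken from SMF itself.

```php
<?php
// Hypothetical helper: builds (does not execute) a MySQL statement that
// converts one table, including its text columns, to utf8mb4.
function build_utf8mb4_conversion_sql(string $table, string $collation = 'utf8mb4_unicode_ci'): string
{
    // Accept only plain identifiers so the name can be backtick-quoted safely.
    if (!preg_match('~^[A-Za-z0-9_]+$~', $table) || !preg_match('~^utf8mb4_[a-z0-9_]+$~', $collation)) {
        throw new InvalidArgumentException('Unexpected table or collation name');
    }

    return sprintf('ALTER TABLE `%s` CONVERT TO CHARACTER SET utf8mb4 COLLATE %s', $table, $collation);
}

// "smf_messages" is an assumed table name (default "smf_" prefix).
echo build_utf8mb4_conversion_sql('smf_messages'), "\n";
```

Take a backup before running a statement like this on a live table. The connection character set also has to be utf8mb4 for 4-byte characters to get through to the column.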

Offline shawnb61

  • Developer
  • SMF Hero
  • *
  • Posts: 1,522
    • sbulen on GitHub
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #6 on: August 07, 2019, 04:40:46 PM »
albertlast is correct.  That byte sequence \xF0\x9F\x98\xB3 is a 4-byte emoji, which a MySQL UTF8 DB cannot store natively. 

So you either convert your DB to UTF8MB4 or you find a way to convert those sequences to htmlentities prior to loading them. 
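
To illustrate the second option, here is a minimal sketch that turns every 4-byte UTF-8 character into a numeric HTML entity, which a 3-byte MySQL UTF8 column can store. The function name is hypothetical and this is not SMF's own conversion code; it also assumes the input is already valid UTF-8.

```php
<?php
// Hypothetical helper: replaces each 4-byte UTF-8 character (code points
// above U+FFFF) with a decimal HTML entity such as &#128563;.
function four_byte_chars_to_entities(string $text): string
{
    // Without the /u flag PCRE matches byte by byte: a lead byte F0-F4
    // followed by three continuation bytes is one 4-byte character.
    $result = preg_replace_callback(
        '~[\xF0-\xF4][\x80-\xBF]{3}~',
        function (array $match): string {
            $b = array_map('ord', str_split($match[0]));
            // Drop the UTF-8 marker bits and join the 3 + 6 + 6 + 6 payload bits.
            $code_point = (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
            return '&#' . $code_point . ';';
        },
        $text
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result === null ? $text : $result;
}

$emoji = "\xF0\x9F\x98\xB3";                              // the bytes from the error message
echo strlen($emoji), "\n";                                // 4 (bytes)
echo four_byte_chars_to_entities('ok ' . $emoji), "\n";   // ok &#128563;
```

The four bytes decode to code point U+1F633 (128563 in decimal), which is why they do not fit in a character set limited to 3 bytes per character.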


I *JUST* ran across this last night & remembered this recent thread...
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

Offline shawnb61

  • Developer
  • SMF Hero
  • *
  • Posts: 1,522
    • sbulen on GitHub
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #7 on: August 11, 2019, 06:25:12 PM »
For the record, I've been experimenting with doing EXACTLY what pabloalcorta initially reported.  I think this is a bug, and it is easily reproducible. 

Background:
  • latin1_swedish_ci actually stores full UTF8 data, even 4-byte UTF8 data, without a problem.  It even displays properly.
  • MySQL's "UTF8" character set stores at most 3 bytes per character, so it cannot store 4-byte UTF8 characters.
  • SMF's UTF8 conversion is aware of this, and rather than attempt to store 4-byte UTF8, which will fail, converts the data to htmlentities first.
  • Thus, the SMF UTF8 conversion completes successfully, no matter what data you have, with no data loss.
HOWEVER...  If you now attempt to convert the html entities to UTF8 using the SMF function, bad things happen, depending on whether your DB is in STRICT mode or not:
  • If your DB is in strict mode, conversion to 4-byte fails and you get the "Incorrect string value: '\xF0\x90\x8C\xBC\xF0\x90...' for column 'body' at row 1" message.  Your data is left alone & is OK.
  • If your DB is NOT in strict mode, MySQL will attempt the conversion by inserting - straight from the manual - 'adjusted values'.  'Adjusted values' usually means your message is truncated beyond the "invalid" 4-byte character, and your data is now lost.
SMF 2.0.x under some circumstances sets the sql_mode to '', i.e., NON-strict mode, so there is danger of data loss using the "Convert HTML-entities to UTF8 characters" function. 
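
Before running the conversion, it may be worth checking which mode the connection is actually in. The sketch below is a hypothetical helper (not part of SMF) that interprets an sql_mode string, such as the value returned by `SELECT @@SESSION.sql_mode`. Since SMF may set the mode itself, the value would need to come from the same session SMF uses.

```php
<?php
// Hypothetical helper: reports whether an sql_mode value enables strict mode.
function is_strict_sql_mode(string $sql_mode): bool
{
    $modes = array_map('trim', explode(',', strtoupper($sql_mode)));
    // TRADITIONAL is a combination mode that includes the strict flags.
    $strict_flags = array('STRICT_TRANS_TABLES', 'STRICT_ALL_TABLES', 'TRADITIONAL');

    return count(array_intersect($modes, $strict_flags)) > 0;
}

var_dump(is_strict_sql_mode(''));                                           // bool(false)
var_dump(is_strict_sql_mode('STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION')); // bool(true)
```

If this returns false for the session in question, the truncation risk described above applies, so a backup before any conversion is the safer route.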

To fix: we should either have the entity-to-UTF8 conversion leave 4-byte entities alone, or disable it altogether.  Or cut 2.0.x over to STRICT mode.  Since 2.1 already uses STRICT mode, data there is safe, but its entity conversion should be looked at as well; it likely produces the same error noted above. 
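
A minimal sketch of the "leave 4-byte entities alone" idea is below. The callback name is hypothetical and this is not SMF's fixchar__callback; it reuses the entity pattern from the code quoted earlier in the thread. Keeping ASCII entities escaped is a design choice made here so that something like &#60; is not turned back into a raw "<".

```php
<?php
// Hypothetical callback: converts a numeric entity to UTF-8 only when the
// result fits in 2 or 3 bytes; everything else is returned unchanged.
function entity_to_utf8_if_bmp(array $match): string
{
    $code_point = $match[1][0] === 'x' ? hexdec(substr($match[1], 1)) : (int) $match[1];

    // Leave alone: ASCII (stays escaped), code points above U+FFFF
    // (4 bytes in UTF-8), and the surrogate range.
    if ($code_point < 0x80 || $code_point > 0xFFFF || ($code_point >= 0xD800 && $code_point <= 0xDFFF)) {
        return $match[0];
    }

    if ($code_point < 0x800) {
        // Two-byte sequence: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code_point >> 6)) . chr(0x80 | ($code_point & 0x3F));
    }

    // Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
    return chr(0xE0 | ($code_point >> 12))
        . chr(0x80 | (($code_point >> 6) & 0x3F))
        . chr(0x80 | ($code_point & 0x3F));
}

// Same pattern as in the ManageMaintenance.php excerpt posted above.
$pattern = '~&#(\d{1,7}|x[0-9a-fA-F]{1,6});~';
echo preg_replace_callback($pattern, 'entity_to_utf8_if_bmp', 'caf&#233; &#128563;'), "\n";
```

Here &#233; becomes a real "é", while &#128563; stays an entity, so the later UPDATE never hands a 4-byte character to a 3-byte column.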

I think that 4-byte chars (emojis, and the rarer CJK ideographs outside the Basic Multilingual Plane) are at the root of a lot of aborted UTF8 conversions for this very reason.
« Last Edit: August 11, 2019, 09:57:40 PM by shawnb61 »

Offline pabloalcorta

  • Semi-Newbie
  • *
  • Posts: 48
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #8 on: August 16, 2019, 08:06:13 PM »
shawnb61, albertlast, thanks. I'm reading up on this and will try next week to see if I can do it. I have no PHP knowledge, but I'll try to do what you suggest. Thank you.

Offline shawnb61

  • Developer
  • SMF Hero
  • *
  • Posts: 1,522
    • sbulen on GitHub
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #9 on: August 16, 2019, 08:32:32 PM »
pabloalcorta -

You do not need to run that task.  If things look good, I would leave things as-is!

Offline pabloalcorta

  • Semi-Newbie
  • *
  • Posts: 48
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #10 on: August 27, 2019, 11:21:12 AM »
shawnb61, OK, thanks. I haven't tried it yet; I'm still trying to set up a test.

Offline shawnb61

  • Developer
  • SMF Hero
  • *
  • Posts: 1,522
    • sbulen on GitHub
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #11 on: August 27, 2019, 02:56:12 PM »
Ok!  Just to be clear: due to the defect you helped identify above, I do NOT suggest running "convert html entities to utf8 characters". 

Just convert to utf8 & leave the entities alone for now.

Offline pabloalcorta

  • Semi-Newbie
  • *
  • Posts: 48
Re: Error Convert HTML-entities to UTF-8 characters
« Reply #12 on: August 27, 2019, 03:23:06 PM »
ok thanks