Error Convert HTML-entities to UTF-8 characters

Started by pabloalcorta, June 15, 2019, 01:10:45 PM


pabloalcorta

Hello, I'm following the tutorial at https://wiki.simplemachines.org/smf/UTF-8_Readme. When I run "Convert HTML-entities to UTF-8 characters", I get this error:

Incorrect string value: '\xF0\x9F\x98\xB3
File: /storage/ssd3/905/9716905/public_html/prueba12/Sources/ManageMaintenance.php
Line: 950


I'm running these tests on a separate server.

Thanks.
Theme:Wgame
PHP: Version 7.0.33
SMF:2.0.17

GigaWatt

Could you paste what's around line 950 in ManageMaintenance.php (20 lines above and below line 950)?
"This is really a generic concept about human thinking - when faced with large tasks we're naturally inclined to try to break them down into a bunch of smaller tasks that together make up the whole."

"A 500 error loosely translates to the webserver saying, "WTF?"..."

pabloalcorta

OK, here it is:
        if (empty($max_value))
            continue;

        while ($context['start'] <= $max_value)
        {
            // Retrieve a list of rows that has at least one entity to convert.
            $request = $smcFunc['db_query']('', '
                SELECT {raw:primary_keys}, {raw:columns}
                FROM {db_prefix}{raw:cur_table}
                WHERE {raw:primary_key} BETWEEN {int:start} AND {int:start} + 499
                    AND {raw:like_compare}
                LIMIT 500',
                array(
                    'primary_keys' => implode(', ', $primary_keys),
                    'columns' => implode(', ', $columns),
                    'cur_table' => $cur_table,
                    'primary_key' => $primary_key,
                    'start' => $context['start'],
                    'like_compare' => '(' . implode(' LIKE \'%&#%\' OR ', $columns) . ' LIKE \'%&#%\')',
                )
            );
            while ($row = $smcFunc['db_fetch_assoc']($request))
            {
                $insertion_variables = array();
                $changes = array();
                foreach ($row as $column_name => $column_value)
                    if ($column_name !== $primary_key && strpos($column_value, '&#') !== false)
                    {
                        $changes[] = $column_name . ' = {string:changes_' . $column_name . '}';
                        $insertion_variables['changes_' . $column_name] = preg_replace_callback('~&#(\d{1,7}|x[0-9a-fA-F]{1,6});~', 'fixchar__callback', $column_value);
                    }

                $where = array();
                foreach ($primary_keys as $key)
                {
                    $where[] = $key . ' = {string:where_' . $key . '}';
                    $insertion_variables['where_' . $key] = $row[$key];
                }

                // Update the row.
                if (!empty($changes))
                    $smcFunc['db_query']('', '
                        UPDATE {db_prefix}' . $cur_table . '
                        SET
                            ' . implode(',
                            ', $changes) . '
                        WHERE ' . implode(' AND ', $where),
                        $insertion_variables
                    );
            }
            $smcFunc['db_free_result']($request);
            $context['start'] += 500;

            // After ten seconds interrupt.
            if (time() - $context['start_time'] > 10)
            {
                // Calculate an approximation of the percentage done.
                $context['continue_percent'] = round(100 * ($context['table'] + ($context['start'] / $max_value)) / $context['num_tables'], 1);
                $context['continue_get_data'] = '?action=admin;area=maintain;sa=database;activity=convertentities;table=' . $context['table'] . ';start=' . $context['start'] . ';' . $context['session_var'] . '=' . $context['session_id'];
                return;
            }
        }
        $context['start'] = 0;
    }

    // Make sure all serialized strings are all right.
    require_once($sourcedir . '/Subs-Charset.php');
    fix_serialized_columns();

    // If we're here, we must be done.
    $context['continue_percent'] = 100;
    $context['continue_get_data'] = '?action=admin;area=maintain;sa=database;done=convertentities';
    $context['last_step'] = true;
    $context['continue_countdown'] = -1;
}


thanks

GigaWatt

Just looked it up. \xF0\x9F\x98\xB3 is the UTF-8 byte sequence for an emoji (U+1F633). A text editor shows such characters as empty spaces or unknown symbols if its font can't display emojis. If you haven't made any manual edits to SMF's files, a mod probably made those edits and added the extra characters. It's best to replace that whole section with a copy from a fresh set of files. Overwrite the code you have around line 950 with this code:

    // Update the row.
    if (!empty($changes))
        $smcFunc['db_query']('', '
            UPDATE {db_prefix}' . $cur_table . '
            SET
                ' . implode(',
                ', $changes) . '
            WHERE ' . implode(' AND ', $where),
            $insertion_variables
        );
}
$smcFunc['db_free_result']($request);
$context['start'] += 500;

pabloalcorta

I replaced it with that code, and the same error keeps coming up.

albertlast

You could change the column's character set to utf8mb4;
this should fix the issue.

shawnb61

albertlast is correct.  That byte sequence \xF0\x9F\x98\xB3 is a 4-byte emoji, which a MySQL UTF8 DB cannot store natively. 

So you either convert your DB to UTF8MB4 or you find a way to convert those sequences to htmlentities prior to loading them. 
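
For the second option, here is a rough sketch of what converting those sequences to entities could look like in PHP. The function name encode_four_byte_chars is made up for illustration and is not part of SMF. It assumes the input is valid UTF-8 (preg_replace_callback returns NULL otherwise) and it computes the code point by hand so it also runs on PHP 7.0.

```php
<?php
// Hypothetical helper (not part of SMF): replace every 4-byte UTF-8
// character with a decimal numeric entity such as &#128563;.
function encode_four_byte_chars($text)
{
    return preg_replace_callback(
        // Code points U+10000..U+10FFFF are exactly the ones that need 4 bytes.
        '/[\x{10000}-\x{10FFFF}]/u',
        function ($match) {
            // Decode the 4-byte sequence 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
            $bytes = array_map('ord', str_split($match[0]));
            $code_point = (($bytes[0] & 0x07) << 18)
                | (($bytes[1] & 0x3F) << 12)
                | (($bytes[2] & 0x3F) << 6)
                | ($bytes[3] & 0x3F);
            return '&#' . $code_point . ';';
        },
        $text
    );
}

// The emoji from the error message, next to an ordinary 2-byte character.
echo encode_four_byte_chars("caf\xC3\xA9 \xF0\x9F\x98\xB3"), "\n";
```

The 2-byte é passes through untouched, while the emoji becomes &#128563;, which is plain ASCII and so fits in a 3-byte UTF8 column.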


I *JUST* ran across this last night & remembered this recent thread...
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

shawnb61

For the record, I've been experimenting with doing EXACTLY what pabloalcorta initially reported.  I think this is a bug, and it is easily reproducible.

Background:
  • latin1_swedish_ci actually stores full UTF8 data, even 4-byte UTF8 data, without a problem.  It even displays properly.
  • MySQL's "UTF8" only stores 3-byte UTF8 data, and cannot store 4-byte UTF8 characters.
  • SMF's UTF8 conversion is aware of this, and rather than attempt to store 4-byte UTF8, which will fail, converts the data to htmlentities first.
  • Thus, the SMF UTF8 conversion completes successfully, no matter what data you have, with no data loss.
HOWEVER...  If you now attempt to convert the html entities to UTF8 using the SMF function, bad things happen, depending on whether your DB is in STRICT mode or not:
  • If your DB is in strict mode, conversion to 4-byte fails and you get the "Incorrect string value: '\xF0\x90\x8C\xBC\xF0\x90...' for column 'body' at row 1" message.  Your data is left alone & is OK.
  • If your DB is NOT in strict mode, MySQL will attempt the conversion by inserting - straight from the manual - 'adjusted values'.  'Adjusted values' usually means your message is truncated beyond the "invalid" 4-byte character, and your data is now lost.
SMF 2.0.x under some circumstances sets the sql_mode to '', i.e., NON-strict mode, so there is danger of data loss using the "Convert HTML-entities to UTF8 characters" function. 
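
To see which of the two cases applies, you can read the session's sql_mode (for example with SELECT @@SESSION.sql_mode) and check it for a strict flag. A minimal sketch, assuming you already have that string in hand; the helper name is_strict_sql_mode is hypothetical and not an SMF function:

```php
<?php
// Hypothetical helper: report whether a sql_mode string (as returned by
// SELECT @@SESSION.sql_mode) has strict mode enabled.
function is_strict_sql_mode($sql_mode)
{
    // sql_mode is a comma-separated list of flags; '' means no flags at all.
    $modes = array_map('trim', explode(',', strtoupper($sql_mode)));

    // MySQL treats either of these flags as "strict mode".
    return in_array('STRICT_TRANS_TABLES', $modes, true)
        || in_array('STRICT_ALL_TABLES', $modes, true);
}

var_dump(is_strict_sql_mode(''));
var_dump(is_strict_sql_mode('STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION'));
```

An empty sql_mode reports false here, which is the case where truncation and data loss are possible.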

To fix: We should have the entity-to-UTF8 conversion either leave 4-byte entities alone, or be disabled altogether.  OR, cut 2.0.x over to STRICT mode.  Since 2.1 already uses STRICT mode, data there is safe, but its entity conversion should be looked at as well; it likely produces the same error noted above.
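
As an illustration of the "leave 4-byte entities alone" idea, here is a minimal sketch of a callback that only converts entities up to U+FFFF. It reuses the regex from the ManageMaintenance.php excerpt earlier in this topic, but the function entity_to_utf8_bmp_only is hypothetical; SMF's real fixchar__callback may behave differently, and this is not a tested patch.

```php
<?php
// Hypothetical callback (not SMF's fixchar__callback): convert a numeric
// entity to UTF-8 only if the result fits in at most 3 bytes.
function entity_to_utf8_bmp_only($matches)
{
    // $matches[1] is either decimal digits or 'x' followed by hex digits.
    $raw = $matches[1];
    $num = ($raw[0] === 'x') ? hexdec(substr($raw, 1)) : (int) $raw;

    // Conservative choice: leave the entity as-is for control characters,
    // surrogates, and anything above U+FFFF (which would need 4 bytes).
    if ($num < 0x20 || ($num >= 0xD800 && $num <= 0xDFFF) || $num > 0xFFFF)
        return $matches[0];

    if ($num < 0x80)
        return chr($num);
    if ($num < 0x800)
        return chr(0xC0 | ($num >> 6)) . chr(0x80 | ($num & 0x3F));

    return chr(0xE0 | ($num >> 12))
        . chr(0x80 | (($num >> 6) & 0x3F))
        . chr(0x80 | ($num & 0x3F));
}

// Same pattern as in the excerpt from ManageMaintenance.php.
$pattern = '~&#(\d{1,7}|x[0-9a-fA-F]{1,6});~';
echo preg_replace_callback($pattern, 'entity_to_utf8_bmp_only', 'caf&#233; &#128563; &#x20AC;'), "\n";
```

Here &#233; and &#x20AC; are converted to é and €, while &#128563; stays an entity, so nothing 4-byte ever reaches a UTF8 column.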

I think that 4-byte chars (emojis, the rarer CJK extension ideographs) are at the root of a lot of aborted UTF8 conversions for this very reason.

pabloalcorta

shawnb61, albertlast, thanks. I'm reading up on this and will try next week to see if I can do it. My PHP knowledge is nil, but I'll try to do what you suggest. Thank you.

shawnb61

pabloalcorta -

You do not need to run that task.  If things look good, I would leave things as-is!

pabloalcorta

shawnb61, OK, thanks. I still haven't tried it yet; I'm trying to get organized to run a test.

shawnb61

Ok!  Just to be clear: due to the defect you helped identify above, I do NOT suggest running "convert html entities to utf8 characters". 

Just convert to utf8 & leave the entities alone for now.
