We’ve had prepared statements for years, but the WordPress ecosystem still relies on mysqli_real_escape_string() and it’s scary.
Can you find the security issue in this, deceivingly simple, code snippet?
$escaped = mysqli_real_escape_string(
$connection,
$_POST['new_content']
);
$sql = "UPDATE wp_posts SET wp_content='$escaped_string'";
It’s related to text encoding. Adding backslashes to string can corrupt the data at best, and open up a SQL injection vulnerability at worst. Here’s how.$_POST['new_content']
Say I typed this in the form and submitted it as new_content:
"成功
Depending on text encoding nuances, my browser can submit these four characters using different byte sequences:
- In UTF-8, it’s
0x22 0x48 0x54 0x68 0x41 0x56 0x89
- In Big5 (used for traditional chinese), it’s
.0xA6 0xA8 0xA5 0x5C0x22
The MySQL connection also has a text encoding – the database must know how to interpret the byte sequences received from the client. It’s called “session encoding” and mysqli_real_escape_string() behavior depends on its value:
- For UTF-8, all
0x22bytes are treated as double quotes and all0x5Cbytes are treated as backslashes. - For Big5, it’s more nuanced. A backslash is represented as
0x5Cand will be prefixed with another backslash. However, the multibyte sequence0xA5 0x5Cstands for the 功 character and the0x5Cbyte won’t be prefixed with a backslash in that context.
Depending on the combination of these two encodings, we’ll get different outputs from mysqli_real_escape_string().
When both encodings match, we get the boring, expected outcome:
UTF-8 string, UTF-8 MySQL connection
// mysqli_real_escape_string('"成功') as bytes
5C 22 48 54 68 41 56 89
// UTF-8 text representation
\"成功
Big5 string, Big5 MySQL connection
// mysqli_real_escape_string('"成功') as bytes
5C 22 A6 A8 A5 5C
// Big5 text representation
\"成功
However, things get spicy when the encodings are mismatched.
With UTF-8 string and Big5 MySQL connection we’re corrupting the data:
// mysqli_real_escape_string('"成功') as bytes
5C 22 5C E6 88 90 5C E5 8A 9F
// Big5 text representation
\"\���\���
With Big5 string and UTF-8 MySQL connection we’re in a dangerous territory:
// mysqli_real_escape_string('"成功') as bytes
5C 22 A6 A8 A5 5C 5C
// UTF-8 text representation
\"���\\// Big5 text representation
\"成功\
If we run our UPDATE with a UTF-8 session encoding, we’ll insert corrupted data. If we run it with a Big5 session encoding, we’re creating a SQL injection vector.
Yikes!
So – what to do?
Handling mismatched encodings securely
We’re looking at two problems here:
- SQL injection and data integrity
- Our ability to use the data
Let’s discuss them separately
Solving for SQL injection and data integrity
We just can’t reliably modify the string without knowing its encoding. We must, therefore, use the string bytes as we receive them – without any additional backslashes.
Prepared statements are one well-established way of doing just that. The SQL query and the data are sent to the database separately and the data is processed straight from the memory buffer. There’s no escaping step. Unfortunately, $wpdb doesn’t support them. Even the method called prepare() internally relies on mysqli_real_escape_string(). Migrating $wpdb to prepared statements would solve the problem, but it’s difficult for backwards compatibility reasons.
What we can do with $wpdb is encode the input string in a non-exploitable format, e.g. using bin2hex():
$escaped_string = bin2hex($_POST['new_content']);
$sql = "UPDATE wp_posts SET wp_content=UNHEX('$escaped_string')";
This has two advantages over mysqli_real_escape_string():
- The database receives the exact byte buffer provided by user. We don’t make any assumptions about the encoding.
- No backslashing is needed.
bin2hex()only returns characters from the[0-9A-F]range.
The downside is the additional processing to hex/unhex the string. It might still be faster than mysqli_real_escape_string() which needs a roundtrip to the MySQL server to determine the encoding, but I haven’t actually benchmarked them.
Solving for our ability to use the data
Unfortunately, by inserting our Big5 data into a UTF-8 column, we’ve only moved the problem to another place in our application.
First, that data is not searchable via UTF-8 queries. SELECT * FROM wp_posts WHERE wp_content='"成功' would find 0 rows. The UTF-8 and Big5 characters are equivalent, but the bytes stored in the database are different.
Second, if we SELECT the data and display on a UTF-8 HTML page, the users will see the replacement character � at best, and completely random characters at worst.
Ultimately, the only way to safely insert and consume string data in WordPress is to:
- Ensure every input string comes in as UTF-8
- Explicitly convert non-UTF-8 inputs to UTF-8. However, that’s difficult. How can you determine the encoding of a random string? Therefore, you could just…
- Reject non-UTF-8 strings. How? Just assume anything that fails to parse as UTF-8 is not a valid input. In PHP you might need to rely on
mb_*functions or this utf8_is_valid_byte_stream function by Dennis Snell.
Major thanks to Dennis Snell for reviewing this post and for all our text encoding conversations – I’ve learned so much during this last year.
Leave a ReplyCancel reply