Python 3 - If A String Contains Only Ascii, Is It Equal To The String As Bytes?

December 27, 2023

Consider Python 3 SMTPD - the data received is contained in a string. http://docs.python.org/3.4/library/smtpd.html quote: 'and data is a string containing the contents of the e-ma

Solution 1:

"If a string contains only ASCII, is it equal to the string as bytes?"

No. It is not equal in Python 3:

>>> '1' == b'1'
False

A bytes object is not equal to a str (Unicode string) object, in the same way that an integer is not equal to a string:

>>> '1' == 1
False

In some programming languages the above comparisons are true, e.g., in Python 2:

>>> b'1' == u'1'
True

and 1 == '1' in Perl:

$ perl -e "print qq(True\n) if 1 == q(1)"
True

Your question is a good example of why the stricter Python 3 behaviour is preferable. It forces programmers to confront their text/bytes misconceptions without waiting for their code to break for some input.
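To compare text with bytes in Python 3, one side has to be converted explicitly with a named encoding. A minimal sketch (the variable names are illustrative only):

```python
text = '1'
raw = b'1'

# str and bytes never compare equal, even for pure-ASCII content.
mixed_equal = (text == raw)

# Convert explicitly, naming the encoding, and compare like with like.
as_bytes_equal = (text.encode('ascii') == raw)
as_text_equal = (text == raw.decode('ascii'))

print(mixed_equal, as_bytes_equal, as_text_equal)  # False True True
```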

"Strings in Python 3 are Unicode."

Yes. Strings are immutable sequences of Unicode code points in Python 3.

"Emails are always ASCII."

Most emails are transported as 7-bit messages (ASCII range: hex 00-7F), though "virtually all modern email servers are 8-bit clean", i.e., 8-bit content won't be corrupted. The 8BITMIME extension also sanctions the passing of some 8-bit content.

In other words: emails are not always ASCII.

"Pure ASCII is valid Unicode."

ASCII is a character encoding. You can decode some byte sequences to Unicode using the US-ASCII character encoding. Unicode strings have no associated character encoding, i.e., you can encode them into bytes using any character encoding that can represent the corresponding Unicode code points.
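As an illustration of the point that a Unicode string has no encoding of its own, the sketch below encodes one string with several codecs. The sample text is an arbitrary choice for the example:

```python
text = 'caf\u00e9'  # four code points; the last one is U+00E9

# The same string yields different bytes under different encodings.
utf8_bytes = text.encode('utf-8')      # b'caf\xc3\xa9'
latin1_bytes = text.encode('latin-1')  # b'caf\xe9'

# US-ASCII cannot represent U+00E9, so encoding fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Going the other way: bytes in the ASCII range decode with US-ASCII.
decoded = b'cafe'.decode('ascii')

print(utf8_bytes, latin1_bytes, ascii_ok, decoded)
```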

"Therefore the email that came in is pure ASCII (which is valid Unicode), therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?"

If the input is in the ASCII range, then data.decode('ascii', 'strict').encode('ascii') == data. However, Lib/smtpd.py applies some conversions to the input data (according to RFC 5321), so the content that you get as data may differ from what was sent even if the input is pure ASCII.
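The round-trip property can be checked directly. The helper name below is hypothetical and exists only for this sketch:

```python
def survives_ascii_round_trip(data: bytes) -> bool:
    """Return True if decoding as strict ASCII and re-encoding gives back data."""
    try:
        return data.decode('ascii', 'strict').encode('ascii') == data
    except UnicodeDecodeError:
        # At least one byte is outside the ASCII range (0x80-0xFF).
        return False


print(survives_ascii_round_trip(b'Subject: hello\r\n'))  # True
print(survives_ascii_round_trip(b'Subject: caf\xe9\r\n'))  # False
```

This only covers the codec step; it says nothing about the RFC 5321 conversions that smtpd.py applies on top.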

"How do I save to a file Python 3's SMTPD DATA as PRECISELY the bytes that were received?"
my goal is not to find malformed emails but to save inbound emails to disk in precisely the binary/bytes form that they arrived.

The bug that you've linked (smtpd.py should not decode utf-8) means that smtpd.py is not 8-bit clean.

You could override the SMTPChannel.collect_incoming_data method from smtpd.py to save the incoming bytes as is.
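One possible shape for such an override is a mixin that copies each chunk before passing it on. This is only a sketch: RawCaptureMixin, FakeChannel and CapturingChannel are hypothetical names, and FakeChannel is a stand-in so the example runs without the smtpd module (which was removed from the standard library in Python 3.12). In real use the mixin would be combined with smtpd.SMTPChannel instead.

```python
class RawCaptureMixin:
    """Hypothetical mixin: keep a copy of every chunk before the base class sees it."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.raw_chunks = []  # bytes objects, in arrival order

    def collect_incoming_data(self, data):
        self.raw_chunks.append(bytes(data))  # untouched copy of the chunk
        super().collect_incoming_data(data)  # normal processing continues


class FakeChannel:
    """Stand-in for smtpd.SMTPChannel (an assumption made for this sketch)."""

    def __init__(self):
        self.decoded = []

    def collect_incoming_data(self, data):
        # Simulates a channel that turns incoming bytes into text.
        self.decoded.append(data.decode('utf-8', 'replace'))


class CapturingChannel(RawCaptureMixin, FakeChannel):
    pass


channel = CapturingChannel()
channel.collect_incoming_data(b'caf\xe9')
print(channel.raw_chunks)  # [b'caf\xe9']
print(channel.decoded)     # ['caf\ufffd']
```

Note that asynchat appears to hand chunks to collect_incoming_data without the terminator it is scanning for, and the method is called for command lines as well as message data, so a real implementation would have to account for both when writing to disk.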

"A string of ASCII text is also valid UTF-8 text."

It is true. It is a nice property of UTF-8 encoding. If you can decode a byte sequence into Unicode using US-ASCII character encoding then you can also decode the bytes using UTF-8 character encoding (and the resulting Unicode code points are the same in both cases).
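A quick check of this property over every byte in the ASCII range (a sketch; the variable names are illustrative):

```python
all_ascii = bytes(range(128))  # every byte value from 0x00 to 0x7F

from_ascii = all_ascii.decode('ascii')
from_utf8 = all_ascii.decode('utf-8')

# Both codecs produce the same code points for ASCII-only input ...
same_text = (from_ascii == from_utf8)
# ... and encoding that text as UTF-8 gives back the original bytes.
same_bytes = (from_ascii.encode('utf-8') == all_ascii)

print(same_text, same_bytes)  # True True
```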

smtpd.py should have used either latin1 (it decodes any byte sequence) or ascii (with the 'strict' error handler, to fail on any non-ascii byte) instead of utf-8 (it allows some non-ascii bytes, which is bad).
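The difference between the three codecs shows up as soon as a byte outside the ASCII range appears. In this sketch, try_decode is a hypothetical helper that returns None when strict decoding fails:

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return data.decode(encoding, 'strict')
    except UnicodeDecodeError:
        return None


# latin1 maps each of the 256 byte values to one code point, so it never fails
# and the original bytes can always be recovered.
every_byte = bytes(range(256))
latin1_round_trip = every_byte.decode('latin-1').encode('latin-1') == every_byte

valid_utf8 = b'\xc3\xa9'  # U+00E9 encoded as UTF-8
lone_byte = b'\xe9'       # U+00E9 encoded as latin1; not valid UTF-8 on its own

print(latin1_round_trip)                 # True
print(try_decode(valid_utf8, 'ascii'))   # None: strict ascii rejects non-ascii bytes
print(try_decode(valid_utf8, 'utf-8'))   # decodes: utf-8 accepts these bytes
print(try_decode(lone_byte, 'utf-8'))    # None: utf-8 rejects this one
print(try_decode(lone_byte, 'latin-1'))  # decodes: latin1 accepts everything
```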

Keep in mind:

some emails may have bytes outside ascii range
de-transparency according to RFC 5321 doesn't preserve input bytes as-is even if they are all in ascii range
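To make the second point concrete, the sketch below imitates the kind of processing involved: removing the leading dot that RFC 5321 (Section 4.5.2) has the sender add, and joining lines with a bare LF. undo_transparency is a hypothetical helper written for this example, and the CRLF-to-LF step is an assumption about what Lib/smtpd.py does rather than a copy of its code.

```python
def undo_transparency(wire_body: bytes) -> bytes:
    """Hypothetical helper: strip one leading dot per line, join lines with LF."""
    lines = []
    for line in wire_body.split(b'\r\n'):
        if line.startswith(b'.'):
            lines.append(line[1:])  # drop the dot added for transparency
        else:
            lines.append(line)
    return b'\n'.join(lines)


wire = b'Hello\r\n..hidden dot\r\nBye'  # pure ASCII as sent on the wire
processed = undo_transparency(wire)

print(processed)          # b'Hello\n.hidden dot\nBye'
print(processed == wire)  # False, although every byte is ASCII
```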

alezinhacris
