Python 3 - If A String Contains Only Ascii, Is It Equal To The String As Bytes?
Solution 1:
if a string contains only ASCII, is it equal to the string as bytes?
No. It is not equal in Python 3:
>>> '1' == b'1'
False
A bytes object is not equal to a str (Unicode string) object, in the same way that an integer is not equal to a string:
>>> '1' == 1
False
In some programming languages the above comparisons are true e.g., in Python 2:
>>> b'1' == u'1'
True
and 1 == '1' in Perl:
$ perl -e "print qq(True\n) if 1 == q(1)"
True
Your question is a good example of why the stricter Python 3 behaviour is preferable. It forces programmers to confront their text/bytes misconceptions without waiting for their code to break for some input.
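As a minimal sketch of what the stricter behaviour means in practice (the variable names are only illustrative), the two types become comparable once one side is converted explicitly:

```python
text = '1'
raw = b'1'

# str and bytes never compare equal in Python 3, even for pure-ASCII content.
mismatch = (text == raw)

# An explicit conversion in either direction makes the comparison meaningful.
same_as_text = (raw.decode('ascii') == text)
same_as_bytes = (text.encode('ascii') == raw)

print(mismatch, same_as_text, same_as_bytes)  # False True True
```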
- Strings in Python 3 are Unicode.
Yes. Strings are immutable sequences of Unicode code points in Python 3.
- Emails are always ASCII.
Most emails are transported as 7-bit messages (ASCII range: hex 00-7F). However, "virtually all modern email servers are 8-bit clean", i.e., 8-bit content won't be corrupted, and the 8BITMIME extension sanctions passing some 8-bit content.
In other words: emails are not always ASCII.
- Pure ASCII is valid Unicode.
ASCII is a character encoding. You can decode some byte sequences to Unicode using the US-ASCII character encoding. Unicode strings have no associated character encoding, i.e., you can encode them into bytes using any character encoding that can represent the corresponding Unicode code points.
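To illustrate that a str carries no encoding of its own, here is a small sketch (the sample text is an arbitrary choice) that encodes the same string with several codecs and gets different bytes each time:

```python
text = 'caf\u00e9'  # 'café': the last code point is outside the ASCII range

as_utf8 = text.encode('utf-8')       # b'caf\xc3\xa9'
as_latin1 = text.encode('latin-1')   # b'caf\xe9'
as_utf16 = text.encode('utf-16-le')  # b'c\x00a\x00f\x00\xe9\x00'

# US-ASCII cannot represent U+00E9, so encoding with it fails.
try:
    text.encode('ascii')
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

# Each byte form decodes back to the same code points with its own codec.
round_trips = (
    as_utf8.decode('utf-8') == text
    and as_latin1.decode('latin-1') == text
    and as_utf16.decode('utf-16-le') == text
)
```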
Therefore the email that came in is pure ASCII (which is valid Unicode), therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?
If the input is in the ASCII range, then data.decode('ascii', 'strict').encode('ascii') == data.
However, Lib/smtpd.py applies some conversions to the input data (according to RFC 5321), so the content that you get as data may differ from the input even if the input is pure ASCII.
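One such conversion is the RFC 5321 transparency rule: the sender prepends a dot to any line that starts with a dot, and the receiver removes it again. The function below is a simplified, hypothetical illustration of that receiving step, not the code in Lib/smtpd.py. It shows that pure-ASCII input can still differ from what is delivered:

```python
def unstuff_dots(wire_body: bytes) -> bytes:
    """Drop the extra leading dot from every CRLF-terminated line that has one."""
    lines = wire_body.split(b'\r\n')
    restored = [line[1:] if line.startswith(b'.') else line for line in lines]
    return b'\r\n'.join(restored)


# Pure-ASCII bytes as they might appear on the wire (second line is dot-stuffed).
wire_body = b'Hello\r\n..hidden file\r\nBye'
delivered = unstuff_dots(wire_body)

print(delivered)               # b'Hello\r\n.hidden file\r\nBye'
print(delivered == wire_body)  # False
```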
"How do I save to a file Python 3's SMTPD DATA as PRECISELY the bytes that were received?"
my goal is not to find malformed emails but to save inbound emails to disk in precisely the binary/bytes form that they arrived.
The bug that you've linked (smtpd.py should not decode utf-8) means that smtpd.py is not 8-bit clean.
You could override the SMTPChannel.collect_incoming_data method from smtpd.py to save the incoming bytes as is.
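A rough sketch of that idea follows. RawCaptureMixin and FakeChannel are hypothetical names: FakeChannel only stands in for smtpd.SMTPChannel so that the example runs without a socket. With a Python version that still ships smtpd, the mixin would be combined with the real channel class instead, and it would probably also need to separate SMTP command lines from message data, since both arrive through this method.

```python
class RawCaptureMixin:
    """Hypothetical mixin: keep every incoming chunk before normal processing."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.raw_chunks = []

    def collect_incoming_data(self, data):
        self.raw_chunks.append(bytes(data))  # untouched copy of the bytes
        super().collect_incoming_data(data)  # then let the base class proceed


class FakeChannel:
    """Stand-in for smtpd.SMTPChannel (an assumption made for this sketch)."""

    def __init__(self):
        self.decoded = []

    def collect_incoming_data(self, data):
        # Mimics a lossy decode step: invalid UTF-8 is replaced, not preserved.
        self.decoded.append(data.decode('utf-8', 'replace'))


class CapturingChannel(RawCaptureMixin, FakeChannel):
    pass


channel = CapturingChannel()
channel.collect_incoming_data(b'Subject: caf\xe9\r\n')
raw_message = b''.join(channel.raw_chunks)
```

Here raw_message still holds the original \xe9 byte, while the decoded copy kept by the stand-in base class has lost it.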
It is true. It is a nice property of UTF-8 encoding. If you can decode a byte sequence into Unicode using US-ASCII character encoding then you can also decode the bytes using UTF-8 character encoding (and the resulting Unicode code points are the same in both cases).
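This property can be checked directly for every byte value in the ASCII range (a minimal sketch):

```python
ascii_bytes = bytes(range(128))  # every byte value from 0x00 to 0x7F

via_ascii = ascii_bytes.decode('ascii')
via_utf8 = ascii_bytes.decode('utf-8')

# Both codecs yield the same code points, each equal to its byte value.
same_text = (via_ascii == via_utf8)
code_points = [ord(ch) for ch in via_utf8]
```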
smtpd.py should have used either latin1 (it decodes any byte sequence) or ascii (with the 'strict' error handler, to fail on any non-ASCII byte) instead of utf-8 (it accepts some non-ASCII byte sequences but not others, which is bad).
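The difference between the three codecs can be seen in a short sketch (the sample byte strings are arbitrary):

```python
def decodes(data: bytes, encoding: str) -> bool:
    """Return True if data decodes under encoding with the strict handler."""
    try:
        data.decode(encoding, 'strict')
        return True
    except UnicodeDecodeError:
        return False


every_byte = bytes(range(256))

# latin1 maps each byte to one code point, so any input round-trips losslessly.
latin1_round_trip = every_byte.decode('latin-1').encode('latin-1') == every_byte

# ascii with 'strict' rejects any byte above 0x7F.
ascii_accepts_non_ascii = decodes(b'caf\xc3\xa9', 'ascii')

# utf-8 accepts some non-ASCII byte sequences and rejects others.
utf8_accepts_valid = decodes(b'caf\xc3\xa9', 'utf-8')
utf8_accepts_invalid = decodes(b'caf\xe9', 'utf-8')

print(latin1_round_trip, ascii_accepts_non_ascii,
      utf8_accepts_valid, utf8_accepts_invalid)  # True False True False
```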
Keep in mind:
- some emails may have bytes outside ascii range
- de-transparency according to RFC 5321 doesn't preserve input bytes as-is even if they are all in ascii range