Why Is The Output Of Print In Python2 And Python3 Different With The Same String?
Solution 1:
Consider the following snippet of code:
import sys
for i in range(128, 256):
    sys.stdout.write(chr(i))
Run this with Python 2 and look at the result with hexdump -C:
00000000 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f |................|
Et cetera. No surprises; 128 bytes from 0x80 to 0xff.
Do the same with Python 3:
00000000 c2 80 c2 81 c2 82 c2 83 c2 84 c2 85 c2 86 c2 87 |................|
...
00000070 c2 b8 c2 b9 c2 ba c2 bb c2 bc c2 bd c2 be c2 bf |................|
00000080 c3 80 c3 81 c3 82 c3 83 c3 84 c3 85 c3 86 c3 87 |................|
...
000000f0 c3 b8 c3 b9 c3 ba c3 bb c3 bc c3 bd c3 be c3 bf |................|
To summarize:
- Everything from 0x80 to 0xbf has 0xc2 prepended.
- Everything from 0xc0 to 0xff has bit 6 set to zero and has 0xc3 prepended.
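The two rules above can be checked in Python 3 by encoding each character to UTF-8 and inspecting the bytes. This is only a small sketch; the helper name utf8_pairs is made up for illustration.

```python
# Encode each code point from 0x80 to 0xff as UTF-8 and inspect the bytes.
def utf8_pairs(start=0x80, stop=0x100):
    """Return {code_point: utf8_bytes} for the range (hypothetical helper)."""
    return {cp: chr(cp).encode("utf-8") for cp in range(start, stop)}

pairs = utf8_pairs()

for cp, encoded in pairs.items():
    lead, trail = encoded              # every value in this range takes two bytes
    if cp < 0xC0:
        assert lead == 0xC2 and trail == cp            # 0xc2 prepended
    else:
        assert lead == 0xC3 and trail == cp & ~0x40    # bit 6 cleared, 0xc3 prepended

print(pairs[0x87].hex(), pairs[0xC0].hex())   # prints: c287 c380
```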
So, what’s going on here?
In Python 2, strings are byte strings and no conversion is done. Tell it to write something outside the 0-127 ASCII range, and it says “okey-doke!” and just writes those bytes. Simple.
In Python 3, strings are Unicode. When non-ASCII characters are written, they must be encoded in some way. The encoding used for stdout depends on the environment; here it is UTF-8.
So, how are these values encoded in UTF-8?
Code points from 0x80 to 0x7ff are encoded as follows:
110vvvvv 10vvvvvv
Where the 11 v characters are the bits of the code point.
Thus:
0x80 hex
10000000 8-bit binary
00010000000 11-bit binary
00010 000000 divide into vvvvv vvvvvv
11000010 10000000 resulting UTF-8 octets in binary
0xc2 0x80 resulting UTF-8 octets in hex

0xc0 hex
11000000 8-bit binary
00011000000 11-bit binary
00011 000000 divide into vvvvv vvvvvv
11000011 10000000 resulting UTF-8 octets in binary
0xc3 0x80 resulting UTF-8 octets in hex
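The same bit-splitting can be written out with shifts and masks. The sketch below handles only the two-byte range and compares its result with Python 3's own encoder; encode_two_byte is an illustrative name, not a standard function.

```python
def encode_two_byte(code_point):
    """Encode a code point in 0x80..0x7ff as two UTF-8 octets (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled here")
    high = 0b11000000 | (code_point >> 6)        # 110vvvvv: top 5 of the 11 bits
    low = 0b10000000 | (code_point & 0b111111)   # 10vvvvvv: bottom 6 bits
    return bytes([high, low])

for cp in (0x80, 0xC0, 0x87):
    manual = encode_two_byte(cp)
    # The hand-rolled result should agree with the built-in UTF-8 encoder.
    assert manual == chr(cp).encode("utf-8")
    print(hex(cp), "->", " ".join("{:02x}".format(b) for b in manual))
```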
So that’s why you’re getting a c2 before 87.
How to avoid all this in Python 3? Use the bytes type.
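As a rough sketch of what that looks like, the loop from the top can be replaced by a single bytes object written to a binary stream. An in-memory io.BytesIO stands in for stdout here so the result can be inspected; write_raw is a hypothetical helper.

```python
import io

data = bytes(range(128, 256))   # 128 raw bytes, 0x80 through 0xff

def write_raw(stream, payload):
    """Write bytes to a binary stream unchanged (hypothetical helper)."""
    stream.write(payload)

# An in-memory binary stream stands in for stdout so the bytes can be checked.
sink = io.BytesIO()
write_raw(sink, data)
assert sink.getvalue() == data      # no c2/c3 prefixes appear
print(len(sink.getvalue()))         # prints: 128

# For real output, pass the binary layer under stdout instead
# (after "import sys"): write_raw(sys.stdout.buffer, data)
```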
Solution 2:
Python 2's default string type is byte strings. Byte strings are written "abc" while Unicode strings are written u"abc".
Python 3's default string type is Unicode strings. Byte strings are written as b"abc" while Unicode strings are written "abc" (u"abc" still works, too). Since there are over a million possible Unicode code points, printing them as bytes requires an encoding (UTF-8 in your case), which can take multiple bytes per code point.
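A short Python 3 sketch of the two literal types, and of how many bytes UTF-8 needs for a few sample characters (the samples are arbitrary picks for illustration):

```python
text = "abc"     # Unicode string (str)
raw = b"abc"     # byte string (bytes)
assert isinstance(text, str) and isinstance(raw, bytes)
assert text.encode("utf-8") == raw    # ASCII characters map to one byte each

# One character each from the 1-, 2-, 3- and 4-byte UTF-8 ranges.
samples = ["a", "\u00e9", "\u20ac", "\U0001f600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)   # prints: [1, 2, 3, 4]
```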
First use a byte string in Python 3 to get the same type as Python 2's default string. Then, because Python 3's print works with Unicode strings, use sys.stdout.buffer.write to write to the raw stdout interface, which expects byte strings.
python3 -c 'import sys; sys.stdout.buffer.write(b"\x08\x04\x87\x18")'
Note that writing to a file raises similar issues. To avoid any encoding translation, open the file in binary mode ('wb') and write byte strings.
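The difference shows up in the file sizes. The sketch below writes the same 128 values once in binary mode and once in text mode with an explicit UTF-8 encoding; the temporary directory and file names are placeholders chosen for the example.

```python
import os
import tempfile

data = bytes(range(128, 256))   # 0x80 through 0xff

with tempfile.TemporaryDirectory() as tmp:
    binary_path = os.path.join(tmp, "raw.bin")    # hypothetical file names
    text_path = os.path.join(tmp, "text.txt")

    # Binary mode: the bytes go to disk unchanged.
    with open(binary_path, "wb") as f:
        f.write(data)

    # Text mode: each character is encoded, here explicitly as UTF-8.
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("".join(chr(b) for b in data))

    binary_size = os.path.getsize(binary_path)
    text_size = os.path.getsize(text_path)

print(binary_size, text_size)   # prints: 128 256
```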