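    s = '中'

    s.encode('utf-16')
    # On Windows (little-endian) this returns b'\xff\xfe-N'.
    # The repr of a bytes object shows printable ASCII bytes as characters,
    # so the bytes 0x2d and 0x4e appear as '-' and 'N'.

    hex(ord('-'))  # '0x2d'
    hex(ord('N'))  # '0x4e'

    # So the actual bytes are b'\xff\xfe\x2d\x4e': the BOM FF FE, then 2D 4E.
    # Little-endian is used by default here because the 'utf-16' codec encodes
    # in the platform's native byte order. This depends on the CPU and is not
    # specific to Python.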
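The mix of escapes and ASCII characters in the repr can hide what the bytes really are. A minimal sketch, using only the standard library, that prints the same bytes in hex and checks the native-byte-order claim (the variable names `le`, `be`, `native` and `expected` are illustrative, not from the original post):

```python
import codecs
import sys

s = '中'  # U+4E2D

# Explicit byte orders do not depend on the platform and carry no BOM.
le = s.encode('utf-16-le')
be = s.encode('utf-16-be')
print(le.hex())  # 2d4e
print(be.hex())  # 4e2d

# Plain 'utf-16' prepends a BOM and uses the platform's native byte order.
native = s.encode('utf-16')
if sys.byteorder == 'little':
    expected = codecs.BOM_UTF16_LE + le
else:
    expected = codecs.BOM_UTF16_BE + be
print(native.hex(), native == expected)
```

On a little-endian machine the last line should show `fffe2d4e`, which matches the bytes worked out by hand above.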
Name | UTF-8 | UTF-16 | UTF-16BE | UTF-16LE | UTF-32 | UTF-32BE | UTF-32LE |
---|---|---|---|---|---|---|---|
Smallest code point | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 |
Largest code point | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF |
Code unit size | 8 bits | 16 bits | 16 bits | 16 bits | 32 bits | 32 bits | 32 bits |
Byte order | N/A | <BOM> | big-endian | little-endian | <BOM> | big-endian | little-endian |
Fewest bytes per character | 1 | 2 | 2 | 2 | 4 | 4 | 4 |
Most bytes per character | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
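The "fewest" and "most bytes per character" rows can be checked directly. The following is a small sketch; the sample characters and the helper name `byte_lengths` are chosen here for illustration, and the explicit little-endian codecs are used so that no BOM is counted:

```python
encodings = ['utf-8', 'utf-16-le', 'utf-32-le']

def byte_lengths(ch):
    """Return the encoded length of a single character in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in encodings}

# U+0041 (ASCII), U+00E9, U+4E2D (in the BMP), U+1F600 (outside the BMP)
for ch in ['A', '\u00e9', '中', '\U0001F600']:
    print(f'U+{ord(ch):04X}', byte_lengths(ch))
```

UTF-8 ranges from 1 to 4 bytes, UTF-16 uses 2 bytes inside the BMP and 4 bytes (a surrogate pair) outside it, and UTF-32 always uses 4.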
    # If the byte order is given explicitly, the output carries no BOM.
    s.encode('utf-16be')  # returns b'N-', i.e. b'\x4e\x2d', with no BOM
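    with open('res.txt', 'wb') as f:
        # Written without a BOM. Not recommended: UTF-16 and UTF-32 files
        # should carry a BOM where possible.
        f.write(s.encode('utf-16be'))

    with open('res.txt', 'w', encoding='utf-16be') as f:
        # s is a Unicode str, so it is encoded to bytes with the given
        # encoding before being written to the file. As noted above,
        # prefer an encoding that writes a BOM.
        f.write(s)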
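For comparison, here is a sketch of the BOM-carrying variant, assuming the plain `'utf-16'` codec is acceptable for the file. It writes to a temporary directory rather than `res.txt` so that it leaves nothing behind; the names `raw`, `has_bom` and `text` are illustrative:

```python
import codecs
import os
import tempfile

s = '中'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'res.txt')

    # encoding='utf-16' writes a BOM first, then the text in native byte order.
    with open(path, 'w', encoding='utf-16') as f:
        f.write(s)

    # Inspect the raw bytes to confirm that the file starts with a UTF-16 BOM.
    with open(path, 'rb') as f:
        raw = f.read()
    has_bom = raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

    # Reading with 'utf-16' consumes the BOM and takes the byte order from it.
    with open(path, encoding='utf-16') as f:
        text = f.read()

print(has_bom, text == s)
```

Because the reader takes the byte order from the BOM, the file can be read back correctly on a machine with the opposite byte order.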
Bytes | Encoding Form |
---|---|
00 00 FE FF | UTF-32, big-endian |
FF FE 00 00 | UTF-32, little-endian |
FE FF | UTF-16, big-endian |
FF FE | UTF-16, little-endian |
EF BB BF | UTF-8 |
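The table above can be turned into a simple BOM sniffer. This is a sketch only: `sniff_bom` and `BOMS` are hypothetical names introduced here, and the BOM constants come from the standard `codecs` module. Note that UTF-16 LE data whose first character after the BOM is U+0000 is indistinguishable from UTF-32 LE by signature alone.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the UTF-16 LE BOM (FF FE), so the order of the checks matters.
BOMS = [
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
]

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# Usage: build big-endian UTF-16 with a BOM, detect it, decode the rest.
data = codecs.BOM_UTF16_BE + '中'.encode('utf-16-be')
enc, n = sniff_bom(data)
print(enc, data[n:].decode(enc))
```

Returning the BOM length alongside the encoding lets the caller skip the signature before decoding with the explicit big-endian or little-endian codec.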
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature, that is, an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts. [AF]
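In Python the UTF-8 signature is handled by the standard `utf-8-sig` codec. A brief sketch of the behavior the answer describes, using a made-up shell script line as the sample text:

```python
import codecs

text = '#!/bin/sh\n'

with_bom = text.encode('utf-8-sig')  # BOM (EF BB BF) + UTF-8 bytes
without_bom = text.encode('utf-8')   # UTF-8 bytes only

print(with_bom[:3] == codecs.BOM_UTF8)  # True
print(with_bom.startswith(b'#!'))       # False: the BOM hides the "#!"
print(without_bom.startswith(b'#!'))    # True

# Plain 'utf-8' keeps the BOM as the character U+FEFF;
# 'utf-8-sig' strips it when present and is harmless when it is absent.
print(repr(with_bom.decode('utf-8')[0]))        # '\ufeff'
print(with_bom.decode('utf-8-sig') == text)     # True
print(without_bom.decode('utf-8-sig') == text)  # True
```

Decoding with `utf-8-sig` is therefore a reasonable choice when the input may or may not start with a BOM, while plain `utf-8` is the safer choice for output that other tools will parse from the first byte.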
Original post: https://www.cnblogs.com/zhouww/p/13544266.html