Fix right-to-left writing in Python

Problem description

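I am trying to write the output of my script to file.txt, but when I write Arabic characters, the output in the file is written from right to left.
This is my script: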

import unicodedata
import sys
from tabulate import tabulate

headers = ["Unicode Point", "Character in UTF-8 + length", "Character normalized + length"]
data = []
for i in range(sys.maxunicode + 1):
  uni = chr(i)
  char8 = uni.encode('utf8', 'ignore').decode('utf8', 'ignore')
  char8norm = unicodedata.normalize('NFKC', char8)
  if len(char8) != len(char8norm):
    if i <= 0xFFFF:
      str1 = "U+" + hex(i)[2:].rjust(4, '0')
    else:
      str1 = "U+" + hex(i)[2:].rjust(8, '0')
    str2 = char8 + " ---> " + str(len(char8))
    str3 = char8norm + " ---> " + str(len(char8norm))
    data.append([str1, str2, str3])
# Write the table once and close the file when done
with open('multiplierNFD.txt', 'a', encoding='utf8') as f:
  f.write(tabulate(data, headers=headers))

Here is a sample of the output:

U+fb16           ﬖ ---> 1                       վն ---> 2
U+fb17           ﬗ ---> 1                       մխ ---> 2
U+fb1d           יִ ---> 1                       יִ ---> 2
U+fb1f           ײַ ---> 1                       ײַ ---> 2
U+fb2a           שׁ ---> 1                       שׁ ---> 2

How can I avoid this and print/save the output like the first two lines?

Tags: python, unicode, right-to-left, python-unicode

Solution


Wrap the characters in a left-to-right override:

import unicodedata
import sys
from tabulate import tabulate

# U+202D forces the characters that follow to display left to right
ltr = '\N{LEFT-TO-RIGHT OVERRIDE}'

headers = ["Unicode", "Character + UTF-8 length", "NFKC + UTF-8 length"]
data = []
for i in range(sys.maxunicode + 1):
    uni = chr(i)
    nfkc = unicodedata.normalize('NFKC', uni)
    # Keep only characters whose NFKC form has a different number of code points
    if len(uni) != len(nfkc):
        str1 = f'U+{i:04X}'
        # Report the length in UTF-8 bytes, not in code points
        str2 = f'{ltr}{uni}{ltr} ---> {len(uni.encode())}'
        str3 = f'{ltr}{nfkc}{ltr} ---> {len(nfkc.encode())}'
        data.append([str1, str2, str3])

with open('multiplierNFD.txt', 'w', encoding='utf8') as f:
    f.write(tabulate(data, headers=headers))

Sample output:

Unicode    Character + UTF-8 length    NFKC + UTF-8 length
---------  --------------------------  --------------------------
...
U+FB16     ‭ﬖ‭ ---> 3                    ‭վն‭ ---> 4
U+FB17     ‭ﬗ‭ ---> 3                    ‭մխ‭ ---> 4
U+FB1D     ‭יִ‭ ---> 3                    ‭יִ‭ ---> 4
U+FB1F     ‭ײַ‭ ---> 3                    ‭ײַ‭ ---> 4
U+FB2A     ‭שׁ‭ ---> 3                    ‭שׁ‭ ---> 4
...
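
An override normally stays in effect until a POP DIRECTIONAL FORMATTING character (U+202C) or the end of the line. The code above closes each cell with a second override instead, which is enough for this table. A variant that ends the override explicitly might look like the sketch below; the helper name wrap_ltr is hypothetical and not part of any library.

```python
import unicodedata

LRO = '\N{LEFT-TO-RIGHT OVERRIDE}'       # U+202D
PDF = '\N{POP DIRECTIONAL FORMATTING}'   # U+202C


def wrap_ltr(text):
    """Force left-to-right display of text, then end the override."""
    return f'{LRO}{text}{PDF}'


ch = chr(0xFB2A)  # HEBREW LETTER SHIN WITH SHIN DOT, taken from the sample output
nfkc = unicodedata.normalize('NFKC', ch)

# Measure the lengths before wrapping so the control characters are not counted
row = (f'{wrap_ltr(ch)} ---> {len(ch.encode())}    '
       f'{wrap_ltr(nfkc)} ---> {len(nfkc.encode())}')
print(row)
```

The control characters are invisible, but they are still written to the file, so anything that later parses the file has to strip them.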

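I also cleaned up the code a little and made it output the UTF-8 length, as the headers say, rather than the length in code points. Don't confuse Unicode code points with the UTF-8 encoding. For example, this line does nothing: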

char8 = uni.encode('utf8', 'ignore').decode('utf8', 'ignore')

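Every code point except the lone surrogates (U+D800 to U+DFFF) can be encoded in UTF-8, so for any real character there is nothing to ignore, and decoding converts the bytes straight back to the original character. For every row your code outputs, uni == char8.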
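
A quick check of these points, as a small sketch that uses U+FB2A from the sample output: len() on a str counts code points, the length of the encoded bytes counts UTF-8 bytes, and the encode/decode round trip gives back the same string.

```python
import unicodedata

uni = chr(0xFB2A)                           # a single code point
nfkc = unicodedata.normalize('NFKC', uni)   # decomposes into two code points

print(len(uni), len(uni.encode('utf8')))    # 1 3  (code points, UTF-8 bytes)
print(len(nfkc), len(nfkc.encode('utf8')))  # 2 4

# The round trip from the question returns the same string
char8 = uni.encode('utf8', 'ignore').decode('utf8', 'ignore')
print(char8 == uni)                         # True

# Lone surrogates are the one exception: they cannot be encoded,
# so 'ignore' drops them and the round trip yields an empty string
surrogate = chr(0xD800)
roundtrip = surrogate.encode('utf8', 'ignore').decode('utf8', 'ignore')
print(repr(roundtrip))                      # ''
```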

