python - Convert UTF-8 special characters (accents) to their extended ASCII equivalents
Problem description
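I want to use Python to convert UTF-8 special characters (accents and so on) to their extended ASCII equivalents (purists will say there is no such thing, so here is a link to what I mean).
So basically I want to read in a UTF-8 file and write out an extended ASCII file (something like Latin-1; I am on Windows, in case that information is needed). I have read all the Unicode blog posts and still don't understand a word of them, but I want to keep as much information as possible. So for the UTF-8 character á, I want to convert it to the extended ASCII equivalent á. I don't want to ignore or lose the character, and I don't want to use a. For characters that have no extended ASCII equivalent, I just want to use a character of my choice, such as ~, although I would like to convert some characters, such as ß, to ss if ß does not exist in extended ASCII.
Is there anything in Python 3 that can do this, or can you give some example code showing how I would do it?
Does anyone know of a site that lists the UTF-8 equivalents of the extended ASCII characters?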
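One way to get such a list without a website is to ask Python itself. The sketch below assumes that "extended ASCII" here means Windows code page 1252 (the `cp1252` codec used later in this post); the function name `cp1252_table` is made up for illustration.

```python
def cp1252_table():
    """Return (byte value, character, Unicode code point) for bytes 128..255.

    Assumption: "extended ASCII" means Windows-1252. Bytes that the codec
    leaves undefined are skipped.
    """
    rows = []
    for byte in range(128, 256):
        try:
            char = bytes([byte]).decode('cp1252')
        except UnicodeDecodeError:
            continue  # a few byte values are unassigned in cp1252
        rows.append((byte, char, ord(char)))
    return rows
```

Looping over `cp1252_table()` and printing each row gives the byte value next to the Unicode code point. For 160 to 255 the two numbers are identical; for 128 to 159 they differ, which matters for any code that passes byte values to `chr()`.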
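Based on the comments below, I came up with this code. Sadly it does not work well, because most special characters come back as ? instead of ê (no idea why):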
# -*- coding: utf-8 -*-
f_in = open(r'E:/work/python/lyman.txt', 'rU', encoding='utf8')
raw = f_in.read()
f_out = open(r'E:/work/python/lyman_ascii.txt', 'w', encoding='cp1252', errors='replace')
retval = []
for char in raw:
    codepoint = ord(char)
    if codepoint < 0x80: # Basic ASCII
        retval.append(str(char))
        continue
    elif codepoint > 0xeffff:
        continue # Characters in Private Use Area and above are ignored
    # ë
    elif codepoint == 235:
        retval.append(chr(137))
        continue
    # ê
    elif codepoint == 234:
        retval.append(chr(136))
        continue
    # ’
    elif codepoint == 8217:
        retval.append(chr(39)) # 146 gives ? for some reason
        continue
    else:
        print(char)
        print(codepoint)
print(''.join(retval))
f_out.write(''.join(retval))
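The `?` characters can be traced to what `chr()` returns in Python 3: a Unicode code point, not a byte value in code page 1252. `chr(136)` is the control character U+0088, which the `cp1252` codec cannot encode, so `errors='replace'` writes `?` in its place. The character ê is already code point 234 and encodes to a single byte with no remapping. A short check, using only built-in string methods:

```python
# chr(n) yields Unicode code point n, not byte n of code page 1252.
as_written = chr(136).encode('cp1252', errors='replace')  # U+0088, a C1 control
unchanged = 'ê'.encode('cp1252')                          # U+00EA maps directly
quote = '\u2019'.encode('cp1252')                         # right single quote ’
chr_146 = chr(146).encode('cp1252', errors='replace')     # U+0092, also a control

print(as_written, unchanged, quote, chr_146)
```

The same reasoning covers the "146 gives ?" comment: the codec already maps ’ (U+2019) to byte 146, whereas `chr(146)` is a different, unencodable character.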
Solution
This seems to work:
# -*- coding: utf-8 -*-
# Don't use codecs in Python 3; the built-in open() accepts an encoding.
# Mode 'r' already gives universal newlines ('rU' was removed in Python 3.11).
with open(r'af_massaged.txt', 'r', encoding='utf8') as f_in:
    raw = f_in.read()

retval = []
for char in raw:
    codepoint = ord(char)
    if codepoint < 0x80:  # Basic ASCII.
        retval.append(char)
    elif codepoint > 0xeffff:
        continue  # Characters in Private Use Area and above are ignored.
    elif 128 <= codepoint <= 159:
        continue  # Ignore control characters in Latin-1.
    # Don't use unichr in Python 3, chr uses unicode. Get character codes from here: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin-1_Supplement
    # This was written on Windows 7 32 bit
    # For 160 to 255 Latin-1 matches unicode.
    elif 160 <= codepoint <= 255:
        retval.append(char)
    # –
    elif codepoint == 8211:
        retval.append(chr(45))   # hyphen-minus
    # ’
    elif codepoint == 8217:
        retval.append(chr(180))  # acute accent; chr(39) is the plain apostrophe
    # “
    elif codepoint == 8220:
        retval.append(chr(34))
    # ”
    elif codepoint == 8221:
        retval.append(chr(34))
    # €
    elif codepoint == 8364:
        retval.append('Euro')
    # Find missing mappings.
    else:
        print(char)
        print(codepoint)

# Uncomment for debugging.
#for i in range(128, 256):
#    retval.append(str(i) + ': ' + chr(i) + chr(13))
#print(''.join(retval))

# errors='replace' writes ? for anything cp1252 still cannot encode.
with open(r'af_massaged_ascii.txt', 'w', encoding='cp1252', errors='replace') as f_out:
    f_out.write(''.join(retval))
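As an alternative to listing code points by hand, the fallback rule from the question (keep what the target encoding has, substitute a chosen character otherwise) can be sketched as a custom encoding error handler. This is not the method used above, only one possible approach: the names `FALLBACKS`, `DEFAULT`, `tilde_fallback` and `to_cp1252` are hypothetical, and the mapping table is an illustrative assumption. Only `codecs.register_error` from the standard library is used, which is separate from the `codecs.open` style of file handling that the comment above advises against.

```python
import codecs

# Hypothetical fallback table; extend it for the characters in your own text.
FALLBACKS = {
    '\u2260': '!=',  # ≠ has no cp1252 equivalent
    '\u2192': '->',  # →
}
DEFAULT = '~'  # used for anything not listed in FALLBACKS


def tilde_fallback(error):
    """Substitute unencodable characters instead of dropping them."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # The codec may report a run of several unencodable characters at once.
    bad = error.object[error.start:error.end]
    replacement = ''.join(FALLBACKS.get(ch, DEFAULT) for ch in bad)
    return replacement, error.end  # resume encoding after the bad run


codecs.register_error('tilde_fallback', tilde_fallback)


def to_cp1252(text):
    """Encode text as cp1252, applying the fallbacks where needed."""
    return text.encode('cp1252', errors='tilde_fallback')
```

With this in place, characters that cp1252 already contains (á, ß, ’, €) are written as single bytes, and everything else goes through the table or becomes `~`. The handler name can also be passed to the built-in `open()`, for example `open(path, 'w', encoding='cp1252', errors='tilde_fallback')`, once it has been registered.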