python - How to convert an emoji's unicode to its CLDR short name
Question
I am using Python to fetch comments and display them. When I print them, they look like this.
This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f
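How can I convert an emoji's unicode to its respective CLDR short name? For example, U+1F44D should print as thumbs up.
Answer
EDIT: I think I found a solution for the problem with the code \ud83d\udc9c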
text = text.encode('utf-16', 'surrogatepass').decode('utf-16')
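It converts the surrogate pair \ud83d\udc9c into the correct emoji code point \U0001f49c
Source: How to work with surrogate pairs in Python?
Wikipedia: Surrogates (UTF-16)
Using Google I found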
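As a quick check of the conversion above, here is a minimal sketch that round-trips the surrogate pair through UTF-16 and then looks up the character name. The helper name fix_surrogates is hypothetical and only used for illustration.

```python
import unicodedata


def fix_surrogates(text):
    """Merge UTF-16 surrogate pairs (e.g. '\\ud83d\\udc9c') into real code points."""
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


fixed = fix_surrogates('\ud83d\udc9c')
print(len(fixed))               # number of code points after merging
print(hex(ord(fixed)))          # code point of the merged character
print(unicodedata.name(fixed))  # Unicode character name of that code point
```

Note that this only helps for properly paired surrogates; an unpaired one should still be expected to fail when decoding.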
print('\U0001F44D'.encode('ascii', 'namereplace').decode())
Result:
\N{THUMBS UP SIGN}
and
import unicodedata
print(unicodedata.name('\U0001F44D'))
Result:
THUMBS UP SIGN
So it is good to use Google before you ask a question on Stack Overflow.
https://docs.python.org/3/howto/unicode.html
The same works for the whole text
text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''
print(text.encode('ascii', 'namereplace').decode())
Result:
This was heart wrenching \N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}
\N{THUMBS UP SIGN}
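Now you may want to remove \N{ and }, but this approach still has a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c: the surrogate pairs are left unconverted.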
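One possible way to handle both points at once, as a rough sketch: merge the surrogate pairs first, apply namereplace, then rewrite each \N{...} escape with a regular expression. The function name name_replace and the :NAME: output style are assumptions made for this illustration, not part of the original answer.

```python
import re


def name_replace(text):
    """Replace every non-ASCII character with :UNICODE NAME:."""
    # Merge surrogate pairs such as '\ud83d\udc9c' into real code points first.
    merged = text.encode('utf-16', 'surrogatepass').decode('utf-16')
    # namereplace turns each non-ASCII character into a \N{...} escape.
    escaped = merged.encode('ascii', 'namereplace').decode('ascii')
    # Strip the \N{ and } wrapper and mark the name with colons instead.
    return re.sub(r'\\N\{([^}]*)\}', r':\1:', escaped)


print(name_replace('Amazing compassion \ud83d\udc9c #tears'))
print(name_replace('ok \U0001F44D'))
```

A caveat of this sketch: if the input already contains a literal \N{...} sequence as plain text, it is rewritten as well.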
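You can use unicodedata in a for-loop to get the name of every character in the text, but it raises an error when a character has no name, e.g. '\n'. It also gives names for normal characters, so you may have to use unicodedata.category() to decide which characters to replace.
This also has a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c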
import unicodedata
# General category values: http://www.unicode.org/reports/tr44/#General_Category_Values
for char in text:
    try:
        print(char, '|', unicodedata.category(char), '|', unicodedata.name(char))
    except ValueError:
        # unicodedata.name() raises ValueError for characters without a name
        print(repr(char), '| (repr)')
Result:
T | Lu | LATIN CAPITAL LETTER T
h | Ll | LATIN SMALL LETTER H
i | Ll | LATIN SMALL LETTER I
s | Ll | LATIN SMALL LETTER S
| Zs | SPACE
w | Ll | LATIN SMALL LETTER W
a | Ll | LATIN SMALL LETTER A
s | Ll | LATIN SMALL LETTER S
| Zs | SPACE
h | Ll | LATIN SMALL LETTER H
e | Ll | LATIN SMALL LETTER E
a | Ll | LATIN SMALL LETTER A
r | Ll | LATIN SMALL LETTER R
t | Ll | LATIN SMALL LETTER T
| Zs | SPACE
w | Ll | LATIN SMALL LETTER W
r | Ll | LATIN SMALL LETTER R
e | Ll | LATIN SMALL LETTER E
n | Ll | LATIN SMALL LETTER N
c | Ll | LATIN SMALL LETTER C
h | Ll | LATIN SMALL LETTER H
i | Ll | LATIN SMALL LETTER I
n | Ll | LATIN SMALL LETTER N
g | Ll | LATIN SMALL LETTER G
| Zs | SPACE
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
'\n' | (repr)
A | Lu | LATIN CAPITAL LETTER A
m | Ll | LATIN SMALL LETTER M
a | Ll | LATIN SMALL LETTER A
z | Ll | LATIN SMALL LETTER Z
i | Ll | LATIN SMALL LETTER I
n | Ll | LATIN SMALL LETTER N
g | Ll | LATIN SMALL LETTER G
| Zs | SPACE
c | Ll | LATIN SMALL LETTER C
o | Ll | LATIN SMALL LETTER O
m | Ll | LATIN SMALL LETTER M
p | Ll | LATIN SMALL LETTER P
a | Ll | LATIN SMALL LETTER A
s | Ll | LATIN SMALL LETTER S
s | Ll | LATIN SMALL LETTER S
i | Ll | LATIN SMALL LETTER I
o | Ll | LATIN SMALL LETTER O
n | Ll | LATIN SMALL LETTER N
| Zs | SPACE
'\ud83d' | (repr)
'\udc9c' | (repr)
'\ud83d' | (repr)
'\udc9c' | (repr)
'\ud83d' | (repr)
'\udc9c' | (repr)
| Zs | SPACE
# | Po | NUMBER SIGN
t | Ll | LATIN SMALL LETTER T
e | Ll | LATIN SMALL LETTER E
a | Ll | LATIN SMALL LETTER A
r | Ll | LATIN SMALL LETTER R
s | Ll | LATIN SMALL LETTER S
'\n' | (repr)
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
❤ | So | HEAVY BLACK HEART
️ | Mn | VARIATION SELECTOR-16
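As an alternative to catching ValueError, unicodedata.name() accepts a second argument that is returned when the character has no name. A minimal sketch, where the wrapper name safe_name and the placeholder text are hypothetical:

```python
import unicodedata


def safe_name(char, default='<no name>'):
    """Return the Unicode name of char, or default if it has none."""
    return unicodedata.name(char, default)


# ascii() keeps the output printable on any terminal encoding.
for char in 'a\n\u2764':
    print(ascii(char), '|', unicodedata.category(char), '|', safe_name(char))
```

Unpaired surrogates such as '\ud83d' have no name either, so they also come back as the default.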
Because it has a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c, I replace the surrogates with ?
import unicodedata
text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''
result = []
for char in text:
    category = unicodedata.category(char)
    if category in ('So', 'Mn'):
        # symbols and non-spacing marks: replace with :NAME:
        result.append(':{}:'.format(unicodedata.name(char)))
    elif category == 'Cs':
        # surrogates have no name: replace with a placeholder
        result.append('?')
    else:
        result.append(char)
print(''.join(result))
Result:
This was heart wrenching :HEAVY BLACK HEART::VARIATION SELECTOR-16:
Amazing compassion ?????? #tears
:HEAVY BLACK HEART::VARIATION SELECTOR-16::HEAVY BLACK HEART::VARIATION SELECTOR-16::HEAVY BLACK HEART::VARIATION SELECTOR-16:
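If the surrogate pairs are merged before the loop runs, the purple hearts get a name instead of a ? placeholder. The following is a sketch under that assumption; the function name describe and the choice to drop VARIATION SELECTOR-16 silently are illustrative decisions, not something the original answer prescribes.

```python
import unicodedata


def describe(text):
    """Replace symbol characters with :NAME: after merging surrogate pairs."""
    text = text.encode('utf-16', 'surrogatepass').decode('utf-16')
    result = []
    for char in text:
        if unicodedata.category(char) == 'So':
            # Symbols (including most emoji) are replaced by their Unicode name.
            result.append(':{}:'.format(unicodedata.name(char, 'UNKNOWN')))
        elif char == '\ufe0f':
            # VARIATION SELECTOR-16 only requests emoji presentation; skip it.
            continue
        else:
            result.append(char)
    return ''.join(result)


text = ('This was heart wrenching \u2764\ufe0f\n'
        'Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears')
print(describe(text))
```

These are still Unicode character names (HEAVY BLACK HEART), not CLDR short names (red heart).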
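EDIT: Using Google again, I found the external module emoji, which can convert some emoji to names, but it also has a problem with \ud83d\udc9c, so I used repr to display the result. As a side effect, repr also prints newlines as \n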
text = '''This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''
import emoji
# use_aliases=True is only accepted by older releases of the emoji module
#print(repr(emoji.demojize(text, use_aliases=True)))
print(repr(emoji.demojize(text)))
Result:
'This was heart wrenching :heart:\nAmazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears\n:heart::heart::heart:'
http://www.unicode.org/emoji/charts/full-emoji-list.html
https://www.webfx.com/tools/emoji-cheat-sheet/
http://unicode.org/Public/emoji/12.0/emoji-test.txt
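The emoji-test.txt file linked above lists the CLDR short name at the end of each data line, so a small lookup table can be built from it without any external module. The sketch below parses a few sample lines written in the layout I understand that file to use (code points ; status # emoji name, with an optional E-version field in newer releases). The sample lines, the regular expression, and the function name parse_emoji_test are assumptions for illustration; check them against the real file before relying on them.

```python
import re

# code points ; status # <emoji> [E<version>] <CLDR short name>
LINE_RE = re.compile(
    r'^(?P<cps>[0-9A-F ]+?)\s*;\s*(?P<status>[\w-]+)\s*#\s*\S+\s+'
    r'(?:E\d+\.\d+\s+)?(?P<name>.+)$'
)


def parse_emoji_test(lines):
    """Map each emoji sequence to the short name found on its data line."""
    names = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment/group headers
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything that does not look like a data line
        sequence = ''.join(chr(int(cp, 16)) for cp in match.group('cps').split())
        names[sequence] = match.group('name')
    return names


# Hand-written sample in the assumed layout, not a copy of the real file.
SAMPLE = """\
# group: sample
1F44D                                      ; fully-qualified     # 👍 thumbs up
2764 FE0F                                  ; fully-qualified     # ❤️ red heart
1F49C                                      ; fully-qualified     # 💜 E0.6 purple heart
"""

names = parse_emoji_test(SAMPLE.splitlines())
print(names['\U0001F44D'])
print(names['\u2764\ufe0f'])
```

With the real file, the same function could be fed the lines of a downloaded copy; multi-code-point sequences such as \u2764\ufe0f are then matched as a whole rather than character by character.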
BTW: I also found the module demoji, which can find emoji in text and give their names. But it also has a problem with the code \ud83d\udc9c
import demoji
# run only once after installing module
demoji.download_codes()
print(demoji.findall(text))
demoji.download_codes() has to be run only once, after installing the module.
Result:
{'❤️': 'red heart'}
If you get it as JSON data, "\ud83d\udc9c", then you should have no problem because the JSON parser should convert it automatically
import json
# escaped unicode in " "
data = r'"\ud83d\udc9c"'
print(json.loads(data))
In other cases you have to convert it yourself
# convert to escaped unicode and put in " "
data = '"{}"'.format('\ud83d\udc9c'.encode('unicode-escape').decode())
print(json.loads(data))
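To compare the JSON route with the UTF-16 route from the start of the answer, here is a small sketch. The helper names merge_via_json and merge_via_utf16 are hypothetical. It also shows one limitation of the JSON route as written: unicode-escape does not escape double quotes, so text containing " no longer forms a valid JSON string.

```python
import json


def merge_via_json(text):
    """Escape the text, wrap it in quotes and let the JSON parser merge pairs."""
    escaped = text.encode('unicode-escape').decode('ascii')
    return json.loads('"{}"'.format(escaped))


def merge_via_utf16(text):
    """Merge surrogate pairs by round-tripping through UTF-16."""
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


pair = '\ud83d\udc9c'
print(merge_via_json(pair) == merge_via_utf16(pair))

try:
    merge_via_json('say "hi" \ud83d\udc9c')
except json.JSONDecodeError as exc:
    print('JSON route failed:', exc.msg)
```

For arbitrary text the UTF-16 round trip therefore looks like the safer default, while the JSON route fits data that already arrives as JSON.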