Converting hex to UTF8 in Perl does not work as expected

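Question

I am trying to understand UTF8 in Perl.

I have the following string: Alizéh. If I look up the hex for this string, I get 416c697ac3a968 from https://onlineutf8tools.com/convert-utf8-to-hexadecimal (this matches the original source of the string).

So I thought that packing that hex and encoding it as utf8 should produce the unicode string. But it produces something very different.

Can anyone explain what I am doing wrong?

Here is a simple test program to show my working.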

#!/usr/bin/perl

use strict;
use warnings;

use Text::Unaccent;
use Encode;

use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print "First test that the utf8 string Alizéh prints as expected\n\n";

print "=========================================== Hex to utf8 test start\n";

my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";

print "=========================================== Hex to utf8 test finish\n\n";

print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";

my ($hex) = unpack("H*", $utf8FromCode);

print "Hex of this string is now $hex\n";

print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);

$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now         $hex\n";

print "=========================================== utf8 from code test finish\n\n";

This prints:

First test that the utf8 string Alizéh prints as expected

=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish

=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now         416c697ae968
=========================================== utf8 from code test finish

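Any tips on how to take the hex value of a UTF8 string and turn it into a valid UTF8 scalar in Perl?

There is some more weirdness, which I will explain with this extended version: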

#!/usr/bin/perl

use strict;
use warnings;

use Text::Unaccent;
use Encode;

use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print "First test that the utf8 string Alizéh prints as expected\n\n";

print "=========================================== Hex to utf8 test start\n";

my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";

print "=========================================== Hex to utf8 test finish\n\n";

print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";

my ($hex) = unpack("H*", $utf8FromCode);

print "Hex of this string is now $hex\n";

print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);

$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now         $hex\n";

print "=========================================== utf8 from code test finish\n\n";

print "=========================================== Unaccent test start\n";

my $plaintest = unac_string('utf8', "Alizéh");

print "Alizéh passed to the unaccent gives $plaintest\n";


my $cleanpackedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "Packed version of the hex string prints as  $cleanpackedHexIntoPlainString\n";

my $packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);

print "Unaccenting the packed version gives $packedtest\n";

utf8::encode($cleanpackedHexIntoPlainString);
print "encoding the packed version it now prints as $cleanpackedHexIntoPlainString\n";

$packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);

print "Now unaccenting the packed version gives $packedtest\n";

print "=========================================== Unaccent test finish\n\n";

This prints:

First test that the utf8 string Alizéh prints as expected

=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish

=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now         416c697ae968
=========================================== utf8 from code test finish

=========================================== Unaccent test start
Alizéh passed to the unaccent gives Alizeh
Packed version of the hex string prints as  Alizéh
Unaccenting the packed version gives Alizeh
encoding the packed version it now prints as Alizéh
Now unaccenting the packed version gives AlizA©h
=========================================== Unaccent test finish

In this test, the unaccent library seems to accept the packed version of the hex string. I don't know why; can someone help me understand why that is?

Tags: perl, utf-8

Answer


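Unicode strings are first-class values in Perl; you don't need to jump through these hoops. You just need to recognize and keep track of when you have bytes and when you have characters. Perl won't differentiate for you, and every byte string is also a valid character string. In effect you are double-encoding your strings: pack already gave you the UTF-8 encoded bytes, and utf8::encode then treats each of those bytes as a character and encodes it again. The result is still a valid string, but it now holds the UTF-8 encoded bytes representing (the characters corresponding to) your UTF-8 encoded bytes.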
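To make the double encoding visible, here is a minimal sketch that dumps the bytes before and after utf8::encode. It only uses the hex string from the question; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Seven bytes: the UTF-8 encoding of "Alizeh" with an acute accent on the e.
my $bytes = pack 'H*', '416c697ac3a968';

# utf8::encode treats every byte as a character (code points 0-255)
# and encodes it again, in place, so C3 and A9 each become two bytes.
my $double = $bytes;
utf8::encode($double);

printf "packed bytes: %s (%d bytes)\n", unpack('H*', $bytes),  length $bytes;
printf "after encode: %s (%d bytes)\n", unpack('H*', $double), length $double;
```

The second line should show c383c2a9 where the first shows c3a9, which is the double-encoded sequence that the question's program ends up printing.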

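use utf8; decodes your source code from UTF-8, so with that declaration the literal strings that follow it are already Unicode strings and can be passed to any API that correctly accepts characters. To get the same from a string of UTF-8 bytes (which is what you produce by packing the hex representation of the bytes), use decode from Encode (or my nicer wrapper).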

use strict;
use warnings;
use utf8;
use Encode 'decode';

my $str = 'Alizéh'; # already decoded
my $hex = '416c697ac3a968';
my $bytes = pack 'H*', $hex;
my $chars = decode 'UTF-8', $bytes;
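As a rough check that decoding really yields the same value as the literal, the following sketch compares the two and converts back to hex. It writes the literal as "Aliz\x{E9}h" instead of relying on use utf8, purely so the example does not depend on how the file is saved; that substitution and the variable names are my own assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Same characters as 'Alizéh' under "use utf8", spelled with an escape.
my $literal = "Aliz\x{E9}h";
my $chars   = decode('UTF-8', pack('H*', '416c697ac3a968'));

print "decoded bytes equal the literal\n" if $chars eq $literal;
printf "length in characters: %d\n", length $chars;

# unpack on a character string works per character, so the accented
# letter shows up as its code point (e9), not as its UTF-8 bytes.
my $hex_of_chars = unpack 'H*', $chars;
print "hex of characters: $hex_of_chars\n";

# Encoding the characters gives back the bytes we started from.
my $hex_of_bytes = unpack 'H*', encode('UTF-8', $chars);
print "hex of encoded bytes: $hex_of_bytes\n";
```

This is also why the question's "utf8 from code" test prints 416c697ae968: the string was already decoded, so unpack was looking at characters rather than at UTF-8 bytes.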

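Unicode strings need to be encoded to UTF-8 for output to something that expects bytes, such as STDOUT. An :encoding(UTF-8) layer can be applied to such handles to do this automatically, and likewise to decode automatically from an input handle. Exactly what should be applied depends entirely on where your characters are coming from and where they are going. See this answer for too much information about the available options.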

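use Encode 'encode';
# Option 1: encode explicitly at the point of output
print encode 'UTF-8', "$chars\n";
# Option 2: apply an encoding layer once, then print characters directly
binmode *STDOUT, ':encoding(UTF-8)'; # warning: global effect
print "$chars\n";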
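If you want to see what a layer does without touching STDOUT, one option is to write through it into an in-memory filehandle and inspect the buffer. This is only a sketch; the in-memory buffer and the variable names are assumptions made for the demonstration, not part of the original program.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $chars = decode('UTF-8', pack('H*', '416c697ac3a968'));

# Write characters through an encoding layer into an in-memory buffer.
open my $out, '>:encoding(UTF-8)', \my $buffer or die "open failed: $!";
print {$out} $chars;
close $out;

# The buffer now holds bytes: the layer encoded once, and only once.
print "bytes written: ", unpack('H*', $buffer), "\n";

# Reading back through the same layer yields characters again.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open failed: $!";
my $read = <$in>;
close $in;

print "characters read back: ", length($read), "\n";
```

Because the layer is attached to a lexical handle here, it has no global effect, unlike binmode on STDOUT.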
