perl - Perl Encode::Guess with and without hints - detecting UTF-8
Problem description
I am confused by Encode::Guess. Suppose this is my Perl code:
    use strict;
    use warnings;
    use 5.18.2;
    use Encode;
    use Encode::Guess qw/utf8 iso-8859-1/;
    use open IO => ':encoding(UTF-8)', ':std';

    my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";    # UTF-8 bytes
    my $str2 = "2 = educa\x{e7}\x{e3}o";                # Latin-1 bytes

    say "A: " . fixEnc($str1);
    say "B: " . fixEnc($str1, 'hint');
    say "C: " . fixEnc($str2);
    say "D: " . fixEnc($str2, 'hint');
    say "";

    # Guess the encoding of $data and decode it. If a true second argument
    # is given, pass utf8 and iso-8859-1 as extra suspects for this call.
    sub fixEnc {
        my ($data, $hint) = @_;
        my $enc = $hint
            ? guess_encoding($data, qw/utf8 iso-8859-1/)
            : guess_encoding($data);

        # guess_encoding returns an object on success, an error string otherwise.
        return "ERROR: Can't guess: $enc for $data" unless ref $enc;

        my $utf8 = decode($enc->name, $data);
        return "encoding guess: " . $enc->name . "; result: $utf8";
    }
It produces:
A1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educação
B1: ERROR: Can't guess: utf8 or iso-8859-1 for 1 = educação
C1: encoding guess: iso-8859-1; result: 2 = educação
D1: encoding guess: iso-8859-1; result: 2 = educação
Now, if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' with 'use Encode::Guess;', I get
A2: encoding guess: utf8; result: 1 = educação
B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educação
C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação
D2: encoding guess: iso-8859-1; result: 2 = educação
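What causes the difference? In particular, why is utf8 not guessed when I pass the utf8 hint?
Edit: I posted an answer below. Basically, one has to realize that Guess works on character encodings and doesn't speak Portuguese! 'educaÃ§Ã£o', although not Portuguese, is something Guess cannot distinguish from the UTF-8 version of educação (unlike a Portuguese speaker).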
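Solution
I think this is what is happening. With use Encode::Guess qw/utf8 iso-8859-1/; the "hint" makes no difference (sorry for being unclear!), because both encodings are already suspects for every call, so we just have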
A1/B1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educação
and
C1/D1: encoding guess: iso-8859-1; result: 2 = educação
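For A1/B1, the string could be UTF-8 (educação) or Latin-1 (educaÃ§Ã£o): the bytes c3 a7 c3 a3 are valid in both encodings. The second reading looks wrong, but Encode::Guess cannot tell, because Guess works on character encodings and doesn't speak Portuguese!
Now, if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' with 'use Encode::Guess;', I get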
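The ambiguity can be checked directly. The following is a minimal sketch (the variable names are illustrative and not from the question) that strictly decodes the same bytes under both encodings. Neither call dies, and that is all Encode::Guess has to go on.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes of "educação" (the same byte sequence as in $str1).
my $bytes = "educa\xc3\xa7\xc3\xa3o";

# FB_CROAK makes decode() die on malformed input;
# LEAVE_SRC keeps decode() from modifying $bytes.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $as_utf8   = decode('UTF-8',      $bytes, $check);
my $as_latin1 = decode('iso-8859-1', $bytes, $check);

binmode STDOUT, ':encoding(UTF-8)';
printf "as UTF-8:   %s (%d characters)\n", $as_utf8,   length $as_utf8;
printf "as Latin-1: %s (%d characters)\n", $as_latin1, length $as_latin1;
```

Both decodes succeed: the UTF-8 reading has 8 characters and the Latin-1 reading has 10, one per input byte. Only a human reader can say which of the two is the intended word.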
A2: encoding guess: utf8; result: 1 = educação
Latin-1 is no longer an option (it is not part of the defaults), so the result is utf8.
B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educação
In B2, with the hint, we are back in the scenario above, and Guess cannot decide.
For C2:
C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação
This makes sense, since Latin-1 is not part of the defaults and these bytes are not valid UTF-8. Finally, in D2
D2: encoding guess: iso-8859-1; result: 2 = educação
Latin-1 is hinted, so the encoding is detected.
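The four cases with plain use Encode::Guess; can be reproduced in a few lines. This is only a sketch: guess_name is a hypothetical helper written for this illustration, not part of Encode::Guess, and the byte strings omit the "1 = " and "2 = " prefixes from the question.

```perl
use strict;
use warnings;
use Encode::Guess;    # default suspects only, so Latin-1 is not among them

# Hypothetical helper: the guessed encoding name on success,
# or "no guess (<message>)" when guess_encoding returns an error string.
sub guess_name {
    my ($bytes, @extra_suspects) = @_;
    my $enc = guess_encoding($bytes, @extra_suspects);
    return ref($enc) ? $enc->name : "no guess ($enc)";
}

my $utf8_bytes   = "educa\xc3\xa7\xc3\xa3o";    # valid as UTF-8 and as Latin-1
my $latin1_bytes = "educa\xe7\xe3o";            # not valid UTF-8

print "A2: ", guess_name($utf8_bytes),                 "\n";
print "B2: ", guess_name($utf8_bytes, 'iso-8859-1'),   "\n";
print "C2: ", guess_name($latin1_bytes),               "\n";
print "D2: ", guess_name($latin1_bytes, 'iso-8859-1'), "\n";
```

The suspects passed per call apply only to that call, which is why C2 still fails after B2 has named iso-8859-1. A hint can therefore only add candidates; it never tells Guess to prefer one of them.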