Discrepancy in the encoding of text/plain content returned by the Gmail API

Question

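I am trying to read multipart/mixed emails using the GMail API.
The goal is to correctly decode each text/plain part of a multipart/mixed email (there may be many, in different encodings) into a C# string (i.e. UTF-16):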

public static string DecodeTextPart(Google.Apis.Gmail.v1.Data.MessagePart part)
{
    var content_type_header = part.Headers.FirstOrDefault(h => string.Equals(h.Name, "content-type", StringComparison.OrdinalIgnoreCase));

    if (content_type_header == null)
        throw new ArgumentException("No content-type header found in the email part");

    var content_type = new System.Net.Mime.ContentType(content_type_header.Value);

    if (!string.Equals(content_type.MediaType, "text/plain", StringComparison.OrdinalIgnoreCase))
        throw new ArgumentException("The part is not text/plain");

    return Encoding.GetEncoding(content_type.CharSet).GetString(GetAttachmentBytes(part.Body));
}

GetAttachmentBytes returns the raw attachment bytes, with no conversion applied other than decoding the base64url encoding that GMail uses.
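The question does not show GetAttachmentBytes itself, so the following is only a minimal sketch of what such a base64url decoding step might look like. The class and method names (Base64Url, Decode) are hypothetical, and the sketch assumes the caller passes in the base64url text of the part body.

```csharp
using System;

public static class Base64Url
{
    // Hypothetical helper: decodes base64url text (RFC 4648, section 5) into raw bytes.
    public static byte[] Decode(string data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        // Map the URL-safe alphabet back to the standard base64 alphabet.
        string s = data.Replace('-', '+').Replace('_', '/');

        // base64url input may omit the trailing '=' padding; restore it.
        switch (s.Length % 4)
        {
            case 2: s += "=="; break;
            case 3: s += "="; break;
            case 1: throw new FormatException("Invalid base64url length");
        }

        return Convert.FromBase64String(s);
    }
}
```

The bytes returned here are still undecoded text; which character encoding to apply to them is exactly what the rest of the question is about.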

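I found that in many cases this produces invalid strings, because the raw bytes I get for the attachment content always seem to be UTF-8, even when the content-type of the same part declares otherwise.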

For example, given the email:

Date: ...
From: ...
Reply-To: ...
Message-ID: ...
To: ...
Subject: Test 1 text file
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="----------0E50FC0802A2FCCAA"

------------0E50FC0802A2FCCAA
Content-Type: text/plain; charset=windows-1251
Content-Transfer-Encoding: 8bit


Content test: Cyrillic, Windows-1251 (à, ÿ, æ)
------------0E50FC0802A2FCCAA
Content-Type: TEXT/PLAIN;
 name="Irrelevant.txt"
Content-transfer-encoding: base64
Content-Disposition: attachment;
 filename="Irrelevant.txt"

VGhpcyBmaWxlIGRvZXMgbm90IGNvbnRhaW4gdXNlZnVsIGluZm9ybWF0aW9u
------------0E50FC0802A2FCCAA--

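With this email, I successfully find the first part, and the code above determines, with the help of System.Net.Mime.ContentType, that it is charset=windows-1251. Then .GetString() returns garbage, because the raw bytes actually returned by GetAttachmentBytes are UTF-8, not Windows-1251.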

Exactly the same thing happens with:

Subject: Test 2 text file
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="----------0B716C1D8123D8710"

------------0B716C1D8123D8710
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 8bit


Content test: Cyrillic, koi-8 (Б, С, Ц)
------------0B716C1D8123D8710
Content-Type: TEXT/PLAIN;
 name="Irrelevant.txt"
Content-transfer-encoding: base64
Content-Disposition: attachment;
 filename="Irrelevant.txt"

VGhpcyBmaWxlIGRvZXMgbm90IGNvbnRhaW4gdXNlZnVsIGluZm9ybWF0aW9u
------------0B716C1D8123D8710--

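Note that the three test letters in parentheses after the encoding name are the same in both emails and look like (а, я, ж) in Unicode, but they (correctly) look wrong in the email body representations quoted above because of the different encodings.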
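To make the byte-level picture concrete, here is a small sketch. The byte values are hard-coded: E0 FF E6 is (а, я, ж) in Windows-1251, and D0 B0 D1 8F D0 B6 is the same three letters in UTF-8. The class and member names are hypothetical, and reading the first quoted email as ISO-8859-1 is an assumption about how that source listing was rendered.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // The letters а, я, ж as encoded in Windows-1251 (hard-coded for illustration).
    public static readonly byte[] Windows1251Bytes = { 0xE0, 0xFF, 0xE6 };

    // The same letters as encoded in UTF-8.
    public static readonly byte[] Utf8Bytes = { 0xD0, 0xB0, 0xD1, 0x8F, 0xD0, 0xB6 };

    // Interprets the bytes as ISO-8859-1, where every byte maps to one character.
    public static string AsLatin1(byte[] bytes) =>
        Encoding.GetEncoding("iso-8859-1").GetString(bytes);

    // Interprets the bytes as UTF-8.
    public static string AsUtf8(byte[] bytes) =>
        Encoding.UTF8.GetString(bytes);
}
```

Here AsLatin1(MojibakeDemo.Windows1251Bytes) yields "àÿæ", matching the first quoted email, while AsUtf8(MojibakeDemo.Utf8Bytes) yields "аяж".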

If I "fix" the function to always use Encoding.UTF8 instead of GetEncoding(content_type.CharSet), then it appears to work in the tests I have done so far.

At the same time, the GMail interface displays the letters correctly in both cases, so it must have correctly parsed the incoming emails using the correct declared encodings.

Is it the case that the GMail API re-encodes all text chunks into UTF-8 (wrapped in base64url), but reports the original charset for them?
Am I therefore supposed to always use UTF-8 with GMail API and disregard content-type's charset=?
Or is there a problem with my code?

Tags: c#, email, gmail-api, mime, content-encoding

Answer


According to these two resources:

The Value is indeed a base-64 encoded representation of the part converted to UTF-8.

This is however not documented by Google, as far as I can find.
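Since the behavior is undocumented, one cautious option is to decode the body bytes as UTF-8 with error detection turned on, so that a part which is not valid UTF-8 is reported instead of silently turning into replacement characters. The following is only a sketch under that assumption. GmailTextDecoding and TryDecodeUtf8 are hypothetical names, and bodyBytes stands for the bytes obtained after base64url-decoding the part body.

```csharp
using System;
using System.Text;

public static class GmailTextDecoding
{
    // Strict decoder: throws on invalid UTF-8 instead of substituting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: returns true and the decoded text when bodyBytes is
    // valid UTF-8, or false when the bytes cannot be UTF-8.
    public static bool TryDecodeUtf8(byte[] bodyBytes, out string text)
    {
        if (bodyBytes == null) throw new ArgumentNullException(nameof(bodyBytes));

        try
        {
            text = StrictUtf8.GetString(bodyBytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The bytes are not valid UTF-8, so the assumption does not hold for this part.
            text = null;
            return false;
        }
    }
}
```

A false result would indicate a part for which the "always UTF-8" observation fails, and the caller can then decide how to handle it, for example by logging the declared charset for investigation.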

