c# - Discrepancy in the encoding of text/plain content returned by the Gmail API
Question
I am trying to read multipart/mixed emails using the GMail API.
The goal is to correctly decode each text/plain part of a multipart/mixed email (there can be many, in different encodings) into a C# string (i.e. UTF-16):
using System;
using System.Linq;
using System.Text;

public static string DecodeTextPart(Google.Apis.Gmail.v1.Data.MessagePart part)
{
    // Header names are case-insensitive, so compare without regard to case.
    var content_type_header = part.Headers.FirstOrDefault(h => string.Equals(h.Name, "content-type", StringComparison.OrdinalIgnoreCase));
    if (content_type_header == null)
        throw new ArgumentException("No content-type header found in the email part");
    // Parse the header value to get the media type and the declared charset.
    var content_type = new System.Net.Mime.ContentType(content_type_header.Value);
    if (!string.Equals(content_type.MediaType, "text/plain", StringComparison.OrdinalIgnoreCase))
        throw new ArgumentException("The part is not text/plain");
    // Decode the raw bytes using the charset declared by the part.
    return Encoding.GetEncoding(content_type.CharSet).GetString(GetAttachmentBytes(part.Body));
}
GetAttachmentBytes returns the raw attachment bytes, with no conversion other than decoding them from the base64url encoding that GMail uses.
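The implementation of GetAttachmentBytes is not shown in the question. As a rough sketch of the base64url step alone, assuming the body data arrives as a base64url string without padding, a helper might look like the following. The class and method names (Base64UrlSketch, FromBase64Url) are hypothetical and are not part of the Gmail client library.

```csharp
using System;

public static class Base64UrlSketch
{
    // Hypothetical helper: converts base64url text into raw bytes.
    public static byte[] FromBase64Url(string data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        // Map the URL-safe alphabet ('-', '_') back to the standard one ('+', '/').
        string s = data.Replace('-', '+').Replace('_', '/');

        // Restore the '=' padding that base64url text commonly omits.
        switch (s.Length % 4)
        {
            case 2: s += "=="; break;
            case 3: s += "="; break;
            case 1: throw new FormatException("Invalid base64url length");
        }

        return Convert.FromBase64String(s);
    }
}
```

The result is only a byte array; which charset those bytes are in is a separate question, and it is the one this post is about.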
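I find that in many cases this produces invalid strings, because the raw bytes I get for the attachment content always seem to be in UTF-8, even when the content-type of the same part declares otherwise.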
For example, given this email:
Date: ...
From: ...
Reply-To: ...
Message-ID: ...
To: ...
Subject: Test 1 text file
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----------0E50FC0802A2FCCAA"
------------0E50FC0802A2FCCAA
Content-Type: text/plain; charset=windows-1251
Content-Transfer-Encoding: 8bit
Content test: Cyrillic, Windows-1251 (à, ÿ, æ)
------------0E50FC0802A2FCCAA
Content-Type: TEXT/PLAIN;
name="Irrelevant.txt"
Content-transfer-encoding: base64
Content-Disposition: attachment;
filename="Irrelevant.txt"
VGhpcyBmaWxlIGRvZXMgbm90IGNvbnRhaW4gdXNlZnVsIGluZm9ybWF0aW9u
------------0E50FC0802A2FCCAA--
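For this email, I successfully find the first part, the code above determines with the help of System.Net.Mime.ContentType that it is charset=windows-1251, and then .GetString() returns garbage, because the actual raw bytes returned by GetAttachmentBytes correspond to the UTF-8 encoding, not Windows-1251.
Exactly the same happens with: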
Subject: Test 2 text file
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----------0B716C1D8123D8710"
------------0B716C1D8123D8710
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 8bit
Content test: Cyrillic, koi-8 (Б, С, Ц)
------------0B716C1D8123D8710
Content-Type: TEXT/PLAIN;
name="Irrelevant.txt"
Content-transfer-encoding: base64
Content-Disposition: attachment;
filename="Irrelevant.txt"
VGhpcyBmaWxlIGRvZXMgbm90IGNvbnRhaW4gdXNlZnVsIGluZm9ybWF0aW9u
------------0B716C1D8123D8710--
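Note that the three test letters in parentheses after the encoding name are the same in both emails and look like (а, я, ж) in Unicode, but they (correctly) look wrong in the email body representations quoted above because of the different encodings.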
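The mismatch can be reproduced without the Gmail API. The sketch below encodes a string as UTF-8 and then decodes those bytes with a different declared charset, which is what the function above would be doing if the API really hands back UTF-8. It assumes .NET Core 3.0 or later, where legacy code pages must be registered through CodePagesEncodingProvider; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class CharsetMismatchSketch
{
    // Hypothetical helper: encodes `text` as UTF-8, then decodes the bytes
    // as if they were in the named charset (e.g. "windows-1251").
    public static string DecodeUtf8BytesAs(string text, string charsetName)
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages such as
        // windows-1251 and koi8-r are not available until this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(charsetName).GetString(utf8Bytes);
    }
}
```

Calling DecodeUtf8BytesAs("а, я, ж", "windows-1251") yields a string that differs from the input, because each Cyrillic letter occupies two bytes in UTF-8 and each byte is read as a separate Windows-1251 character. Pure ASCII text survives unchanged, which is why the problem only shows up with non-ASCII content.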
If I "fix" the function to always use Encoding.UTF8 instead of GetEncoding(content_type.CharSet), then it appears to work in the tests I have done so far.
At the same time, the GMail interface displays the letters correctly in both cases, so it must have correctly parsed the incoming emails using the correct declared encodings.
Is it the case that the GMail API re-encodes all text chunks into UTF-8 (wrapped in base64url), but reports the original charset for them?
Am I therefore supposed to always use UTF-8 with GMail API and disregard content-type's charset=?
Or is there a problem with my code?
Answer
According to these two resources:
- Stack Overflow: Gmail API decoding messages in Javascript
- GitHub: Google API Python Client: Invalid message body size
The Value is indeed a base-64 encoded representation of the part converted to UTF-8.
This is however not documented by Google, as far as I can find.
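Because the behaviour is undocumented, one defensive option is to avoid hard-coding either assumption: decode as strict UTF-8 first, and fall back to the charset declared by the part only if the bytes are not valid UTF-8. The following is a minimal sketch of that idea, not an official recommendation; the class and method names are hypothetical, and the caller is assumed to have already resolved the declared charset to an Encoding.

```csharp
using System;
using System.Text;

public static class TextPartDecodingSketch
{
    // Strict UTF-8 decoder: throws on invalid byte sequences instead of
    // silently substituting U+FFFD replacement characters.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: `raw` holds the base64url-decoded body bytes,
    // `declared` is the encoding named in the part's content-type header.
    public static string DecodeBody(byte[] raw, Encoding declared)
    {
        try
        {
            // Expected path if the API has already converted the part to UTF-8.
            return StrictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            // The bytes are not valid UTF-8, so trust the declared charset instead.
            return declared.GetString(raw);
        }
    }
}
```

The fallback is a heuristic: a short text in a legacy single-byte charset can occasionally happen to be valid UTF-8 as well, in which case this sketch would still decode it as UTF-8.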