delphi - Is there a way to get just the ANSI characters from a string? Utf8decode fails when string contains emojis

Problem description

First I get a TMemoryStream from an HTTP request, which contains the body of the response. Then I load it into a TStringList and save the text in a WideString (I also tried with an AnsiString).

The problem is that I need to convert the string because the user's language is Spanish, so vowels with accent marks are very common and I need to store the info.

lServerResponse := TStringList.Create;
lServerResponse.LoadFromStream(lResponseMemoryStream);

lStringResponse := lServerResponse.Text;
lDecodedResponse := Utf8Decode(lStringResponse );

If the response (a part of it) is "Hólá Múndó", lStringResponse value will be "HÃ³lÃ¡ MÃºndÃ³", and lDecodedResponse will be "Hólá Múndó".

But if the user adds any emoji (lStringResponse value will be "HÃ³lÃ¡ MÃºndÃ³ ðŸ˜€" if the emoji is 😀), Utf8Decode fails and returns an empty string. Is there a way to get just the ANSI characters from a string (or MemoryStream), or to remove whatever Utf8Decode can't convert?

Thanks for your time.

Tags: delphi, utf-8, delphi-7, emoji, ansi

Solution

TMemoryStream is just raw bytes. There is no reason to load that stream into a TStringList just to extract a (Wide|Ansi)String from it. You can assign the bytes directly to an AnsiString/UTF8String using SetString() instead, e.g.:

var
  lStringResponse: UTF8String;
  lDecodedResponse: WideString;
begin
  SetString(lStringResponse, PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
  lDecodedResponse := UTF8Decode(lStringResponse);
end;

Just make sure the HTTP content really is encoded as UTF-8, or else this approach will not work.

That being said, UTF8Decode() (and UTF8Encode()) in Delphi 7 DO NOT support Unicode codepoints above U+FFFF, which take 4 bytes in UTF-8. That means they DO NOT support most emojis, including the one in the question (U+1F600). That was fixed in Delphi 2009.
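
To illustrate why such codepoints need special handling: a WideString holds UTF-16, and UTF-16 has to store anything above U+FFFF as a pair of 16-bit surrogates. The following is only a minimal sketch of that arithmetic. HighSurrogate and LowSurrogate are hypothetical helper names, not RTL functions.

```pascal
// Hypothetical helpers (not part of the RTL). Both assume CodePoint is in
// the range $10000..$10FFFF; smaller values fit in a single WideChar.

// First (high) UTF-16 code unit of the surrogate pair.
function HighSurrogate(CodePoint: Cardinal): Word;
begin
  Result := Word($D800 or ((CodePoint - $10000) shr 10));
end;

// Second (low) UTF-16 code unit of the surrogate pair.
function LowSurrogate(CodePoint: Cardinal): Word;
begin
  Result := Word($DC00 or ((CodePoint - $10000) and $3FF));
end;
```

For U+1F600 this gives $D83D followed by $DE00, so the single emoji occupies two WideChars in the decoded string.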

To work around that issue in earlier versions, you can use the Win32 API MultiByteToWideChar() function instead, eg:

uses
  ..., Windows;

function My_UTF8Decode(const S: UTF8String): WideString;
var
  WLen: Integer;
begin
  // First call: ask how many WideChars the decoded text needs.
  WLen := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), nil, 0);
  if WLen > 0 then
  begin
    SetLength(Result, WLen);
    // Second call: decode into the allocated buffer.
    MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), PWideChar(Result), WLen);
  end else
    Result := ''; // empty input or conversion failure
end;

var
  lStringResponse: UTF8String;
  lDecodedResponse: WideString;
begin
  SetString(lStringResponse, PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
  lDecodedResponse := My_UTF8Decode(lStringResponse);
end;

Alternatively:

uses
  ..., Windows;

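function My_UTF8Decode(const S: PAnsiChar; const SLen: Integer): WideString;
var
  WLen: Integer;
begin
  // First call: ask how many WideChars the decoded text needs.
  WLen := MultiByteToWideChar(CP_UTF8, 0, S, SLen, nil, 0);
  if WLen > 0 then
  begin
    SetLength(Result, WLen);
    // Second call: decode straight from the stream's memory into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, S, SLen, PWideChar(Result), WLen);
  end else
    Result := ''; // empty input or conversion failure
end;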

var
  lDecodedResponse: WideString;
begin
  lDecodedResponse := My_UTF8Decode(PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
end;

Or, use a third-party Unicode conversion library, like ICU or libiconv, which handles this for you.
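
If simply dropping the emoji is acceptable, as the question originally asked, another option is to filter the 4-byte UTF-8 sequences out of the raw bytes before decoding. This is only a sketch under the assumption that the input is otherwise well-formed UTF-8. StripUtf8FourByteSequences is a hypothetical helper name, not an RTL function.

```pascal
// Hypothetical helper (not part of the RTL): returns S with every 4-byte
// UTF-8 sequence removed. Lead bytes of such sequences match 11110xxx.
function StripUtf8FourByteSequences(const S: AnsiString): AnsiString;
var
  I, J, K, N: Integer;
begin
  N := Length(S);
  SetLength(Result, N);
  I := 1;
  J := 0;
  while I <= N do
  begin
    if (Ord(S[I]) and $F8) = $F0 then
    begin
      // Skip the lead byte and up to three continuation bytes (10xxxxxx).
      Inc(I);
      K := 0;
      while (I <= N) and (K < 3) and ((Ord(S[I]) and $C0) = $80) do
      begin
        Inc(I);
        Inc(K);
      end;
    end
    else
    begin
      // Copy every other byte unchanged.
      Inc(J);
      Result[J] := S[I];
      Inc(I);
    end;
  end;
  SetLength(Result, J);
end;
```

The filtered bytes contain only 1- to 3-byte sequences, which Delphi 7 can decode, so the result can be passed to Utf8Decode(), e.g. lDecodedResponse := Utf8Decode(StripUtf8FourByteSequences(lStringResponse)). Note that this only removes codepoints above U+FFFF. Emoji-related characters inside the Basic Multilingual Plane, such as variation selectors, pass through unchanged.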
