Unable to extract Unicode symbols from a C++ std::string

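Problem description

I want to read a C++ std::string, pass it to a function that analyzes it, and extract both the Unicode symbols and the plain ASCII symbols from it.

I searched many tutorials online, but they all say that standard C++ does not fully support Unicode. Many of them recommend using ICU for C++.

Here is my C++ program for understanding the basics described above. It reads a raw string, converts it to an ICU UnicodeString, and prints it: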

#include <iostream>
#include <string>
#include "unicode/unistr.h"

int main()
{
    std::string s="Hello☺";
    // at this point s contains a line of text
    // which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
      ws += static_cast<wchar_t>(ucs[i]);

    std::wcout << ws << std::endl;
}

Expected output:

Hello☺

Actual output:

Hello?

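Please tell me what I am doing wrong. Suggestions for alternative or simpler approaches are also welcome.

Thanks

Update 1 (old): The working code is as follows: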

#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"

void f(const std::string & s)
{
  std::wcout << "Inside called function" << std::endl;
  constexpr char locale_name[] = "";
  setlocale( LC_ALL, locale_name );
  std::locale::global(std::locale(locale_name));
  std::ios_base::sync_with_stdio(false);
  std::wcin.imbue(std::locale());
  std::wcout.imbue(std::locale());

  // at this point s contains a line of text which may be ANSI or UTF-8 encoded

  // convert std::string to ICU's UnicodeString
  icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

  // convert UnicodeString to std::wstring
  std::wstring ws;
  for (int i = 0; i < ucs.length(); ++i)
    ws += static_cast<wchar_t>(ucs[i]);

  std::wcout << ws << std::endl;
}

int main()
{
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::wcout << "Inside main function" << std::endl;

    std::string s=u8"hello☺";
    // at this point s contains a line of text which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
      ws += static_cast<wchar_t>(ucs[i]);

    std::wcout << ws << std::endl;
    std::wcout << "--------------------------------" << std::endl;
    f(s);
    return 0;
}

Now the expected output and the actual output are the same, namely:

Inside main function
hello☺
--------------------------------
Inside called function
hello☺

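Update 2 (latest): The code in Update 1 does not work for symbols that need UTF-32, that is, code points above U+FFFF, which take two UTF-16 code units. Working code for all possible Unicode symbols is given below. Special thanks to @Botje for the solution. I wish I could give his solution more than one tick!!! :)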

#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"
#include "unicode/ustream.h"

void f(const std::u32string & s)
{
  std::wcout << "INSIDE CALLED FUNCTION:" << std::endl;

  icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
  std::cout << "Unicode string is: " << ustr << std::endl;

  std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;

  std::cout << "Individual characters of the string are:" << std::endl;
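  // char32At() takes a UTF-16 code unit offset, so advance by whole code points
  for(int32_t i=0; i < ustr.length(); i = ustr.moveIndex32(i, 1))
    std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;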

  std::cout << "--------------------------------" << std::endl;
}

int main()
{
    std::cout << "--------------------------------" << std::endl;
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::wcout << "INSIDE MAIN FUNCTION:" << std::endl;

    std::u32string s=U"hello☺";

    icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
    std::cout << "Unicode string is: " << ustr << std::endl;

    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;

    std::cout << "Individual characters of the string are:" << std::endl;
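    // char32At() takes a UTF-16 code unit offset, so advance by whole code points
    for(int32_t i=0; i < ustr.length(); i = ustr.moveIndex32(i, 1))
      std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;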

    std::cout << "--------------------------------" << std::endl;
    f(s);
    return 0;
}

Now the expected output and the actual output are the same, namely:

--------------------------------
INSIDE MAIN FUNCTION:
Unicode string is: hello☺
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺

--------------------------------
INSIDE CALLED FUNCTION:
Unicode string is: hello☺
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺

--------------------------------

Tags: c++, c++11, unicode, icu, icu4c

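Solution

There are a number of stumbling blocks to getting this right:

  • First, your file (and the smiley in it) should be encoded as UTF-8. The smiley should consist of the literal bytes 0xE2 0x98 0xBA.
  • You should mark the string as containing UTF-8 data by using the u8 prefix: u8"Hello☺"
  • Next, the documentation of icu::UnicodeString states that it stores Unicode as UTF-16. In this case you got lucky, because U+263A fits in a single UTF-16 code unit. Other emoji might not! You should either convert it to UTF-32, or be very careful and use the char32At function.
  • Finally, the encoding used by wcout should be configured with imbue to match the encoding your environment expects. See the answers to this question.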
