Converting the character encoding of a windows-1252 input file to a UTF-8 output file

Question

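I'm dealing with HTML documents that were converted (programmatically) to HTML through Word's save options. The resulting HTML text file is windows-1252 encoded. (Yes, I've read a lot about bytes and Unicode code points, and I know that code points above 128 can take 2, 3, up to 6 bytes, etc.) I added quite a few unprintable characters to my Word document template and wrote code to evaluate each CHARACTER (its decimal equivalent). Sure enough, I know I don't want to allow decimal #160, which is how MS Word translates a non-breaking space into HTML. I expect that in the near future people will put more of these "illegal" constructs into the templates, and I'll need to trap them and deal with them, because they render oddly in the browser. Here is a dump to the Eclipse console (I put all the document lines into a map):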

 DataObj.paragraphMap  : {1=, 2=Introduction and Learning Objective, 3=? ©®™§¶…‘’“”????, 4=, 5=, 6=, 
   7=This is paragraph 1 no formula, 8=, 

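I replaced decimal #160 with #32 (a regular space) and then wrote the characters to a new file using UTF-8 encoding. My thinking is that I can use this technique to replace specific characters, or decide not to write them back, based on their decimal equivalent. I wanted to avoid strings because I may process multiple documents and don't want to run out of memory, so I do it in files: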

 public static void convert1252toUFT8(String fileName) throws IOException {   
    File f = new File(fileName);
    Reader r = new BufferedReader(new InputStreamReader(new FileInputStream(f), "windows-1252"));
    OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(fileName + "x"), StandardCharsets.UTF_8); 
    List<Character> charsList = new ArrayList<>(); 
    int count = 0;

    try {
        int intch;
        while ((intch = r.read()) != -1) {   //reads a single character and returns integer equivalent
            int ch = (char)intch;
            //System.out.println("intch=" + intch + " ch=" + ch + " isValidCodePoint()=" + Character.isValidCodePoint(ch) 
            //+ " isDefined()=" + Character.isDefined(ch) + " charCount()=" + Character.charCount(ch) + " char=" 
            //+ (char)intch);

            if (Character.isValidCodePoint(ch)) {
                if (intch == 160 ) {
                    intch = 32;
                }
                charsList.add((char)intch);
                count++;
            } else {
                System.out.println("unexpected character found but not dealt with.");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        System.out.println("Chars read in=" + count + " Chars read out=" + charsList.size());
        for(Character item : charsList) {
            writer.write((char)item);
        }
        writer.close();
        r.close();
        charsList = null;

        //check that #160 was replaced
        //File f2 = new File(fileName + "x"); 
        //Reader r2 = new BufferedReader(new InputStreamReader(new FileInputStream(f2), "UTF-8")); 
        //int intch2;
        //while ((intch2 = r2.read()) != -1) { //reads a single character and returns integer equivalent 
        //int ch2 = (char)intch2; 
        //System.out.println("intch2=" + intch2 + " ch2=" + ch2 + " isValidCodePoint()=" +
        //Character.isValidCodePoint(ch2) + " char=" + (char)intch2); 
        //}

    }   
}

Tags: java, file, ms-word, character-encoding, character

Answer


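First of all, there is nothing wrong with an HTML page being in an encoding other than UTF-8. In fact, the document most likely contains a line like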

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

in its header, which renders the document invalid if you change the file's character encoding without adapting this header line.
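If the file is converted anyway, the declaration has to be rewritten along with the bytes. The following is only a minimal sketch of that idea, assuming the declaration sits on a single line and is spelled as in the example above; the class name CharsetDeclaration and the method toUtf8Declaration are hypothetical helpers, not part of any library.

```java
import java.util.regex.Pattern;

class CharsetDeclaration {
    // Matches "charset=windows-1252", optionally quoted, in any letter case.
    private static final Pattern DECLARED =
        Pattern.compile("(charset\\s*=\\s*[\"']?)windows-1252", Pattern.CASE_INSENSITIVE);

    /** Hypothetical helper: rewrites a windows-1252 declaration in one line to UTF-8. */
    static String toUtf8Declaration(String line) {
        return DECLARED.matcher(line).replaceAll("$1UTF-8");
    }

    public static void main(String[] args) {
        String meta =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">";
        // The declaration now names UTF-8; everything else on the line is unchanged.
        System.out.println(toUtf8Declaration(meta));
    }
}
```

Lines that do not contain the declaration pass through unchanged, so such a helper could be applied to every line of a line-based copy loop.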

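Further, there is no reason to replace codepoint #160 in the document, as it is Unicode's standard non-breaking space character. That is why &#160; is a valid alternative to &nbsp;, and if the document's charset supports this codepoint, using it directly is valid too.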
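To make this concrete, here is a small sketch showing that the non-breaking space is an ordinary character in both encodings: a single byte 0xA0 in windows-1252 and the two bytes 0xC2 0xA0 in UTF-8. The class name NonBreakingSpace and the hex helper are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NonBreakingSpace {
    /** Formats bytes as upper-case hex, two digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X", b & 0xFF));
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String nbsp = "\u00A0"; // the non-breaking space, code point 160

        System.out.println(hex(nbsp.getBytes(cp1252)));                 // A0
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_8))); // C2A0

        // Decoding the windows-1252 byte yields the same code point again.
        String decoded = new String(new byte[] {(byte) 0xA0}, cp1252);
        System.out.println((int) decoded.charAt(0));                    // 160
    }
}
```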

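Your attempt to avoid strings is a typical case of premature optimization. Without actual measurement, you ended up with a solution, ArrayList<Character>, that consumes twice¹ the memory of a String.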

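If you want to copy or convert a file, you shouldn't keep the entire file in memory. Just write the data back before reading the next chunk, but for efficiency use a buffer rather than reading and writing a single character at a time. Also, you should use the try-with-resources statement to manage the input and output resources.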

public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(BufferedReader br = Files.newBufferedReader(in, Charset.forName("windows-1252"));
        BufferedWriter bw = Files.newBufferedWriter(out, // default UTF-8
            StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}

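There is no point in testing the validity of characters, as the decoder never produces invalid codepoints. The decoder used above is configured by default to throw an exception on invalid bytes. The other options are to replace invalid input with a replacement character (like �) or to skip it, but in no case will it produce invalid characters.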
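The three behaviours can be tried out directly on a CharsetDecoder. The sketch below is an illustration only: it decodes UTF-8 rather than windows-1252, simply because the byte 0xFF is unambiguously invalid in UTF-8, whereas which bytes a windows-1252 decoder rejects depends on the implementation's mapping table. The class name DecoderActions and its decode helper are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecoderActions {
    /** Decodes the bytes as UTF-8, applying the given action to invalid input. */
    static String decode(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return dec.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] input = {0x41, (byte) 0xFF, 0x42}; // 'A', an invalid byte, 'B'

        // REPLACE substitutes U+FFFD for the bad byte, so three chars remain.
        System.out.println(decode(input, CodingErrorAction.REPLACE).length()); // 3
        // IGNORE drops the bad byte silently.
        System.out.println(decode(input, CodingErrorAction.IGNORE));           // AB
        // REPORT refuses the input with an exception.
        try {
            decode(input, CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

In all three cases the result, if there is one, consists only of valid characters, which is why a separate Character.isValidCodePoint check adds nothing.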

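The amount of memory needed during the operation is determined by the buffer size, although the code above uses a reader and a writer that each have their own buffer. Still, the total amount of memory used for the operation is independent of the file size.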

A solution that uses only the buffer you specify explicitly would look like

public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(Reader br = Channels.newReader(Files.newByteChannel(in), "windows-1252");
        Writer bw = Channels.newWriter(
            Files.newByteChannel(out, WRITE, CREATE, TRUNCATE_EXISTING),
            StandardCharsets.UTF_8)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}

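This is also the starting point for handling invalid input differently. For example, to remove all invalid input bytes, you only have to change the beginning of the method to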

public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    CharsetDecoder dec = Charset.forName("windows-1252")
            .newDecoder().onUnmappableCharacter(CodingErrorAction.IGNORE);
    try(Reader br = Channels.newReader(Files.newByteChannel(in), dec, -1);
…

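Note that for a successful conversion, the numbers of characters read and written are the same, but only for the input encoding Windows-1252 is the number of characters identical to the number of bytes, i.e. the file size (when the entire file is valid).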
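A short sketch of that relationship, using a sample string chosen for this illustration (the class name CharVersusByteCount and the byteCount helper are likewise hypothetical): the character count matches the windows-1252 byte count, while the UTF-8 output is larger because every character above 127 takes more than one byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharVersusByteCount {
    /** Number of bytes the text occupies in the given encoding. */
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // c, a, f, e-acute, non-breaking space, copyright sign
        String text = "caf\u00E9\u00A0\u00A9";

        System.out.println(text.length());                                // 6
        System.out.println(byteCount(text, Charset.forName("windows-1252"))); // 6
        System.out.println(byteCount(text, StandardCharsets.UTF_8));      // 9
    }
}
```

So the "x" output file is expected to be at least as large as the input, even though the character counts printed by the conversion method agree.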

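This conversion code example is only for completeness. As said at the beginning, converting an HTML page without adapting the header may invalidate the file, and the conversion isn't even necessary.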

¹ Depending on the implementation, even four times.

