Unicode string prints incorrectly

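Problem description

I am using PyPDF2 to read a PDF file, but what I get back is a unicode string that I cannot display.

I don't know what the encoding is, so I dumped the first 8 characters as hex: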

0000  005b 00d7 00c1 00e8 00d4 00c5 00d5        [......

What do these bytes mean? Is this utf-16be/le?

I tried the code below, but the output is wrong:

print outStr.encode('utf-16be').decode('utf-16')
嬀휀섀퐀씀픀

If I print the string directly, Python raises an error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-7: ordinal not in range(128)

I am following a guide on how to extract text from a PDF in Python.

The relevant code is as follows:

import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

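# Lookup table: characters with a one-character repr map to themselves, all others to '.'.
FILTER = ''.join([(len(repr(chr(x))) == 3) and chr(x) or '.' for x in range(256)])
def dumpUnicodeString(src, length=8):
    """Return a hex dump of a unicode string, `length` code points per row."""
    result = []
    for i in xrange(0, len(src), length):
        unichars = src[i:i+length]
        hex = ' '.join(["%04x" % ord(x) for x in unichars])
        printable = ''.join(["%s" % ((ord(x) <= 127 and FILTER[ord(x)]) or '.') for x in unichars])
        # The offset column counts 2 bytes per code point, hence i*2.
        result.append("%04x  %-*s  %s\n" % (i*2, length*5, hex, printable))
    return ''.join(result)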

def extractPdfText(filePath=''):
    text = ''
    # Close the file once every page has been read.
    with open(filePath, 'rb') as fileObject:
        pdfFileReader = PyPDF2.PdfFileReader(fileObject)
        for currentPageNumber in range(pdfFileReader.numPages):
            pdfPage = pdfFileReader.getPage(currentPageNumber)
            text = text + pdfPage.extractText()

    # Fall back to OCR when PyPDF2 finds no text.
    if text == '':
        text = textract.process(filePath, method='tesseract', encoding='utf-8')
    return text

if __name__ == '__main__': 
    pdfFilePath = 'a.pdf'
    pdfText = extractPdfText(pdfFilePath)
    #pdfText = pdfText[:7]
    print dumpUnicodeString(pdfText)
    print pdfText 

Tags: python, python-2.7, unicode

Solution
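
A possible reading of the dump, offered as a sketch rather than a confirmed diagnosis: every code point is below 0x100, so the string does not look like UTF-16 text. The leading `00` in each group comes from the `%04x` format in `dumpUnicodeString`, not from the data. More likely, each character is a single byte of some legacy encoding that was promoted to a code point unchanged. Encoding with `latin-1` recovers those bytes, which can then be decoded with candidate codecs. The candidate list and the helper name `try_decodings` are assumptions for illustration; the actual encoding depends on the fonts embedded in the PDF. The last statement also shows why the UTF-16 round trip printed unrelated characters: the bytes were written big-endian and read back little-endian, which is what `decode('utf-16')` does on a little-endian machine when no BOM is present. The code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Code points copied from the hex dump in the question.
sample = u'\u005b\u00d7\u00c1\u00e8\u00d4\u00c5\u00d5'

# Latin-1 maps code points 0-255 to the same byte values, so this
# recovers the byte string the characters most likely came from.
raw = sample.encode('latin-1')

# Hypothetical candidate list; the real encoding depends on the PDF.
CANDIDATES = ['gbk', 'big5', 'cp1251', 'cp874']


def try_decodings(data, candidates=CANDIDATES):
    """Return {codec: decoded text, or None if decoding failed}."""
    results = {}
    for codec in candidates:
        try:
            results[codec] = data.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


# Reproduces the garbled output from the question: big-endian bytes read
# back as little-endian swap the two bytes of every code unit.
swapped = sample.encode('utf-16-be').decode('utf-16-le')
```

Inspecting the result of `try_decodings(raw)` and checking which entry reads as sensible text indicates the probable source encoding. If none does, the PDF may use a custom font encoding that PyPDF2 cannot map back to Unicode.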

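The `UnicodeEncodeError` is a separate issue. In Python 2, `print` encodes a `unicode` object with the encoding of `sys.stdout`, which falls back to ASCII when that encoding cannot be determined (for example, when output is piped). A minimal sketch of one workaround is to encode explicitly before writing. The helper names `to_output_bytes` and `safe_print` are hypothetical, and UTF-8 is an assumed target encoding.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys


def to_output_bytes(text, encoding='utf-8'):
    """Encode text explicitly; unencodable characters become '?'."""
    return text.encode(encoding, 'replace')


def safe_print(text, encoding='utf-8'):
    """Write text to stdout as bytes in the given encoding (hypothetical helper)."""
    data = to_output_bytes(text, encoding) + b'\n'
    # Python 3 exposes the byte stream as sys.stdout.buffer;
    # Python 2 accepts byte strings on sys.stdout directly.
    stream = getattr(sys.stdout, 'buffer', sys.stdout)
    stream.write(data)
```

In the question's script, `safe_print(pdfText)` would take the place of `print pdfText`. Setting the environment variable `PYTHONIOENCODING=utf-8` before running the script is another common option.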
