Python - First word not included in search

Problem description

Why is the first word printed but not found when I search the dictionary (my_dic)?

Can anyone explain the cause and how to get the first word included?

Here is my code:

my_dic = {
"a":"1", 
"b":"2", 
"c":"3", 
"d":"4", 
"e":"5", 
}

with open('c:\\english_text_file.txt', encoding='utf8') as file:
    for line in file:
        for word in line.split():
            print('word from line.split: ', word)
            if word in my_dic.keys():
                print('word from if word in ...', word)

The content of the test text file is:

a b c d e

The output is:

word from line.split:  a
word from line.split:  b
word from if word in ... b
word from line.split:  c
word from if word in ... c
word from line.split:  d
word from if word in ... d
word from line.split:  e
word from if word in ... e

Tags: python, python-3.6

Solution


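This is caused by how some Windows programs save txt files: they add a BOM to the beginning of the file. When the file is opened with encoding 'utf8', the BOM is decoded as the invisible character U+FEFF and stays attached to the first word. The first word is therefore '\ufeffa': it prints like a, but it is not equal to the key "a", so the lookup fails.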

What is a BOM

BOM stands for byte order mark. Its value depends on the encoding:

Byte order mark    Encoding
EF BB BF           UTF-8
FF FE              UTF-16, little-endian
FE FF              UTF-16, big-endian
FF FE 00 00        UTF-32, little-endian
00 00 FE FF        UTF-32, big-endian
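
The values in the table can be checked against the constants in Python's standard `codecs` module. The following is a minimal sketch; the `BOMS` dictionary and the `bom_hex` helper are names made up for this illustration.

```python
import codecs

# BOM constants from the standard library, keyed by a descriptive label.
BOMS = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16 LE": codecs.BOM_UTF16_LE,
    "UTF-16 BE": codecs.BOM_UTF16_BE,
    "UTF-32 LE": codecs.BOM_UTF32_LE,
    "UTF-32 BE": codecs.BOM_UTF32_BE,
}


def bom_hex(bom: bytes) -> str:
    """Format BOM bytes as space-separated uppercase hex (illustrative helper)."""
    return " ".join(f"{b:02X}" for b in bom)


for name, bom in BOMS.items():
    print(f"{bom_hex(bom):<12} {name}")
```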

Open your english_text_file.txt in any hex editor and you will see that its content is:

efbb bf61 2062 2063 2064 2065 0d0a

Here, efbb bf is the BOM, and 61 2062 2063 2064 2065 0d0a is the ASCII encoding of a b c d e\r\n.
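
To see how these bytes affect the comparison, here is a small sketch that writes the same byte sequence to a temporary file and reads it back the way the question's code does. The file name `sample.txt` and the variable names are assumptions for illustration only.

```python
import os
import tempfile

# The same bytes as in the hex dump: UTF-8 BOM followed by "a b c d e\r\n".
raw = b"\xef\xbb\xbfa b c d e\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(raw)
    # Plain 'utf8' decodes the BOM as the character U+FEFF and keeps it.
    with open(path, encoding="utf8") as f:
        words = f.read().split()

first = words[0]
print(repr(first))   # repr() makes the hidden U+FEFF character visible
print(first == "a")  # the comparison that fails in the question
print(len(first))    # the first "word" is two characters long
```

Printing the word with `repr()` instead of printing it directly is a quick way to spot invisible characters like this one.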

So for a UTF-8 file, we need to check whether it starts with a BOM and, if it does, remove it.

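Below is some sample code for reference. If you don't mind changing the original file, you can overwrite the old input file directly; here I write a new file that has no BOM.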

import codecs

my_dic = {
    "a": "1",
    "b": "2",
    "c": "3",
    "d": "4",
    "e": "5",
}

# Read the raw bytes so the BOM can be detected before decoding.
with open('./english_text_file.txt', 'rb') as src:
    content = src.read()

# Drop the UTF-8 BOM if the file starts with one.
if content.startswith(codecs.BOM_UTF8):
    content = content[len(codecs.BOM_UTF8):]

# Write the bytes to a new file, leaving the original untouched.
with open('./changed_english_text_file.txt', 'wb') as dst:
    dst.write(content)

with open('./changed_english_text_file.txt', encoding='utf8') as file:
    for line in file:
        for word in line.split():
            print('word from line.split: ', word)
            if word in my_dic:
                print('word from if word in ...', word)

The output is:

word from line.split:  a
word from if word in ... a
word from line.split:  b
word from if word in ... b
word from line.split:  c
word from if word in ... c
word from line.split:  d
word from if word in ... d
word from line.split:  e
word from if word in ... e
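
As an alternative that is not part of the answer above, Python's built-in `utf-8-sig` codec skips a leading UTF-8 BOM when reading, so the file does not have to be rewritten. The sketch below assumes a file with the same bytes as the hex dump; the `matching_words` helper and the temporary file are made up for this illustration.

```python
import os
import tempfile

my_dic = {"a": "1", "b": "2", "c": "3", "d": "4", "e": "5"}


def matching_words(path):
    """Return the words in the file that are keys of my_dic (illustrative helper)."""
    found = []
    # 'utf-8-sig' skips a leading UTF-8 BOM if one is present and
    # otherwise decodes the file as ordinary UTF-8.
    with open(path, encoding="utf-8-sig") as file:
        for line in file:
            for word in line.split():
                if word in my_dic:
                    found.append(word)
    return found


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "english_text_file.txt")
    # Recreate the test file, BOM included.
    with open(path, "wb") as f:
        f.write(b"\xef\xbb\xbfa b c d e\r\n")
    result = matching_words(path)

print(result)
```

In the question's code, this would amount to changing `encoding = 'utf8'` to `encoding = 'utf-8-sig'` in the `open` call.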
