How do I resolve "list index out of range" when reading a large amount of data from a file?

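Problem description

I'm working on a classifier that needs to read 200,000 items from a dataset, but it only reads about 1,400 of them correctly before failing with "list index out of range".

How can I access every item in the dataset?

Here is the structure of the dataset.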

investing: can you profit in agricultural commodities?
bad weather is one factor behind soaring food prices. can you make hay with farm stocks? possibly: but be prepared to harvest gains on a moment's ...
http://rssfeeds.usatoday.com/~r/usatodaycommoney-topstories/~3/qbhb22sut9y/2011-05-19-can-you-make-gains-in-grains_n.htm
0
20 May 2011 15:13:57
ut
business

no tsunami but fifa's corruption storm rages on
though jack warner's threatened soccer "tsunami" remains stuck in the doldrums, the corruption storm raging around fifa shows no sign of abating after another extraordinary week for the game's governing body.
http://feeds.reuters.com/~r/reuters/sportsnews/~3/ffa6ftdsudg/us-soccer-fifa-idustre7563p620110607
1
07 Jun 2011 17:54:54
reuters
sport

critic's corner weekend: 'fringe' wraps third season
joshua jackson's show goes out with a bang. plus: amazing race nears the finish line.
http://rssfeeds.usatoday.com/~r/usatoday-lifetopstories/~3/duk9oew5auc/2011-05-05-critics-corner_n.htm
2
06 May 2011 23:36:21
ut
entertainment

Here is the code:

with open('news', 'r') as f:
    text = f.read()
    news = text.split("\n\n")
    count = {'sport': 0, 'world': 0, "us": 0, "business": 0, "health": 0, "entertainment": 0, "sci_tech": 0}
    for news_item in news:
        lines = news_item.split("\n")
        print(lines[6])
        file_to_write = open('data/' + lines[6] + '/' + str(count[lines[6]]) + '.txt', 'w+')
        count[lines[6]] = count[lines[6]] + 1
        file_to_write.write(news_item)  # python will convert \n to os.linesep
        file_to_write.close()

It produces the following output.


IndexError                                Traceback (most recent call last)
<ipython-input-1-d04a79ce68f6> in <module>
      5     for news_item in news:
      6         lines = news_item.split("\n")
----> 7         print(lines[6])
      8         file_to_write = open('data/' + lines[6] + '/' + str(count[lines[6]]) + '.txt', 'w+')
      9         count[lines[6]] = count[lines[6]] + 1

IndexError: list index out of range

Tags: python, file, text-classification, filewriter

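Solution


You are assuming that every block always has 7 or more lines. Perhaps your file ends with \n\n, or some of the blocks are malformed.

Just test the length and skip the block: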
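To see how a trailing blank line produces a short block, here is a minimal sketch. The sample string is made up for illustration and is not taken from the dataset.

```python
# Hypothetical two-line sample that ends with a blank line, as the real file might.
sample = "title\nbody\n\n"

blocks = sample.split("\n\n")
print(blocks)  # ['title\nbody', '']

# The last block is an empty string, which splits into a one-element list,
# so any index above 0 raises IndexError.
last_lines = blocks[-1].split("\n")
print(len(last_lines))  # 1
```

The same thing happens for any block in the middle of the file that is missing fields: it splits into fewer than 7 lines, so lines[6] does not exist.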

for news_item in news:
    lines = news_item.split("\n")
    if len(lines) < 7:
        continue
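
The length check above silently drops short blocks. If you also want to know how many blocks were dropped and what they looked like, something along these lines may help. This is only a sketch: the helper name split_valid_blocks, its min_lines parameter, and the sample text are assumptions made for this example.

```python
def split_valid_blocks(text, min_lines=7):
    """Return (valid, skipped): blocks with at least min_lines lines, and the rest."""
    valid, skipped = [], []
    for news_item in text.split("\n\n"):
        lines = news_item.split("\n")
        if len(lines) < min_lines:
            skipped.append(news_item)  # keep short blocks so they can be inspected
            continue
        valid.append(lines)
    return valid, skipped


# Made-up sample: one complete 7-line block, one truncated block, trailing blank line.
sample = "t\nd\nurl\n0\ndate\nut\nbusiness\n\nbroken\nblock\n\n"
valid, skipped = split_valid_blocks(sample)
print(len(valid), len(skipped))  # 1 2
print(valid[0][6])               # business
```

Looking at the skipped list should tell you whether the problem is just the end of the file or damaged records scattered through the data.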

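Note that you don't actually need to read the whole file into memory here; you can iterate over the file object and pull further lines from it as you go. Personally, I'd write a separate generator that picks a specific line out of each block: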

def block_line_at_n(fobj, n):
    while True:
        for i, line in enumerate(fobj):
            if line == "\n":
                # end of block, start a new block
                break
            if i == n:
                yield line
        else:
            # end of the file, exit
            return

with open('news', 'r') as f:
    for line in block_line_at_n(f, 6):
        # assumed loop body (missing in the original): show the category line
        print(line.strip())
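
The generator above yields only one line per block. If you need the whole item, for example to write it to a file as in the question, a variant that streams complete blocks could look like the sketch below. The name iter_blocks is hypothetical, and io.StringIO with made-up contents stands in for the real news file.

```python
import io


def iter_blocks(fobj):
    """Yield each blank-line-separated block as a list of lines without newlines."""
    block = []
    for line in fobj:
        line = line.rstrip("\n")
        if line:
            block.append(line)
        elif block:
            # blank line: the current block is complete
            yield block
            block = []
    if block:
        # last block when the file does not end with a blank line
        yield block


# io.StringIO stands in for the real file; the contents are made up.
fake_file = io.StringIO("t\nd\nurl\n0\ndate\nut\nbusiness\n\nbroken\nblock\n")
categories = [b[6] for b in iter_blocks(fake_file) if len(b) >= 7]
print(categories)  # ['business']
```

Only one block is held in memory at a time, and the length check still guards against short blocks before index 6 is used.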
