Classifying text into sections by keyword

Problem description

I have documents in a format like the one below, and I want to split them into sections with Python, for example:

Outline: 
1. Lorem Ipsum 
2. Lorem Ipsum 

Preface: 
This is sample generated words of the documents

The sections should be collected into an array, for example:

[Outline: 1. Lorem Ipsum 2. Lorem Ipsum, Preface: This is sample generated words of the documents ]

or stored in separate variables, for example:

outline = segment_by_word("outline")
preface = segment_by_word("preface")

print(preface )  #This is sample generated words of the documents  

Tags: python, python-3.x

Solution


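I'm assuming there are only two categories, Outline and Preface. The code below appends each line to that category's list as a tuple: the line number first, then the line's content.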

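lines_by_category = {'Outline': [], 'Preface': []}
category = None  # No heading seen yet

# Assuming you know how to get to the point of reading lines
for count, line in enumerate(lines):  # count is the 0-based line index
    # str.find() returns -1 (truthy) on a miss and 0 (falsy) on a match at
    # the start of the line, so test membership with `in` instead
    if 'Outline:' in line:
        category = 'Outline'
    elif 'Preface:' in line:
        category = 'Preface'
    if category is None:
        continue  # Skip anything that appears before the first heading
    # Updates the list stored in the dict, because both names point to the same object
    lines_by_category[category].append((count, line))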
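
To get closer to the output the question asks for (one string per section, addressable by name), the same idea can be wrapped in a function. This is only a minimal sketch under a few assumptions: `segment_by_keyword` is a hypothetical helper name, each heading is assumed to sit at the start of its own line and end with a colon, and a section's lines are joined with single spaces.

```python
def segment_by_keyword(text, keywords=("Outline", "Preface")):
    """Split text into sections keyed by the heading keyword that opens each one."""
    sections = {}
    current = None  # no heading seen yet
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: a heading is a line starting with "<keyword>:" (case-insensitive)
        heading = next(
            (k for k in keywords if stripped.lower().startswith(k.lower() + ":")),
            None,
        )
        if heading is not None:
            current = heading.lower()
            sections[current] = []
            # Keep any text that follows the colon on the heading line
            stripped = stripped[len(heading) + 1:].strip()
        if current is not None and stripped:
            sections[current].append(stripped)
    # Join each section's lines into a single string
    return {name: " ".join(parts) for name, parts in sections.items()}


document = """Outline:
1. Lorem Ipsum
2. Lorem Ipsum

Preface:
This is sample generated words of the documents
"""

sections = segment_by_keyword(document)
print(sections["preface"])  # This is sample generated words of the documents
print(list(sections.values()))  # the array form from the question
```

Returning a dict covers both forms requested: `sections["outline"]` and `sections["preface"]` stand in for the separate variables, and `list(sections.values())` gives the array. Lines that appear before the first heading are dropped, which may or may not be what you want.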
