Optimize my Python code to work with a large dataset

Problem description

I have two files. The first one (24 MB) contains keyword pairs with scores; here is a sample:

{'civil_right': [], 'finance': [('spending', 8.420475110400954)],'free': [('free_transport', 10.466459719664448), ('free_principles', 10.466459719664448), ('free_administration', 10.466459719664448), ('free_services', 10.466459719664448), ('salary', 10.466459719664448)]}

The second one is a 400 MB text file, basically raw text. Here is an example:

chapter 5 the allotment
definition of the subdivision November 2014 buffer 180435 June 2019
1 presentation. for common sense, the housing estate evokes suburban housing, that of the suburbs of big cities after the first war, but also, today, that of new districts in small towns and villages that push back the frontier of natural space. these are places where houses are juxtaposed which, without being identical, have "a family resemblance", as professor bouyssou put it. for the candidate builder, the allotment is first of all the possibility to acquire a serviced land in order to build a house adapted to his tastes and needs, of which he will be able to "choose the plans", subject, possibly, to the respect of a regulation coming to specify the provisions of the town-planning plan of the commune. it can be, later, the obligation to take part in the maintenance of the common equipments as a member of a trade-union association of owners and to respect certain rules of collective life specified in a schedule of conditions.

For each keyword pair in the first file, I want to find the longest sentence from the second file in which both words appear together. For example: the longest sentence containing (finance, spending), the longest containing (free, free_transport), (free, free_principles), (free, free_administration), (free, free_services), and (free, salary). I take the longest sentence because the two words may appear together in several sentences of my text, so I keep the longest one.
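
To make the rule concrete for a single pair, it comes down to: keep the sentences that contain both words (as substrings, as in my code below) and take the longest one. A minimal sketch on two made-up sentences:

sentences = [
    "free_transport is free in this city",
    "the free_transport budget is free of any salary constraint this year",
]
pair = ("free", "free_transport")

# keep only the sentences containing both words, then take the longest one
candidates = [s for s in sentences if all(w in s for w in pair)]
longest = max(candidates, key=len) if candidates else None
print(longest)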

Here is what I have:

import numpy as np
import pandas as pd
import re
import ast
import itertools
from tqdm import tqdm

#Reading data as list
hp = [line.strip() for line in open('text_file.txt', encoding='utf8')] 
#filter None elements from list
hp = list(filter(None,hp))
hp = [i.split('.') for i in hp]
hp = [i.strip() for l in hp for i in l]
keywords = [line for line in open('keywords_score_file.txt', encoding='utf8')]
keywords = ast.literal_eval(keywords[0])
word_pairs = []

for k,v in keywords.items():
    if v:
        word_pairs.append((k,v[0][0]))

word_pairs = list(set([tuple(sorted(i)) for i in word_pairs]))

df = pd.DataFrame(hp)
df.columns = ['text']

#The treatment
final_dict = {}

for wp in tqdm(word_pairs):
    #finding the length of sentence which contains the word pair else return 0
    df['length'] = df.apply(lambda row:len(row['text']) if all(map(row['text'].__contains__,wp)) else 0,axis=1)
    #check if we have sentence which contains word pair
    if len(df[df['length']>0]):
        #insert the word pair as key and the longest sentence as value in the final dictionary
        final_dict[wp] = df['text'].iloc[df[df['length']>0].length.idxmax()]
    #drop the length column created for above purpose
    df.drop(['length'],axis=1,inplace=True)

This works on a sample of my data, but I let it run on my whole text overnight and it only processed 1167 out of 20507 keyword pairs. How can I optimize it?

Tags: python, pandas

Solution


You can try this and see if it speeds things up:

In addition to your example, I added one sentence to file.txt to check that the script works:

chapter 5 the allotment
definition of the subdivision November 2014 buffer 180435 June 2019
1 presentation. for common sense, the housing estate evokes suburban housing, that of the suburbs of big cities after the first war, but also, today, that of new districts in small towns and villages that push back the frontier of natural space. these are places where houses are juxtaposed which, without being identical, have "a family resemblance", as professor bouyssou put it. for the candidate builder, the allotment is first of all the possibility to acquire a serviced land in order to build a house adapted to his tastes and needs, of which he will be able to "choose the plans", subject, possibly, to the respect of a regulation coming to specify the provisions of the town-planning plan of the commune. it can be, later, the obligation to take part in the maintenance of the common equipments as a member of a trade-union association of owners and to respect certain rules of collective life specified in a schedule of conditions.

This is finance, salary, free.

The script:

keywords = {'civil_right': [], 'finance': [('spending', 8.420475110400954)],'free': [('free_transport', 10.466459719664448), ('free_principles', 10.466459719664448), ('free_administration', 10.466459719664448), ('free_services', 10.466459719664448), ('salary', 10.466459719664448)]}

# generate all keyword-pairs
all_keywords = set()
word_pairs = []
for k, v in keywords.items():
    for vv in v:
        word_pairs.append((k, vv[0]))
        all_keywords.add(k)
        all_keywords.add(vv[0])

# load file
lines, lines_map = [], {}
with open('file.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        for l in map(str.strip, line.split('.')):
            if not l:
                continue
            lines.append(l)
            # for each keyword present in this sentence, record sentence index -> sentence length in lines_map:
            for k in all_keywords:
                if k in l:
                    lines_map.setdefault(k, {})[len(lines)-1] = len(l)


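# for each keyword pair, intersect the sets of sentence indexes where the two
# keywords occur, and keep the sentence with the greatest length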
for a, b in word_pairs:
    common_keys = lines_map.get(a, {}).keys() & lines_map.get(b, {}).keys()
    if common_keys:
        max_k = max(common_keys, key=lambda k: lines_map[a][k])
        print(a, b, max_k)
        print(lines[max_k])
        print('-' * 80)

Prints:

free salary 7
This is finance, salary, free
--------------------------------------------------------------------------------
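
The reason this version is faster: the text is scanned only once to build lines_map, which maps each keyword to the sentences it appears in (sentence index -> sentence length), so each keyword pair then becomes a cheap set intersection instead of another full pass over the text. Note that, like your original code, it uses substring matching, so 'free' would also match a word such as 'freedom'. If you actually want whole-word matches (an assumption on my part), the map can be built with word-boundary regexes instead; a minimal sketch, where build_lines_map is just a hypothetical helper name:

import re

# Hypothetical variant of the lines_map construction above that matches
# whole words only (so 'free' does not also match 'freedom').
# Assumes the keywords are plain word/underscore tokens.
def build_lines_map(lines, all_keywords):
    patterns = {k: re.compile(r'\b' + re.escape(k) + r'\b') for k in all_keywords}
    lines_map = {}
    for i, sentence in enumerate(lines):
        for k, pattern in patterns.items():
            if pattern.search(sentence):
                # sentence index -> sentence length, as in the script above
                lines_map.setdefault(k, {})[i] = len(sentence)
    return lines_map

The pair loop above can then use this lines_map unchanged.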
