How can I optimize this method for a dataframe?

Problem description

I want to optimize this method:

def get_sentiment_score(text):
    text = text.split()
    positive = 0
    negative = 0
    for word in text:
        if word in positive_words:
            positive += 1
        elif word in negative_words:
            negative += 1
    score = positive - negative
    if score == 0:
        return "UNCERTAIN"
    return "POSITIVE" if score > 0 else "NEGATIVE"

df["sentiment_polarity"] = df["text"].apply(lambda row: get_sentiment_score(str(row)))

The variables positive_words and negative_words are lists, each containing more than 2,000 elements, and the dataframe has 270K+ rows.

The total time it currently takes is over 1,000 seconds.

I would like to get it below 100 seconds.

Thanks in advance.

Tags: python, pandas, dataframe

Solution


I would use sets together with numpy and pandas string splitting.

import numpy as np
import pandas as pd


dataf = pd.DataFrame({"text":["hello world is over", "go big or go home"]})

# transform your lists into sets
positive_set = {"world", "over"}
negative_set = {"is", "go"}

# use set intersection and compare the sizes
def check(s: set) -> int:
    positive = len(s.intersection(positive_set))
    negative = len(s.intersection(negative_set))
    return positive - negative

# apply it using pandas string split
dataf["sentiment"] = dataf["text"].str.split().map(set).map(check)

# use numpy to tag
condition = [dataf["sentiment"].gt(0), dataf["sentiment"].lt(0), dataf["sentiment"].eq(0)]

choices = ["POSITIVE", "NEGATIVE", "UNCERTAIN"]

dataf["sentiment"] = np.select(condition, choices)
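To adapt this to the original question, the 2,000-element lists only need to be converted to sets once, up front; after that, each per-row intersection costs roughly O(number of words in the row) instead of scanning the whole list for every word. A minimal end-to-end sketch, with small stand-in lists in place of the original positive_words and negative_words:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the original 2,000+ element lists
positive_words = ["world", "over", "good"]
negative_words = ["is", "go", "bad"]

# Convert once, outside the per-row function: set lookups are O(1),
# while `word in some_list` scans the whole list each time.
positive_set = set(positive_words)
negative_set = set(negative_words)

def check(s: set) -> int:
    # net sentiment = positive hits minus negative hits
    return len(s & positive_set) - len(s & negative_set)

df = pd.DataFrame({"text": ["hello world is over", "go big or go home"]})
scores = df["text"].str.split().map(set).map(check)

# tag in one vectorized pass; rows matching neither condition
# fall through to the default
df["sentiment_polarity"] = np.select(
    [scores.gt(0), scores.lt(0)],
    ["POSITIVE", "NEGATIVE"],
    default="UNCERTAIN",
)
print(df["sentiment_polarity"].tolist())  # ['POSITIVE', 'NEGATIVE']
```

Using `default="UNCERTAIN"` in np.select replaces the explicit `.eq(0)` condition; both forms are equivalent here.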

