"KeyError: True" when matching against a Pandas DataFrame

Problem description

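I want to set up a simple script that checks whether words from a wordlist can be found in the Pandas DataFrame common_words. If there is a match, I would like to return the corresponding DataFrame entry. The DataFrame has the format life balance 14, long term 9, upper management 9, showing each word token and its number of occurrences.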

However, the code below currently only prints KeyError: True for the line print('Group 1:', df[df[i].loc[df[i].str.contains(x).any()]]). Does anyone know a clean way to return the matching DataFrame entries instead of the error?

The relevant part of the code is:

    # Check for matches between wordlist and Pandas dataframe
    def wordcheck():
        wordlist = ["work balance", "good management", "work life"]
        for x in wordlist:
            if df[i].str.contains(x).any():
                print('Group 1:', df[df[i].loc[df[i].str.contains(x).any()]])
    wordcheck()

The complete code is shown below:

import string
import json
import csv
import pandas as pd
from textblob import TextBlob

from sklearn.feature_extraction.text import CountVectorizer
import cufflinks as cf

import re
from typing import Iterable

# Loading and normalising the input file
file = open("glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)


# Datetime conversion
df['Date'] = pd.to_datetime(df['Date'])
# Adding of 'Quarter' column
df['Quarter'] = df['Date'].dt.to_period('Q')


# Word frequency analysis
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]


# Analysis loops through different qualitative sections
for i in ['Text_Pro','Text_Con','Text_Main']:
    common_words = get_top_n_bigram(df[i], 500)
    for word, freq in common_words:
        print(word, freq)


    # Check for matches between wordlist and Pandas dataframe
    def wordcheck():
        wordlist = ["work balance", "good management", "work life"]
        for x in wordlist:
            if df[i].str.contains(x).any():
                print('Group 1:', df[df[i].loc[df[i].str.contains(x).any()]])
    wordcheck()

As requested, here is an excerpt of the JSON file:

[
  {
    "No": "121",
    "Stock_Symbol": "A",
    "Date": "5/11/2017",
    "Author_Job_Title": "Current Employee - QA Chemist",
    "Author_Location": "Santa Clara, CA",
    "Text_Main": "I have been working at Agilent Technologies full-time (More than 3 years)",
    "Text_Pro": "Agilent prides itself on its emphasis of a great work/life balance. This is true. The general culture is one that values family time and allows you to more or less set your own schedule as long as it enables you and your team to work efficiently. If you need to cut out early because your kid is sick, that's fine. I like that nobody gives those with children a hard time. I myself don't have kids, but if I did I would appreciate the level of agency that this culture gives parents. Additionally, as a full time employee you start with 4 weeks of vacation. If you are already established in the valley, this is a great place to enjoy a stable work/life balance.",
    "Text_Con": "If you don't already have a home in Silicon Valley, you probably won't be able to afford to work here. This negates the great work-life balance, because if you can't afford to live... there's nothing to balance.\\nThe pay for Silicon valley is incredibly low. The Santa Clara Site of Agilent is on the same street as the new Apple Complex (The Spaceship). This makes it is incredibly expensive to live and work in this area. Agilent is a scientific hardware and software company, and even though they're operating in the tech capital of the world, they don't pay competitively. On average, for identical roles in the valley, Agilent pays 20% less. This is especially negative for entry-level employees who cannot and will never be able to afford a home in the valley. I've worked with many scientists here in 4 years and had to watch almost every non-home owner go on to a different company. Some of them left because they had inexperienced managers and low upward mobility, but for most that I keep in contact with, it really came down to low pay for a high-complexity position in a competitive field.",
    "Text_Advice_Mgmt": "Employees make a company.\\nThe highest cost comes from time lost due to turnover.\\nIf your people are good, work hard to keep them. Pay competitively.",
    "Rating_Recommend": "2",
    "Rating_Outlook": "2",
    "Rating_CEO": "2",
    "Scr_Avg": "4.0",
    "Scr_Balance": "5.0",
    "Scr_Values": "5.0",
    "Scr_Opportunities": "4.0",
    "Scr_Benefits": "2.0",
    "Scr_Management": "2.0"
  },
  {
    "No": "125",
    "Stock_Symbol": "A",
    "Date": "5/10/2017",
    "Author_Job_Title": "Current Employee - Anonymous Employee",
    "Author_Location": "Santa Clara, CA",
    "Text_Main": "I have been working at Agilent Technologies (Less than a year)",
    "Text_Pro": "All thinks are good and perfect.",
    "Text_Con": "There is only Manager monopoly. Manager can do anything easily and HR does not involve.",
    "Text_Advice_Mgmt": "HR involvement",
    "Rating_Recommend": "2",
    "Rating_Outlook": "2",
    "Rating_CEO": "2",
    "Scr_Avg": "4.0",
    "Scr_Balance": "5.0",
    "Scr_Values": "5.0",
    "Scr_Opportunities": "4.0",
    "Scr_Benefits": "4.0",
    "Scr_Management": "4.0"
  },
  {
    "No": "126",
    "Stock_Symbol": "A",
    "Date": "5/1/2017",
    "Author_Job_Title": "Current Employee - Computational Biologist",
    "Author_Location": "Santa Clara, CA",
    "Text_Main": "I have been working at Agilent Technologies full-time (More than 3 years)",
    "Text_Pro": "- Grate talented people\\n- Clear mission, powerful vision and passion for the customer\\n- Co-workers and managers really care about your well-being",
    "Text_Con": "- Sometime are resources and project scope not in sync\\n- Politics do occasionally take presidency over data driven decisions\\n- Poor career opportunities",
    "Text_Advice_Mgmt": "- A lateral promotion is also a promotion that might bring more career opportunities",
    "Rating_Recommend": "2",
    "Rating_Outlook": "2",
    "Rating_CEO": "1",
    "Scr_Avg": "4.0",
    "Scr_Balance": "4.0",
    "Scr_Values": "4.0",
    "Scr_Opportunities": "2.0",
    "Scr_Benefits": "4.0",
    "Scr_Management": "4.0"
  }
]

Tags: python, pandas, dataframe, nlp

Solution


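I think it helps to break the code into pieces. In the original line, df[i].str.contains(x).any() collapses the whole column to a single boolean, so .loc[True] looks for a row labelled True and raises KeyError: True. Keeping the element-wise result of str.contains as a boolean mask avoids that. If I have understood the code correctly, this should work: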

filter_logic = df[i].str.contains(x)

df[filter_logic][i]
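To show how the mask fits back into the original loop, here is a minimal sketch. The three-row DataFrame is made-up stand-in data, and the wordcheck(frame, column, words) signature is an assumption for illustration, not the asker's actual function. It also assumes the phrases should be matched literally and case-insensitively, which the question does not state.

```python
import pandas as pd

# Hypothetical miniature stand-in for the review data (not the real file)
df = pd.DataFrame({
    "Text_Pro": [
        "Great work life balance and flexible hours",
        "All things are good",
        "Good management and talented people",
    ]
})
wordlist = ["work balance", "good management", "work life"]


def wordcheck(frame, column, words):
    """Return {phrase: matching rows of `column`} for phrases found at least once."""
    matches = {}
    for phrase in words:
        # Element-wise mask with one True/False per row.
        # regex=False matches the phrase literally; na=False treats
        # missing text as "no match" instead of propagating NaN.
        mask = frame[column].str.contains(phrase, case=False, regex=False, na=False)
        # .any() is only used for the yes/no test, never as an indexer
        if mask.any():
            matches[phrase] = frame.loc[mask, column]
    return matches


result = wordcheck(df, "Text_Pro", wordlist)
for phrase, rows in result.items():
    print("Group 1:", phrase)
    print(rows)
```

Passing the column name as an argument, rather than reading the loop variable i from the enclosing scope, keeps the function usable outside the for loop. Whether that suits the original script is a judgment call.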
