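python - Dictionary of dictionaries to DataFrame
Problem description
I have a function that counts the co-occurrences between center words and context words in a review.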
import nltk
import pandas as pd
from collections import Counter

# cent_vocab (center words) and cont_vocab (context words) are assumed to be defined elsewhere

def get_coocs(x):
    occurdict = {}
    # Pre-processing
    tokens = nltk.word_tokenize(x)
    tokenslower = list(map(str.lower, tokens))
    # Save all the nouns in each review
    allnouns = [word for word in tokenslower if word in cent_vocab]
    # Save all the verbs/adjectives in each review
    allverbs_adj = Counter(word for word in tokenslower if word in cont_vocab)
    # Creating a dictionary of dictionaries
    for noun in allnouns:
        occurdict[noun] = dict(allverbs_adj)
    return occurdict

coocs = df['comments'].apply(lambda x: get_coocs(x))
My dictionary looks like this:
{'host': {'is': 3, 'most': 1, 'amazing': 1},
 'time': {'had': 1, 'such': 1, 'great': 1},
 'room': {'very': 2, 'professional': 1},
 'way': {'is': 3, 'recommended': 1, 'provided': 2}}
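However, when I try to convert it into a DataFrame with the nouns as columns, the verbs/adjectives as the index, and the corresponding co-occurrence counts as values, this is what I end up with: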
def cooc_dict2df(coocs):
    coocdf = pd.DataFrame.from_dict({i: coocs[i] for i in coocs.keys()}, orient='index')
    return coocdf
I have tried other solutions, but I still cannot get the result I want.
Solution
You can try this:
import pandas as pd

# Toy data
coocs = {
    "host": [
        {"is": 3, "most": 8, "amazing": 1},
        {"had": 5, "such": 7, "great": 9},
        {"very": 3, "recommended": 1, "provided": 2},
    ],
    "time": [
        {"is": 2, "most": 9, "amazing": 7},
        {"had": 6, "such": 6, "great": 8},
        {"very": 2, "recommended": 3, "provided": 4},
    ],
    "room": [
        {"is": 7, "most": 1, "amazing": 2},
        {"had": 7, "such": 5, "great": 8},
        {"very": 1, "recommended": 5, "provided": 4},
    ],
    "way": [
        {"is": 1, "most": 6, "amazing": 9},
        {"had": 8, "such": 4, "great": 9},
        {"very": 7, "recommended": 7, "provided": 1},
    ],
}
# Make a list of dataframes
dfs = [pd.DataFrame({colname: col}) for colname, cols in coocs.items() for col in cols]

# Merge dataframes
new_df = dfs[0]
for df in dfs[1:]:
    new_df = new_df.merge(df, how="outer", left_index=True, right_index=True)
new_df.fillna(0, inplace=True)

# Add identical columns
for name in coocs.keys():
    new_df[f"new_{name}"] = 0
    for col in new_df.columns:
        if col.startswith(name):
            new_df[f"new_{name}"] = new_df[f"new_{name}"] + new_df[col]

# Drop useless columns and rename the remaining ones
new_df = new_df.drop(columns=[col for col in new_df.columns if "new_" not in col])
new_df.columns = [col[4:] for col in new_df.columns]

print(new_df)
# Outputs
             host  time  room  way
amazing       1.0   7.0   2.0  9.0
great         9.0   8.0   8.0  9.0
had           5.0   6.0   7.0  8.0
is            3.0   2.0   7.0  1.0
most          8.0   9.0   1.0  6.0
provided      2.0   4.0   4.0  1.0
recommended   1.0   3.0   5.0  7.0
such          7.0   6.0   5.0  4.0
very          3.0   2.0   1.0  7.0
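
If each review already yields a plain dictionary of dictionaries (noun -> {context word: count}), as in the question, a shorter route may be enough. The sketch below relies on the fact that pd.DataFrame treats the outer keys of a nested dict as columns and the inner keys as the index, which is the orientation asked for (orient='index' in from_dict does the opposite). The sample data and the helper name combine_coocs are made up for illustration and are not part of the original code.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data: one dict of dicts per review,
# shaped like the return value of get_coocs in the question.
review_a = {
    "host": {"is": 3, "most": 1, "amazing": 1},
    "room": {"very": 2, "professional": 1},
}
review_b = {
    "host": {"is": 1, "great": 2},
    "way": {"is": 3, "recommended": 1},
}

# Single review: outer keys become columns, inner keys become the index.
# Pairs that never co-occur come out as NaN, so fill them with 0.
single = pd.DataFrame(review_a).fillna(0).astype(int)


def combine_coocs(reviews):
    """Sum the counts of several per-review dicts into one DataFrame.

    Hypothetical helper, not from the original post.
    """
    totals = {}
    for review in reviews:
        for noun, counts in review.items():
            # Counter.update adds counts instead of overwriting them
            totals.setdefault(noun, Counter()).update(counts)
    plain = {noun: dict(counts) for noun, counts in totals.items()}
    return pd.DataFrame(plain).fillna(0).astype(int)


combined = combine_coocs([review_a, review_b])
print(combined)
```

Because iterating over a pandas Series yields its values, the same helper should also accept the Series returned by df['comments'].apply(get_coocs), assuming every element is a dict shaped like the samples above.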