python - Double for loop over a Pandas df
Question
I have a dataframe with one column of subreddits and another column containing the authors who have commented in that subreddit. Here is a snapshot:
subreddit user
0xProject [7878ayush, Mr_Yukon_C, NomChompsky92, PM_ME_Y...
100sexiest [T10rock]
100yearsago [PM_ME_MII, Quisnam]
1022 [MikuWaifuForLaifu, ghrshow, johnnymn1]
1200isjerky [Rhiann0n, Throwaway412160987]
1200isplenty [18hourbruh, Bambi726, Cosmiicao, Gronky_Kongg...
1200isplentyketo [yanqi83]
12ozmouse [ChBass]
12thMan [8064r7, TxAg09, brb1515]
12winArenaLog [fnayr]
13ReasonsWhy [SawRub, _mw8, morbs4]
13or30 [BOTS_RISE_UP, mmcjjc]
14ers [BuccoFan8]
1500isplenty [nnowak]
15SecondStories [DANKY-CHAN, NORMIESDIE]
18650masterrace [Airazz]
18_19 [-888-, 3mb3r89, FuriousBiCurious, FusRohDoing...
1911 [EuphoricaI, Frankshungry, SpicyMagnum23, cnw4...
195 [RobDawg344, ooi_]
19KidsandCounting [Kmw134, Lvzv, mpr1011, runjanarun]
1P_LSD [420jazz, A1M8E7, A_FABULOUS_PLUM, BS_work, EL...
2007oneclan [J_D_I]
2007scape [-GrayMan-, -J-a-y-, -Maxy-, 07_Tank, 0ipopo, ...
2010sMusic [Vranak]
21savage [Uyghur1]
22lr [microphohn]
23andme [Nimushiru, Pinuzzo, Pugmas, Sav1025, TOK715, ...
240sx [I_am_a_Dan, SmackSmackk, jimmyjimmyjimmy_, pr...
24CarrotCraft [pikaras]
24hoursupport [GTALionKing, Hashi856, Moroax, SpankN, fuck_u...
...
youtubetv [ComLaw, P1_1310, kcamacho11]
yoyhammer [Emicrania, Jbugman, RoninXiC, Sprionk, jonow83]
ypp [Loxcam]
ypsi [FLoaf]
ytp [Profsano]
yugijerk [4sham, Exos_VII]
yugioh [1001puppys, 6000j, 8512332158, A_fiSHy_fish, ...
yumenikki [ripa9]
yuri [COMMENTS_ON_NSFW_PIC, MikuxLuka401, Pikushibu...
yuri_jp [Pikushibu]
yuruyuri [ACG_Yuri, KirinoNakano, OSPFv3, SarahLia]
zagreb [jocus985]
zcoin [Fugazi007]
zec [Corm, GSXP, JASH_DOADELESS_, PSYKO_Inc, infinis]
zedmains [BTZx2, EggyGG, Ryan_A121, ShacObama, Tryxi, m...
zelda [01110111011000010111, Aura64, AzaraAybara, BA...
zen [ASAMANNAMMEDNIGEL, Cranky_Kong, Dhammakayaram...
zerocarb [BigBrain007, Manga-san, vicinius]
zetime [xrnzrx]
zfs [Emachina, bqq100, fryfrog, michio_kakus_hair,...
ziftrCOIN [GT712]
zoemains [DrahaKka, OJSaucy, hahAAsuo, nysra, x3noPLEB,...
zombies [carbon107, rjksn]
zomby [jwccs46]
zootopia [BCRE8TVE, Bocaj1000, BunnyMakingAMark, Far414...
zumba [GabyArcoiris]
zyramains [Dragonasaur, Shaiaan]
zyzz [Xayv]
I'm trying to loop through each subreddit, and then loop through every subreddit after it, to find the shared commenters. The end goal is a dataframe containing subreddit 1, subreddit 2, and the number of shared commenters.
I can't even imagine how to do this with apply, and I'm not sure how to do a double for loop with a pandas df.
Is this the right idea?
for i in df2.index:
    subreddit = df2.get_value(i, 'subreddit')
    for i+1 in df2.index:
        ...
Here is an example of the input and the expected output:
df = pd.DataFrame({'subreddit': ['sub1', 'sub2', 'sub3', 'sub4'],
                   'user': [['A', 'B', 'C'], ['A', 'F', 'C'], ['F', 'E', 'D'], ['X', 'Y', 'Z']]})
Output for the first subreddit:
subreddit_1 subreddit_2 shared_users
sub1 sub2 2
sub1 sub3 0
sub1 sub4 0
Solution
I don't know if you can avoid loops entirely. This is very similar to how a correlation matrix is computed, which is done with loops in the pandas documentation. At least the result is symmetric, so you only need to compare half of the pairs.
Instead of a correlation, you want the number of elements shared between two lists `lst1` and `lst2`, which is `len(set(lst1) & set(lst2))`.
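As a quick illustration of that counting step, using the two sample user lists from the question's example input:

```python
# Counting shared users between two subreddits is just set intersection.
lst1 = ['A', 'B', 'C']   # users of sub1
lst2 = ['A', 'F', 'C']   # users of sub2
shared = len(set(lst1) & set(lst2))
print(shared)  # 2  ('A' and 'C' appear in both)
```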
import pandas as pd
import numpy as np

df = pd.DataFrame({'subreddit': ['sub1', 'sub2', 'sub3', 'sub4'],
                   'user': [['A', 'B', 'C'], ['A', 'F', 'C'], ['F', 'E', 'D'], ['X', 'Y', 'Z']]})

mat = df.user
cols = df.subreddit
idx = cols.copy()
K = len(cols)
correl = np.empty((K, K), dtype=int)
for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue
        c = len(set(ac) & set(bc))
        correl[i, j] = c
        correl[j, i] = c
overlap_df = pd.DataFrame(correl, index=idx, columns=cols)
#subreddit sub1 sub2 sub3 sub4
#subreddit
#sub1 3 2 0 0
#sub2 2 3 1 0
#sub3 0 1 3 0
#sub4 0 0 0 3
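If you only ever need the long-format pairs from the question's expected output (not the full matrix), an alternative sketch is to walk each unordered pair directly with itertools.combinations; the df here is the question's sample input, and user_sets, rows, and pairs are names introduced just for this example:

```python
import itertools
import pandas as pd

df = pd.DataFrame({'subreddit': ['sub1', 'sub2', 'sub3', 'sub4'],
                   'user': [['A', 'B', 'C'], ['A', 'F', 'C'],
                            ['F', 'E', 'D'], ['X', 'Y', 'Z']]})

# Convert each user list to a set once, then compare every unordered pair.
user_sets = dict(zip(df.subreddit, df.user.map(set)))
rows = [(a, b, len(user_sets[a] & user_sets[b]))
        for a, b in itertools.combinations(df.subreddit, 2)]
pairs = pd.DataFrame(rows, columns=['subreddit_1', 'subreddit_2', 'shared_users'])
# pairs holds one row per pair, e.g. (sub1, sub2, 2) and (sub2, sub3, 1).
```

This avoids computing the diagonal and the mirrored half of the matrix at the cost of losing the square-matrix view.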
If you want those smaller DataFrames, it just takes a bit of manipulation. For example:
overlap_df.index.name='subreddit_1'
overlap_df[['sub1']].stack().reset_index().rename(columns={0: 'shared_users'})
subreddit_1 subreddit shared_users
0 sub1 sub1 3
1 sub2 sub1 2
2 sub3 sub1 0
3 sub4 sub1 0
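The same stack/reset_index trick also turns the whole symmetric matrix into all pairs at once, not just the sub1 column. This sketch rebuilds the overlap matrix from the sample data so it runs standalone (the one-liner correl construction replaces the explicit double loop above, but produces the same values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'subreddit': ['sub1', 'sub2', 'sub3', 'sub4'],
                   'user': [['A', 'B', 'C'], ['A', 'F', 'C'],
                            ['F', 'E', 'D'], ['X', 'Y', 'Z']]})
sets = df.user.map(set)
correl = np.array([[len(a & b) for b in sets] for a in sets])
overlap_df = pd.DataFrame(correl,
                          index=df.subreddit.rename('subreddit_1'),
                          columns=df.subreddit.rename('subreddit_2'))

# Stack the full matrix into (subreddit_1, subreddit_2, shared_users) rows
# and drop the diagonal self-comparisons.
long_df = overlap_df.stack().reset_index(name='shared_users')
long_df = long_df[long_df.subreddit_1 != long_df.subreddit_2]
```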