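python - How to link two DataFrames by sentence length and filter the second DataFrame by the matching IDs
Problem description
I have two CSV files containing sentences, and I am trying to write a program that checks each sentence's length in words. If a sentence has more than 3 words, it should be added to another CSV, and the rows with the same IDs from the second CSV should be written to a new CSV as well. As far as I know, I need to use a mask for the second part, but it is not working for me. Here is what I am trying.
My code returns True and False instead of the sentences that are 3 or more words long.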
fdata = pd.read_csv(firstinput, names=['sentences'], skiprows=skip)
firstdata= fdata['sentences'].str.split().str.len().gt(3)
sdata = pd.read_csv(secondtinput, names=['sentences'], skiprows=skip)
seconddata=sdata[sdata.index.isin(firstdata.index)]
firstdata.to_csv("new_data.csv", index=False, header=False)
seconddata.to_csv("new_data2.csv", index=False, header=False)
----------------------
#first dataframe example
----------------------
#bye
#how are you
#I want to die
#I was home
#I went to sleep at work
#he have a bad reputation
#it was me who went to him
#have good sleep home
#hi you
#hi
----------------------
#second dataframe example
----------------------
#bye
#halaw kuy bashii
#damawe bmrm
#la malawa bum
#la
#aw kabraya bash nya
#awa mn bum chum bo lay
#xaweki xosh basar bba la malawa
# halaw you
#hi
----------------------
#first dataframe output
----------------------
#how are you
#I want to die
#I was home
#I went to sleep at work
#he have a bad reputation
#it was me who went to him
#have good sleep home
----------------------
#second dataframe output
----------------------
#halaw kuy bashii
#damawe bmrm
#la malawa bum
#la
#aw kabraya bash nya
#awa mn bum chum bo lay
#xaweki xosh basar bba la malawa
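Solution
I think this line of code is the problem, because the comparison returns a boolean Series rather than the matching rows: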
firstdata= fdata['sentences'].str.split().str.len().gt(3)
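As a minimal sketch of why that line yields True and False, the snippet below builds a small in-memory frame (the three sample sentences are taken from the question; using a hand-built DataFrame instead of `pd.read_csv` is an assumption made for illustration) and prints what the comparison returns.

```python
import pandas as pd

# Hypothetical in-memory stand-in for the first CSV file
fdata = pd.DataFrame({"sentences": ["bye", "how are you", "I want to die"]})

# Split on whitespace, count the words per row, then compare element-wise.
# The comparison produces one boolean per row, not the sentences themselves.
mask = fdata["sentences"].str.split().str.len().gt(3)
print(mask.tolist())  # [False, False, True]
```

Note that with `.gt(3)` the three-word sentence "how are you" is excluded, which does not match the expected output in the question.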
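Try this instead, which uses the boolean Series as a mask to select rows (`.gt(2)` keeps sentences with 3 or more words):
firstdata = fdata.loc[fdata['sentences'].str.split().str.len().gt(2)]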
firstdata
Output:
sentences
1 how are you
2 I want to die
3 I was home
4 I went to sleep at work
5 he have a bad reputation
6 it was me who went to him
7 have good sleep home
seconddata
Output:
sentences
1 halaw kuy bashii
2 damawe bmrm
3 la malawa bum
4 la esh nustm
5 aw kabraya bash nya
6 awa mn bum chum bo lay
7 xaweki xosh basar bba la malawa
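Putting the pieces together, here is a hedged end-to-end sketch. It assumes both files have the same number of rows in the same order, so that the default integer index acts as the shared ID, and it uses in-memory lists copied from the question's examples in place of the CSV files.

```python
import pandas as pd

# Sample rows from the question; in practice these would come from pd.read_csv
first = ["bye", "how are you", "I want to die", "I was home",
         "I went to sleep at work", "he have a bad reputation",
         "it was me who went to him", "have good sleep home", "hi you", "hi"]
second = ["bye", "halaw kuy bashii", "damawe bmrm", "la malawa bum", "la",
          "aw kabraya bash nya", "awa mn bum chum bo lay",
          "xaweki xosh basar bba la malawa", "halaw you", "hi"]

fdata = pd.DataFrame({"sentences": first})
sdata = pd.DataFrame({"sentences": second})

# Boolean mask: True where the sentence has 3 or more words
mask = fdata["sentences"].str.split().str.len().ge(3)

# Keep the matching rows of the first frame (the original index is preserved)
firstdata = fdata.loc[mask]

# Keep the rows of the second frame whose index appears in the filtered first frame
seconddata = sdata.loc[sdata.index.isin(firstdata.index)]

print(firstdata.index.tolist())  # [1, 2, 3, 4, 5, 6, 7]
print(len(seconddata))           # 7
```

The two `to_csv` calls from the question can then be applied to `firstdata` and `seconddata` unchanged. If the rows of the two files are not aligned, an explicit ID column would be needed instead of the positional index.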