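python - Matching strings within the same column of a large dataset?
Problem description
Why does my code return only the first letter of each row instead of the longest matching string? I am working with a large dataset that has 1 column and 15,500 rows.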
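import pandas as pd
from os.path import commonprefix

# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) replaces it
df = pd.read_csv('newproducts.csv', on_bad_lines='skip')
df['onkey'] = 1
# Cross join: pair every name with every name (15,500 rows give about 240 million pairs)
df1 = pd.merge(df[['name','onkey']], df[['name','onkey']], on='onkey')
# Drop self-pairs, whose common prefix would be the whole name
df1 = df1[df1['name_x'] != df1['name_y']]
df1['list'] = df1.apply(lambda x: [x.name_x, x.name_y], axis=1)
df1['COL1'] = df1['list'].apply(commonprefix)
df1['COL1_num'] = df1['COL1'].apply(len)
df1 = df1[df1['COL1_num'] != 0]
# idxmax keeps the longest prefix per name; the original idxmin kept the
# shortest one, which is why only the first letter came back
df1 = df1.loc[df1.groupby('name_x')['COL1_num'].idxmax()]
df = df.rename(columns={'name': 'name_x'})
df = pd.merge(df, df1[['name_x','COL1']], on='name_x', how='left')
# Names without any match are NaN after the left merge
df['COL1'] = df['COL1'].fillna('')
df['len'] = df['COL1'].apply(len)
df['other'] = df.apply(lambda x: x.name_x[x.len:], axis=1)
df['COL1'] = df['COL1'].apply(lambda x: x.strip())
# endswith is safe on empty strings, unlike x[-1]
df['COL1'] = df['COL1'].apply(lambda x: x[:-1] if x.endswith('-') else x)
df['other'] = df['other'].apply(lambda x: x.split('-'))
df = df[['COL1','other']]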
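The effect of choosing `idxmin` versus `idxmax` in the `groupby` step can be checked on a tiny example. The sketch below uses three names from the sample input and assumes pandas 1.2 or later for `how='cross'`; the variable names (`pairs`, `shortest`, `longest`) are illustrative only.

```python
import pandas as pd
from os.path import commonprefix

# Three names taken from the sample input, used only for illustration
names = pd.DataFrame({'name': ['10-funniest Risque Slim - Black',
                               '10-funniest Risque Slim - Blue',
                               '100 insect magnets']})

# Pair every name with every other name (how='cross' needs pandas >= 1.2)
pairs = names.merge(names, how='cross', suffixes=('_x', '_y'))
# Drop self-pairs, otherwise each name trivially matches itself in full
pairs = pairs[pairs['name_x'] != pairs['name_y']].copy()
pairs['prefix'] = [commonprefix([a, b])
                   for a, b in zip(pairs['name_x'], pairs['name_y'])]
pairs['n'] = pairs['prefix'].str.len()

# idxmin keeps the shortest shared prefix per name, idxmax the longest
by_name = pairs.groupby('name_x')['n']
shortest = pairs.loc[by_name.idxmin()].set_index('name_x')['prefix']
longest = pairs.loc[by_name.idxmax()].set_index('name_x')['prefix']

print(shortest['10-funniest Risque Slim - Black'])
print(longest['10-funniest Risque Slim - Black'])
```

With `idxmin` the first name keeps only `10`, the prefix it shares with the unrelated `100 insect magnets`; with `idxmax` it keeps `10-funniest Risque Slim - Bl`. On the full file, the shortest non-empty prefix is likely to shrink to a single character.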
Input: this is the column you start with. I want to find the longest common string and put the non-matching part into a separate column.
product name
10 funniest Silicone Emperor - Ivory
10 funniest Stud 7 Inches - Hot Pink
10 funny elephant Hummer - Pink
10 funny elephant Hummer - Purple
10 Inch Realistic Dual Density Squirting snake
10 Inch Silicone Comfort Nozzle Attachment
10" comforter snake & comforter Bit Set - Black
10" comforter Jelly & comforter Bit Set - Pink
10" comforter Jelly & comforter Bit Set - Purple
10" Thick ladder W/balls & Suction - Black
100 insect magnets
1000 cloud Games
10-funniest Adonis Conqueror - Black
10-funniest Adonis Explorer - Red
10-funniest Adonis Vibrating Probe - Red
10-funniest Adonis Vibrating Strokers - Red
10-funniest Charisma Bliss - Black
10-funniest Charisma Bliss - Pink
10-funniest Charisma Kiss - Pink
10-funniest Charisma Tryst - Black
10-funniest Risque G-Vibe - Black
10-funniest Risque G-Vibe - Blue
10-funniest Risque G-Vibe - Purple
10-funniest Risque Slim - Black
10-funniest Risque Slim - Blue
10-funniest Risque Slim - Purple
10-funniest Risque Tulip - Black
10-funniest Risque Tulip - Blue
10-funniest Risque Tulip - Purple
Output: the matching part goes in one column and the non-matching part in another column.
new product name
10 funniest Silicone Emperor | Ivory
10 funniest Stud 7 Inches | Hot Pink
10 funny elephant Hummer | Pink
10 funny elephant Hummer | Purple
10 Inch Realistic Dual Density Squirting snake |
10 Inch Silicone Comfort Nozzle Attachment |
10" comforter snake & comforter Bit Set | Black
10" comforter Jelly & comforter Bit Set | Pink
10" comforter Jelly & comforter Bit Set | Purple
10" Thick ladder W/balls & Suction | Black
100 insect magnets |
1000 cloud Games |
10-funniest Adonis Conqueror | Black
10-funniest Adonis Explorer | Red
10-funniest Adonis Vibrating Probe | Red
10-funniest Adonis Vibrating Strokers | Red
10-funniest Charisma Bliss | Black
10-funniest Charisma Bliss | Pink
10-funniest Charisma Kiss | Pink
10-funniest Charisma Tryst | Black
10-funniest Risque G-vibe | Black
10-funniest Risque G-vibe | Blue
10-funniest Risque G-vibe | Purple
10-funniest Risque Slim | Black
10-funniest Risque Slim | Blue
10-funniest Risque Slim | Purple
10-funniest Risque Tulip | Black
10-funniest Risque Tulip | Blue
10-funniest Risque Tulip | Purple
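Comparing the input with the expected output, each name appears to be split at its last ` - ` separator, with names that lack the separator left whole. If that reading is right (it is an assumption, not something the question states), no pairwise comparison is needed. This is a different technique from the common-prefix approach in the question. A minimal sketch, reusing the `COL1` and `other` column names from the question:

```python
import pandas as pd

df = pd.DataFrame({'name': [
    '10 funny elephant Hummer - Pink',
    '10 Inch Silicone Comfort Nozzle Attachment',
    '10-funniest Risque G-Vibe - Black',
]})

# Split once, from the right, on the literal ' - ' separator.
# Hyphens without surrounding spaces ("10-funniest", "G-Vibe") are untouched.
parts = df['name'].str.rsplit(' - ', n=1, expand=True)
# Guarantee that column 1 exists even if no name contained the separator
parts = parts.reindex(columns=[0, 1])

df['COL1'] = parts[0]
df['other'] = parts[1].fillna('')  # names without a separator get an empty string

print(df[['COL1', 'other']])
```

This runs in a single pass over the column, so 15,500 rows are not a concern.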
Solution
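One possible direction, offered as a sketch rather than an accepted answer: the cross join in the question builds roughly 240 million pairs for 15,500 names. Once the names are sorted, the longest prefix a name shares with any other name is always shared with the name directly before or after it, so only neighbors need to be compared. The helper name `longest_shared_prefix` is hypothetical.

```python
from os.path import commonprefix
import pandas as pd

def longest_shared_prefix(names):
    """Map each distinct name to the longest prefix it shares with another name."""
    ordered = sorted(set(names))
    best = {}
    for i, name in enumerate(ordered):
        candidates = []
        # In sorted order the best match is the previous or the next name
        if i > 0:
            candidates.append(commonprefix([name, ordered[i - 1]]))
        if i + 1 < len(ordered):
            candidates.append(commonprefix([name, ordered[i + 1]]))
        best[name] = max(candidates, key=len, default='')
    return best

df = pd.DataFrame({'name': [
    '10-funniest Risque Slim - Black',
    '10-funniest Risque Slim - Blue',
    '10-funniest Risque Tulip - Black',
    '100 insect magnets',
]})

prefixes = longest_shared_prefix(df['name'])
df['COL1'] = df['name'].map(prefixes)
# The remainder of each name after its shared prefix
df['other'] = [n[len(p):] for n, p in zip(df['name'], df['COL1'])]

print(df[['COL1', 'other']])
```

Note that `commonprefix` compares character by character, so a prefix can end in the middle of a word (here `... - Bl` for `Black` and `Blue`). Trimming back to the last separator would be an additional step.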