Matching strings within the same column using a large dataset?

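Problem description

Why does my code return only the first letter of each row instead of the longest matching string? I am using a large dataset with 1 column and 15,500 rows.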

 import csv
 import pandas as pd
 import numpy as np
 from os.path import commonprefix

 # Load the product names; on_bad_lines='skip' replaces error_bad_lines=False (pandas >= 1.3)
 df = pd.read_csv('newproducts.csv', on_bad_lines='skip')
 # Cross join: pair every name with every name through a constant key
 df['onkey'] = 1
 df1 = pd.merge(df[['name','onkey']], df[['name','onkey']], on='onkey')
 df1['list'] = df1.apply(lambda x: [x.name_x, x.name_y], axis=1)
 # Common prefix of each pair and its length
 df1['COL1'] = df1['list'].apply(lambda x: commonprefix(x))
 df1['COL1_num'] = df1['COL1'].apply(lambda x: len(x))
 df1 = df1[(df1['COL1_num']!=0)]
 # Keep one row per name
 df1 = df1.loc[df1.groupby('name_x')['COL1_num'].idxmin()]
 df = df.rename(columns={'name':'name_x'})
 df = pd.merge(df, df1[['name_x','COL1']], on='name_x', how='left')

 # Split each name into the matched prefix and the remainder
 df['len'] = df['COL1'].apply(lambda x: len(x))
 df['other'] = df.apply(lambda x: x.name_x[x.len:], axis=1)
 df['COL1'] = df['COL1'].apply(lambda x: x.strip())
 df['COL1'] = df['COL1'].apply(lambda x: x[:-1] if x[-1]=='-' else x)
 df['other'] = df['other'].apply(lambda x: x.split('-'))
 df = df[['COL1','other']]

Input: this is the column I start with. I want to find the longest common string and put the non-matching part into a separate column.

product name
10 funniest Silicone Emperor - Ivory
10 funniest Stud 7 Inches - Hot Pink
10 funny elephant Hummer - Pink
10 funny elephant Hummer - Purple
10 Inch Realistic Dual Density Squirting snake
10 Inch Silicone Comfort Nozzle Attachment
10" comforter snake & comforter Bit Set - Black
10" comforter Jelly & comforter Bit Set - Pink
10" comforter Jelly & comforter Bit Set - Purple
10" Thick ladder W/balls & Suction - Black
100 insect magnets
1000 cloud Games
10-funniest Adonis Conqueror - Black
10-funniest Adonis Explorer - Red
10-funniest Adonis Vibrating Probe - Red
10-funniest Adonis Vibrating Strokers - Red
10-funniest Charisma Bliss - Black
10-funniest Charisma Bliss - Pink
10-funniest Charisma Kiss - Pink
10-funniest Charisma Tryst - Black
10-funniest Risque G-Vibe - Black
10-funniest Risque G-Vibe - Blue
10-funniest Risque G-Vibe - Purple
10-funniest Risque Slim - Black
10-funniest Risque Slim - Blue
10-funniest Risque Slim - Purple
10-funniest Risque Tulip - Black
10-funniest Risque Tulip - Blue
10-funniest Risque Tulip - Purple

Output: the matching part goes in one column and the non-matching part in another.

new product name    
10 funniest Silicone Emperor     Ivory
10 funniest Stud 7 Inches    Hot Pink
10 funny elephant Hummer     Pink
10 funny elephant Hummer     Purple
10 Inch Realistic Dual Density Squirting snake  
10 Inch Silicone Comfort Nozzle Attachment  
10" comforter snake & comforter Bit Set      Black
10" comforter Jelly & comforter Bit Set      Pink
10" comforter Jelly & comforter Bit Set      Purple
10" Thick ladder W/balls & Suction   Black
100 insect magnets  
1000 cloud Games    
10-funniest Adonis Conqueror     Black
10-funniest Adonis Explorer      Red
10-funniest Adonis Vibrating Probe   Red
10-funniest Adonis Vibrating Strokers    Red
10-funniest Charisma Bliss   Black
10-funniest Charisma Bliss   Pink
10-funniest Charisma Kiss    Pink
10-funniest Charisma Tryst   Black
10-funniest Risque G-vibe    Black
10-funniest Risque G-vibe    Blue
10-funniest Risque G-vibe    Purple
10-funniest Risque Slim      Black
10-funniest Risque Slim      Blue
10-funniest Risque Slim      Purple
10-funniest Risque Tulip     Black
10-funniest Risque Tulip     Blue
10-funniest Risque Tulip     Purple

Tags: python, pandas, dataframe

Solution
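
The single-character result most likely comes from `idxmin()` in the question's code: once the empty prefixes are filtered out, it keeps the shortest common prefix per name, which is usually one character such as "1". Taking the longest prefix instead, and skipping the pairing of each name with itself, addresses that, but a full cross join of 15,500 names produces roughly 240 million pairs. The sketch below is one possible alternative, not a definitive implementation. It relies on the fact that, in a sorted list, the longest prefix a string shares with any other string is shared with one of its two neighbours. The helper names `longest_shared_prefix` and `split_on_separator` are hypothetical, and treating " - " as the separator is an assumption read off the sample data.

```python
from os.path import commonprefix


def longest_shared_prefix(names):
    """For each name, return the longest prefix it shares with any other name."""
    # Indices of the names in lexicographic order.
    order = sorted(range(len(names)), key=lambda i: names[i])
    best = [""] * len(names)
    for pos, i in enumerate(order):
        # Only the previous and next names in sorted order need checking.
        for other_pos in (pos - 1, pos + 1):
            if 0 <= other_pos < len(order):
                prefix = commonprefix([names[i], names[order[other_pos]]])
                if len(prefix) > len(best[i]):
                    best[i] = prefix
    return best


def split_on_separator(name, sep=" - "):
    """Split a name at the last separator (assumed here to be ' - ')."""
    head, found, tail = name.rpartition(sep)
    return (head, tail) if found else (name, "")


names = [
    "10-funniest Risque Slim - Black",
    "10-funniest Risque Slim - Blue",
    "10-funniest Risque Tulip - Black",
    "100 insect magnets",
]
prefixes = longest_shared_prefix(names)
# prefixes[0] == "10-funniest Risque Slim - Bl"  (runs into the colour)
parts = [split_on_separator(n) for n in names]
# parts[0] == ("10-funniest Risque Slim", "Black")
# parts[3] == ("100 insect magnets", "")
```

With a DataFrame, the same helpers could be applied as `df['prefix'] = longest_shared_prefix(df['name'].tolist())`. The raw prefix can run into the variant text (for example `... Slim - Bl`), so the expected output in the question looks closer to a split on the last " - " than to a pure longest-prefix match.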

