python - 根据条件在一个数据帧中拆分和替换熊猫中的另一个数据帧
问题描述
我有两个数据框,都包含 sql 表。
这是我的第一个数据框
Original_Input Cleansed_Input Core_Input Type_input
TECHNOLOGIES S.A TECHNOLOGIES SA
A & J INDUSTRIES, LLC A J INDUSTRIES LLC
A&S DENTAL SERVICES AS DENTAL SERVICES
A.M.G Médicale Inc AMG Mdicale Inc
AAREN SCIENTIFIC AAREN SCIENTIFIC
我的第二个数据框是:
Name_Extension Company_Type Priority
co llc Company LLC 2
Pvt ltd Private Limited 8
Corp Corporation 4
CO Ltd Company Limited 3
inc Incorporated 5
CO Company 1
我删除了标点符号、ASCII 和数字,并将这些数据cleansed_input
放在df1
.
中的该cleansed_input
列df1
需要与 的Name_Extension
列进行检查df2
。如果 from 的值在末尾cleansed_input
有任何值,Name_Extension
那么应该将其拆分并放入,type_input column
而df1
不仅仅是这样,而是缩写。
例如,如果CO
存在cleansed_column
,则应缩写为Company
并放入,type_input column
其余文本应在core_type
列中df1
。还有优先级,不确定是否需要。
预期输出:
Original_Input Cleansed_Input Core_Input Type_input
TECHNOLOGIES S.A TECHNOLOGIES SA TECHNOLOGIES SA
A & J INDUSTRIES, LLC A J INDUSTRIES LLC A J INDUSTRIES LLC
A&S DENTAL SERVICES AS DENTAL SERVICES
A.M.G Médicale Inc AMG Mdicale Inc AMG Mdicale Incorporated
AAREN SCIENTIFIC AAREN SCIENTIFIC
我尝试了很多方法,例如 isin、mask、contains 等,但不知道在哪里放什么。
我得到一个错误说"Series are mutable, they cannot be hashed"
。当我尝试使用数据框时,我不确定为什么会出现该错误。
我没有该代码,并且正在使用 jupiter notebook 和 sql server 并且 isin 似乎在 jupiter 中不起作用。
同样,还有另一个拆分要做。要拆分为 parent_compnay 名称和别名名称的 original_input 列。
Here is my code:
import pyodbc
import pandas as pd
import string
from string import digits
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import String
from io import StringIO
from itertools import chain
import re
#Connecting SQL with Python
server = '172.16.15.9'
database = 'Database Demo'
username = '**'
password = '******'
engine = create_engine('mssql+pyodbc://**:******@'+server+'/'+database+'?
driver=SQL+server')
#Reading SQL table and grouping by columns
data=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
#df1=pd.read_sql('Select * from company_Extension',engine)
#print(df1)
#gp = df.groupby(["CustomerName", "Quantity"]).size()
#print(gp)
#1.Removing ASCII characters
data['Cleansed_Input'] = data['Original_Input'].apply(lambda x:''.join([''
if ord(i) < 32 or ord(i) > 126 else i for i in x]))
#2.Removing punctuations
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda
x:''.join([x.translate(str.maketrans('', '', string.punctuation))]))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i
in x if i not in string.punctuation]))
#3.Removing numbers in a table.
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda
x:x.translate(str.maketrans('', '', string.digits)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i
in x if i not in string.digits]))
#4.Removing trialing and leading spaces
data['Cleansed_Input']=df['Cleansed_Input'].apply(lambda x: x.strip())
df=pd.DataFrame(data)
#data1=pd.DataFrame(df1)
df2 = pd.DataFrame({
"Name_Extension": ["llc",
"Pvt ltd",
"Corp",
"CO Ltd",
"inc",
"CO",
"SA"],
"Company_Type": ["Company LLC",
"Private Limited",
"Corporation",
"Company Limited",
"Incorporated",
"Company",
"Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})
data.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)
解决方案
这是您可以实施的可能解决方案:
df = pd.DataFrame({
"Original_Input": ["TECHNOLOGIES S.A",
"A & J INDUSTRIES, LLC",
"A&S DENTAL SERVICES",
"A.M.G Médicale Inc",
"AAREN SCIENTIFIC"],
"Cleansed_Input": ["TECHNOLOGIES SA",
"A J INDUSTRIES LLC",
"AS DENTAL SERVICES",
"AMG Mdicale Inc",
"AAREN SCIENTIFIC"]
})
df_2 = pd.DataFrame({
"Name_Extension": ["llc",
"Pvt ltd",
"Corp",
"CO Ltd",
"inc",
"CO",
"SA"],
"Company_Type": ["Company LLC",
"Private Limited",
"Corporation",
"Company Limited",
"Incorporated",
"Company",
"Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})
# Preprocessing text
df["lower_input"] = df["Cleansed_Input"].str.lower()
df_2["lower_extension"] = df_2["Name_Extension"].str.lower()
# Getting the lowest priority matching the end of the string
extensions_list = [ (priority, extension.lower_extension.values[0])
for priority, extension in df_2.groupby("Priority") ]
df["extension_priority"] = df["lower_input"] \
.apply(lambda p: next(( priority
for priority, extension in extensions_list
if p.endswith(extension)), None))
# Merging both dataframes based on priority. This step can be ignored if you only need
# one column from the df_2. In that case, just give the column you require instead of
# `priority` in the previous step.
df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")
# Removing the matched extensions from the `Cleansed_Input` string
df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
df["Core_Input"] = df.apply(
lambda p: p["Cleansed_Input"]
if p["aux"] == 0
else p["Cleansed_Input"][:p["aux"]].strip(),
axis=1
)
# Selecting required columns
df[[ "Original_Input", "Core_Input", "Company_Type", "Name_Extension" ]]
我假设“优先级”列将具有唯一值。但是,如果不是这种情况,只需对优先级进行排序并根据该顺序创建一个索引,如下所示:
df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))
另外,下次以任何人都可以轻松加载的格式给出数据示例。处理您发送的格式很麻烦。
编辑:与问题无关,但可能会有所帮助。您可以使用以下方法简化从 1 到 4 的步骤:
data['Cleansed_Input'] = data["Original_Input"] \
.str.replace("[^\w ]+", "") \ # removes non-alpha characters
.str.replace(" +", " ") \ # removes duplicated spaces
.str.strip() # removes spaces before or after the string
编辑 2:解决方案的 SQL 版本(我使用的是 PostgreSQL,但我使用了标准 SQL 运算符,因此差异不应该那么大)。
SELECT t.Original_Name,
t.Cleansed_Input,
t.Name_Extension,
t.Company_Type,
t.Priority
FROM (
SELECT df.Original_Name,
df.Cleansed_Input,
df_2.Name_Extension,
df_2.Company_Type,
df_2.Priority,
ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
ON lower(df.Cleansed_Input) like ( '%' || lower(df_2.Name_Extension) )
) t
WHERE rn = 1
推荐阅读
- java - 尝试获取连接字符串java时发生文件未找到错误
- python - ValueError:您在优化器 RMSprop 上调用了“set_weights(weights)”,权重列表长度为 3,但优化器期望权重为 0
- python - 我可以在 C 盘中创建我的 Django 项目吗?
- java - Spring Boot @Mapper Bean 创建问题:应用程序无法启动。错误:考虑定义一个 bean 类型
- python - Python:我如何连续添加一个每个 for 循环周期增加 10 的数字?
- java - ArrayAdapter() 中的上下文
- javascript - 将自定义 getter 添加到接收到的 json
- python - 如何将 DataFrame 保存为列表而不是字符串
- r - 在 r 中按组从每个索引中的一行迭代组合
- python - 如何使用 unicodedata 打印 unicode 字符的值?