Searching a DataFrame in Pandas and finding the number after a specific string

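Problem description

I am trying to extract the number that follows a specific string in a DataFrame. I need to scan the whole DataFrame for the string "Concession Type:", take the value that follows it (usually "Concession Type: CC" or "Concession Type: None"), and create a column from it. That column will hold either "CC" or "None". If a row has the CC concession type, I want to create a second column from another string elsewhere in the frame, of the form "Total Amount: x", and pull out the "x". These strings are buried in various columns, so I cannot address a single column (the DataFrame was built from text extracted from a PDF, and every line break creates a new column).

What I have so far is below. It looks through all the text in the DataFrame for "Concession Type: None" and creates a Concession Type column, does the same for "Concession Type: $", then checks the conditions listed below and creates a "Concession_check" column. Here is a sample of the data.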

6/9/2020 1 Per Page - Listing Report**IRES MLS  : 91 PRICE: $59,900**12 Warrior Way**ATTACHED DWELLING ACTIVE / BACKUP**Locale: Lafa County: Bould**Area/SubArea: 3/0**Subdivision: Lafayett Greens Townhomes**School District: Bould Vall Dist New Const: No**Builder: Model:**Lot SqFt: 625 Approx. Acres: 0.01**New Const Notes:**Elec: Xcel Water: City of Lafay**Gas: Xcel Taxes: $1,815/2019 Listing Comments: Bright, Modern and Cozy!
6/9/2020 1 Per Page - Listing Report**IRES MLS : 906 PRICE: $350,000**15 Calks Ave, Long 80501**RESIDENTIAL-DETACHED SOLD**Locale: Longmont County: Bould**Area/SubArea: 4/6**Sold Date: 04/01/2020 Sold Price: $360,000**Bedrooms: 3 Baths: 2 Rough Ins: 0**Terms: VA FIX DOM: 1 DTO: 1 DTS: 24**Baths Bsmt Lwr Main Upr Addl Total Down Pmt Assist: N**Full 0 0 0 1 0 1 Concession Type: None**3/4 0 1 0 0 0 1****https://www.iresis.com/MLS/Search/index.cfm?Action=LaunchReports 249/250
6/9/2020 1 Per Page - Listing Report**IRES MLS : 908 PRICE: $360,000**7 S Roosevelt Ave, Lafa 80026**RESIDENTIAL-DETACHED SOLD**Locale: Lafay County: Boul**Area/SubArea: 3/0**Sold Date: 05/08/2020 Sold Price: $360,000**Bedrooms: 2 Baths: 1 Rough Ins: 0**Terms: CONV FIX DOM: 5 DTO: 5 DTS: 34**Baths Bsmt Lwr Main Upr Addl Total Down Pmt Assist: N**Full 0 0 1 0 0 1 Concession Type: None**3/4 0 0 0 0 0 0**Property Features**1/2 0 0 0 0 0 0 Style: 1 Story/Ranch Construction: Wood/Frame, Metal Siding Roof:**https://www.iresis.com/MLS/Search/index.cfm?Action=LaunchReports 250/250

import numpy as np
import pandas as pd

# df starts out as the list of raw strings shown above
df = pd.DataFrame([sub.split("**") for sub in df])
df[['MLS #', 'Price']] = df[1].str.split('PRICE:', n=1, expand=True)
df[['Prop Type', 'Status']] = df[3].str.rsplit(' ', n=1, expand=True)
df['Concession Type'] = df.apply(lambda row: row.astype(str).str.contains('Concession Type: None', regex=False).any(), axis=1)
# Note: this assignment overwrites the column created on the previous line
df['Concession Type'] = df.apply(lambda row: row.astype(str).str.contains('Concession Type: $', regex=False).any(), axis=1)
conditions = [(df['Concession Type'] == True) & (df['Status'] == 'SOLD'),
             (df['Concession Type'] == False) & (df['Status'] == 'SOLD')]
choices = ['no concession', 'concession']
df['Concession_check'] = np.select(conditions, choices, default='Active/Pending/Withdrawn')

Tags: python, pandas

Solution

I don't have enough information about the structure of the input data, so I assume each row is one element of a list:

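import pandas as pd

rows = ["row1", "row2", "row3"]  # the raw strings from the first code block in your question
df = pd.DataFrame([sub.split("**") for sub in rows])
# Boolean mask, True where a cell contains "Concession";
# na=False treats empty cells as "no match" so the mask is purely boolean
mask = df.apply(lambda col: col.str.contains("Concession", regex=False, na=False))
df[mask]  # keeps the matching cells; every other cell becomes NaN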

From here you can add further checks.
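As one possible next step, here is a minimal sketch of pulling out the values themselves rather than just locating the cells. It joins each row back into a single string so the column position no longer matters, then uses `Series.str.extract` with a regular expression. The sample rows are shortened stand-ins for the real data, the third row is invented for illustration, and the exact spelling of the "Total Amount:" label and the "Concession Amount" column name are assumptions; adjust the patterns to whatever the PDF text actually contains.

```python
import pandas as pd

# Shortened stand-ins for the extracted PDF text. The third row and its
# "Total Amount:" label are invented for illustration (assumption).
rows = [
    "IRES MLS : 906 PRICE: $350,000**RESIDENTIAL-DETACHED SOLD**Full 0 0 0 1 0 1 Concession Type: None**3/4 0 1 0 0 0 1",
    "IRES MLS : 91 PRICE: $59,900**ATTACHED DWELLING ACTIVE / BACKUP**Locale: Lafa County: Bould",
    "IRES MLS : 999 PRICE: $400,000**RESIDENTIAL-DETACHED SOLD**Concession Type: CC**Total Amount: 2,500",
]
df = pd.DataFrame([sub.split("**") for sub in rows])

# Join every cell of a row into one string, so it does not matter
# which column a label ended up in.
text = df.fillna("").astype(str).apply(lambda row: " ".join(row), axis=1)

# Capture the word after "Concession Type:" (NaN when the label is absent).
df["Concession Type"] = text.str.extract(r"Concession Type:\s*(\w+)", expand=False)

# Capture the number after the assumed "Total Amount:" label, then drop
# thousands separators before converting it to a float.
amount = text.str.extract(r"Total Amount:\s*\$?([\d,]+(?:\.\d+)?)", expand=False)
df["Concession Amount"] = pd.to_numeric(amount.str.replace(",", "", regex=False))

# Keep the amount only for rows whose concession type is CC.
df["Concession Amount"] = df["Concession Amount"].where(df["Concession Type"] == "CC")
```

Rows without a "Concession Type:" label get NaN in the new column, which keeps them distinguishable from rows that explicitly say "None" when building a status check such as the `np.select` step in the question.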

