Creating dummy and categorical variables from specific words in a text column of a Python data frame

Problem description

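I am trying to use Python to generate dummy and categorical variables from a text column in a data frame. Imagine a text column "Cars_notes" in a data frame named "Cars_listing":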

- "This Audi has ABS braking, leather interior and bucket seats..."
- "The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."
- "Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."
- "This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."
- "The Renault Le Car has been sitting in the garage, a little rust..."
- "The Kia Sorento for sale has a CD player, new tires..."
- "Red Dodge Viper convertible for sale, ceramic brakes, low miles..."

How can I create these new variables:

- car_type: American [Ford] (1), European [Audi, Renault] (2), Asian [Toyota, Kia] (3)
- ABS_brakes: description includes 'ABS brak' (1), or not (0)
- imperfection: description includes 'rust' or 'scratches' (1) or not (0)
- sporty: description includes 'convertible' (1) or not (0) 

I have started experimenting with re.search() (not re.match()), for example:

sporty = re.search("convertible",'Cars_notes')

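I am just starting to learn text manipulation and NLP in Python. I have searched here and in other sources (Data Camp, Udemy, Google), but I have not yet found anything that explains how to manipulate text to create categorical or dummy variables like these. Any help would be appreciated. Thanks!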

Tags: python, string, pandas, variables, text

Solution


Here is my take on this.

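Since you are working with text, pandas.Series.str.contains should be plenty (there is no need to use re.search). It tests every row of the column against a pattern and returns a boolean Series.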
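To illustrate the difference, here is a minimal sketch. The three-row `notes` Series and its missing value are made up for the example, and `na=False` is an optional choice for treating missing descriptions as non-matches, not something the question requires.

```python
import re
import pandas as pd

# Hypothetical sample column; the None stands in for a missing description.
notes = pd.Series([
    "Red Dodge Viper convertible for sale, ceramic brakes, low miles...",
    "The Kia Sorento for sale has a CD player, new tires...",
    None,
])

# re.search works on one string at a time. Passing the literal 'Cars_notes'
# searches the column *name*, not the column contents, so nothing is found.
literal_match = re.search("convertible", "Cars_notes")

# str.contains checks every row and returns a boolean Series.
# na=False makes rows with missing text count as "no match".
is_sporty = notes.str.contains("convertible", na=False)

# Casting the booleans to int gives a 0/1 dummy variable.
sporty = is_sporty.astype(int)

print(literal_match)
print(sporty.tolist())
```

Casting with `.astype(int)` is one alternative to wrapping the condition in `np.where(..., 1, 0)`; both should give the same 0/1 column.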

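np.where and np.select are useful for assigning new variables based on conditions: np.where handles a single condition with two outcomes, while np.select handles several conditions, each with its own value.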
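As a rough sketch of how the two functions differ, the snippet below applies each to a shortened, made-up version of the notes column. The variable names `has_abs` and `car_type` are only illustrative.

```python
import numpy as np
import pandas as pd

# Shortened sample descriptions, for illustration only.
notes = pd.Series([
    "This Audi has ABS braking...",
    "The Ford F150 is one tough pickup truck...",
    "Red Dodge Viper convertible for sale...",
])

# np.where: one condition, one value if True, another if False.
has_abs = np.where(notes.str.contains("ABS brak"), 1, 0)

# np.select: several conditions checked in order; the first one that is
# True picks the matching entry of choicelist, otherwise default is used.
car_type = np.select(
    condlist=[
        notes.str.contains("Ford"),
        notes.str.contains("Audi|Renault"),
    ],
    choicelist=[1, 2],
    default=0,
)

print(has_abs.tolist())
print(car_type.tolist())
```

Because np.select stops at the first matching condition for each row, the order of `condlist` matters when a description could match more than one pattern.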

import pandas as pd
import numpy as np

Cars_listing = pd.DataFrame({
    'Cars_notes': 
    ['"This Audi has ABS braking, leather interior and bucket seats..."',
    '"The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."',
    '"Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."',
    '"This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."',
    '"The Renault Le Car has been sitting in the garage, a little rust..."',
    '"The Kia Sorento for sale has a CD player, new tires..."',
    '"Red Dodge Viper convertible for sale, ceramic brakes, low miles..."']
})


# 1. car_type
Cars_listing['car_type'] = np.select(
    condlist=[ # note you could use the case-insensitive search with `case=False`
        Cars_listing['Cars_notes'].str.contains('ford', case=False),
        Cars_listing['Cars_notes'].str.contains('audi|renault', case=False),
        Cars_listing['Cars_notes'].str.contains('Toyota|Kia')
    ],
    choicelist=[1, 2, 3], # dummy variables
    default=0 # you could set it to `np.nan` etc
)

# 2. ABS_brakes
Cars_listing['ABS_brakes'] = np.where(# where(condition, [x, y])
    Cars_listing['Cars_notes'].str.contains('ABS brak'), 1, 0)

# 3. imperfection
Cars_listing['imperfection'] = np.where(
    Cars_listing['Cars_notes'].str.contains('rust|scratches'), 1, 0)

# 4. sporty
Cars_listing['sporty'] = np.where(
    Cars_listing['Cars_notes'].str.contains('convertible'), 1, 0)

Result:

    Cars_notes              car_type    ABS_brakes  imperfection    sporty
0   """This Audi has ..."   2           1           0               0
1   """The Ford F150 ..."   1           0           0               0
2   """Our Nissan Sen..."   0           1           0               0
3   """This Toyota Co..."   3           0           1               0
4   """The Renault Le..."   2           0           1               0
5   """The Kia Sorent..."   3           0           0               0
6   """Red Dodge Vipe..."   0           0           0               1
