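python - Creating dummy and categorical variables from specific words in a text column of a Python DataFrame
Question
I am trying to use Python to generate dummy and categorical variables from a text column in a DataFrame. Imagine a text column "Cars_notes" in a DataFrame called "Cars_listing":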
- "This Audi has ABS braking, leather interior and bucket seats..."
- "The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."
- "Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."
- "This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."
- "The Renault Le Car has been sitting in the garage, a little rust..."
- "The Kia Sorento for sale has a CD player, new tires..."
- "Red Dodge Viper convertible for sale, ceramic brakes, low miles..."
How can I create the following new variables:
- car_type: American [Ford] (1), European [Audi, Renault] (2), Asian [Toyota, Kia] (3)
- ABS_brakes: description includes 'ABS brak' (1), or not (0)
- imperfection: description includes 'rust' or 'scratches' (1) or not (0)
- sporty: description includes 'convertible' (1) or not (0)
I have started experimenting with re.search() (not re.match()), for example:
sporty = re.search("convertible",'Cars_notes')
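I'm just starting to learn Python text manipulation and NLP. I have searched for information here and in other sources (Data Camp, Udemy, Google searches), but I have not yet found anything that explains how to manipulate text to create categorical or dummy variables like these. Any help would be appreciated. Thanks!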
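Solution
Here is my take on this.
Since you are dealing with text, pandas.Series.str.contains should be plenty (there is no need to use re.search). np.where and np.select are useful for assigning new variables based on conditions.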
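As a quick illustration of why the re.search attempt in the question finds nothing, and what str.contains returns instead, here is a minimal sketch. The toy `notes` Series is made up for illustration and is not the question's data; `na=False` is an optional extra, assumed here so that missing descriptions count as non-matches.

```python
import re
import pandas as pd

# Toy data, made up for illustration only
notes = pd.Series([
    "Red convertible, low miles",
    "Pickup truck with 4x4",
    None,  # a missing description
])

# re.search on the column *name* only scans the literal text 'Cars_notes',
# not the rows of the column, so it finds nothing
literal_match = re.search("convertible", "Cars_notes")
print(literal_match)  # None

# str.contains tests every row and returns a boolean Series;
# na=False treats missing text as "no match"
mask = notes.str.contains("convertible", case=False, na=False)

# A boolean mask converts directly to a 0/1 dummy variable
sporty = mask.astype(int)
print(sporty.tolist())  # [1, 0, 0]
```

Casting the mask with `.astype(int)` gives the same 0/1 result as `np.where(mask, 1, 0)`. The full answer for the question's data follows.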
import pandas as pd
import numpy as np
Cars_listing = pd.DataFrame({
    'Cars_notes':
        ['"This Audi has ABS braking, leather interior and bucket seats..."',
         '"The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."',
         '"Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."',
         '"This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."',
         '"The Renault Le Car has been sitting in the garage, a little rust..."',
         '"The Kia Sorento for sale has a CD player, new tires..."',
         '"Red Dodge Viper convertible for sale, ceramic brakes, low miles..."']
})
# 1. car_type
Cars_listing['car_type'] = np.select(
    condlist=[  # note you could use the case-insensitive search with `case=False`
        Cars_listing['Cars_notes'].str.contains('ford', case=False),
        Cars_listing['Cars_notes'].str.contains('audi|renault', case=False),
        Cars_listing['Cars_notes'].str.contains('Toyota|Kia')
    ],
    choicelist=[1, 2, 3],  # category codes, matched in order to condlist
    default=0  # used when no condition matches; you could set it to `np.nan` etc
)
# 2. ABS_brakes
Cars_listing['ABS_brakes'] = np.where(  # where(condition, [x, y])
    Cars_listing['Cars_notes'].str.contains('ABS brak'), 1, 0)
# 3. imperfection
Cars_listing['imperfection'] = np.where(
    Cars_listing['Cars_notes'].str.contains('rust|scratches'), 1, 0)
# 4. sporty
Cars_listing['sporty'] = np.where(
    Cars_listing['Cars_notes'].str.contains('convertible'), 1, 0)
                Cars_notes  car_type  ABS_brakes  imperfection  sporty
0  """This Audi has ..."         2           1             0       0
1  """The Ford F150 ..."         1           0             0       0
2  """Our Nissan Sen..."         0           1             0       0
3  """This Toyota Co..."         3           0             1       0
4  """The Renault Le..."         2           0             1       0
5  """The Kia Sorent..."         3           0             0       0
6  """Red Dodge Vipe..."         0           0             0       1
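If the list of brands grows, one possible alternative for car_type is to keep the brands in a lookup table and extract the first brand mentioned in each note. The sketch below is only an illustration under stated assumptions: `BRAND_REGION` and `add_car_type` are hypothetical names, not part of pandas, and the mapping covers only the brands given in the question (so Nissan and Dodge still fall back to 0).

```python
import re
import pandas as pd

# Hypothetical brand-to-code lookup; extend it with the brands in your data
BRAND_REGION = {
    "ford": 1,     # American
    "audi": 2,     # European
    "renault": 2,  # European
    "toyota": 3,   # Asian
    "kia": 3,      # Asian
}

def add_car_type(df, text_col="Cars_notes", lookup=BRAND_REGION):
    """Return a copy of df with a car_type column (0 when no known brand is found)."""
    # Word boundaries avoid matching a brand name inside a longer word
    pattern = r"\b(" + "|".join(lookup) + r")\b"
    # Extract the first brand mentioned in each note, ignoring case
    brand = df[text_col].str.extract(pattern, flags=re.IGNORECASE, expand=False)
    out = df.copy()
    # Normalise case, look up the code, and fall back to 0 for unknown brands
    out["car_type"] = brand.str.lower().map(lookup).fillna(0).astype(int)
    return out

# Small demo with shortened notes from the question
demo = pd.DataFrame({"Cars_notes": [
    "This Audi has ABS braking",
    "The Ford F150 is one tough pickup truck",
    "Our Nissan Sentra comes with ABS brakes",
    "The Kia Sorento for sale has a CD player",
]})
result = add_car_type(demo)
print(result["car_type"].tolist())  # [2, 1, 0, 3]
```

Compared with np.select, this keeps the brand list in one place, at the cost of only using the first brand that appears in each description.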