python - Creating a new row whenever a comma appears in the column
Problem description
I'm trying to create a mini program that will find the closest open restaurant to my location. I have a dataset that includes restaurant names, locations, stars, and hours. However, there is a problem: sometimes a restaurant will have multiple open/close times in a day.
For example:
Name, location, type, and hours
Blue Duck Tavern, 1201 24th St NW, American Restaurant, 6:30-10:30AM, 11:30AM-2PM,5:30-10:30PM
I'm trying to get the data into a CSV, but for restaurants with multiple hours (like in the example), it can't properly parse it.
The easiest solution for this would (I think) be to create another line with the same information but the next set of hours. So the example would then read:
Blue Duck Tavern, 1201 24th St NW, American Restaurant, 6:30-10:30AM
Blue Duck Tavern, 1201 24th St NW, American Restaurant, 11:30AM-2PM
Blue Duck Tavern, 1201 24th St NW, American Restaurant, 5:30-10:30PM
So the program wouldn't show the restaurant if it wasn't open.
So I have three general questions. 1) Is there a better way to go about this than the solution I mentioned above (creating a new row for every set of open/close hours)? 2) I'm having trouble with the following implementation:
import pandas as pd
import numpy as np
data = pd.import_csv(data.csv)
for row in data:
    if data['hours'].str.contains(',') == 'True':
        count = data['hours'].str.count(',')
        data.append..
        <create new row with Name[row], location[row], type[row], and hours[row] for the # of count>
I've tried google-ing around, and I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So I tried to switch it up to:
if data['Monday'].any('Monday').str.contains(',') == 'True':
which results in: ValueError: No axis named Monday for object type
And I'm a bit unclear on the next steps here, or what I'm doing wrong, because if I just do:
print data[data['Monday'].astype(str).str.contains(',')]
It works and returns the result. But I can't do any kind of conditional without it throwing an error.
3) I'm also a bit confused about what to do if there is more than one comma in the row. I have a vague idea, but if you have any hints, I'd love to hear them :)
Thanks for reading!
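Solution
If I understand correctly, you can load the data using a regular expression as the separator, making sure that whatever precedes the comma is not AM or PM (using a negative lookbehind). Then, after setting every column you don't want to modify as the index, you can use str.split and stack.
For example: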
# A regex separator requires the Python parsing engine
data = pd.read_csv('data.csv', sep=r'(?<!AM|PM),', engine='python')
# Get rid of spaces in your column names
data.columns = data.columns.str.strip(' ')
>>> data
Name location type hours
0 Blue Duck Tavern 1201 24th St NW American Restaurant 6:30-10:30AM, 11:30AM-2PM,5:30-10:30PM
new_data = (data.set_index(['Name', 'location', 'type'])
                .hours.str.split(',', expand=True)
                .stack()
                .reset_index(level=['Name', 'location', 'type']))
>>> new_data
Name location type 0
0 Blue Duck Tavern 1201 24th St NW American Restaurant 6:30-10:30AM
1 Blue Duck Tavern 1201 24th St NW American Restaurant 11:30AM-2PM
2 Blue Duck Tavern 1201 24th St NW American Restaurant 5:30-10:30PM
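As a possible alternative to the split-and-stack step, here is a minimal self-contained sketch that uses DataFrame.explode instead. It assumes a reasonably recent pandas (explode was added in 0.25, its ignore_index argument in 1.1), and it reads the example row from an in-memory string rather than data.csv. The names raw and long_data are made up for illustration and do not come from the question.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the example row from the question.
raw = (
    "Name, location, type, hours\n"
    "Blue Duck Tavern, 1201 24th St NW, American Restaurant, "
    "6:30-10:30AM, 11:30AM-2PM,5:30-10:30PM\n"
)

# Split on commas that are NOT preceded by AM or PM, so the hours stay
# together in one field. Regex separators need the Python engine.
data = pd.read_csv(io.StringIO(raw), sep=r"(?<!AM|PM),", engine="python")

# Remove the spaces left over around column names and text values.
data.columns = data.columns.str.strip()
for col in ["Name", "location", "type"]:
    data[col] = data[col].str.strip()

# Turn the hours string into a list, then emit one row per list element.
data["hours"] = data["hours"].str.split(",")
long_data = data.explode("hours", ignore_index=True)
long_data["hours"] = long_data["hours"].str.strip()

print(long_data)
```

Unlike the stack version, this keeps the column name hours rather than 0, and the final strip removes the space that precedes the second time range in the raw data.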