python - 如何根据条件删除 Pandas 数据框中的重复项
问题描述
我正在制作一个小型计数软件,它基本上可以计算一个场所内的总人数。我从微控制器数据库(允许人们进出)获得的数据框存在人为错误,有时用户在进入之前退出。因此,数据框中存在一个条目在另一个后续条目之前有多个出口的情况。df 是这样的:
date timestamp type cardno status
**20201006 55737 PC010 117016056 Valid Card Exit**
20201006 55907 PC010 117016056 Valid Card Entry
20201006 60312 PC006 100024021 Valid Card Entry
20201006 61311 PC006 100024021 Valid Card Exit
20201006 61445 PC006 100024021 Valid Card Entry
20201006 61538 PC006 100024021 Valid Card Exit
20201006 61646 PC010 117016056 Valid Card Exit
20201006 61933 PC006 100024021 Valid Card Entry
20201006 61938 PC010 117016056 Valid Card Entry
20201006 62025 PC006 100024021 Valid Card Exit
20201006 62041 PC010 117016056 Valid Card Exit
20201006 62042 PC006 100024021 Valid Card Entry
20201006 62225 PC010 117016056 Valid Card Entry
20201006 62527 PC006 100024021 Valid Card Exit
20201006 63018 PC006 100024021 Valid Card Entry
20201006 64832 PC007 116057383 Valid Card Entry
20201006 64834 PC011 117016074 Valid Card Entry
**20201006 64952 PC012 116054003 Valid Card Exit**
带有 ** 的条目基本上是员工在进入之前点击退出(无论出于何种原因),这会打乱计数。我想摆脱数据框中的所有此类实例。我很难做到这一点。到目前为止,我制作的计数软件基本上是读取一个 firebird 数据库,然后从中制作不同的数据框,继续计算其形状,然后将输出作为简单的 HTML 显示在放置在场所内的大屏幕上。我上面描述的有问题的数据框在我在生产(测试)中运行的程序中称为“contractorDf”,如下所示:
import subprocess
from datetime import datetime
from datetime import date
import pandas as pd
import re
import os
import sys
#------------------------------------------------------PRODUCTION-----------------------------------------#
# Generating a Temporary Date for Production Environment
tempDate = date(2020, 10, 6)
tempDate = str(tempDate)
tempDate = tempDate.replace('-', '')
#------------------------------------------------------PRODUCTION----------------------------------------#
################################################################################################################################
# Getting Current Day (This will be used in real environment)
currentDay = datetime.now().day
if currentDay < 10:
currentDay = str(currentDay)
currentDay = '0'+ currentDay
else:
currentDay = str(currentDay)
# Getting Current Year & Month
currentYear = datetime.now().year
currentMonth = datetime.now().month
currentYear = str(currentYear)
currentMonth = str(currentMonth)
currentYearMonth = currentYear+currentMonth
currentYearMonthDay = currentYearMonth+currentDay
# Getting Variable for After FROM
currentTableName = 'ST'+currentYearMonth
# Getting Final Query (Commented Right now because Testing)
query = "SELECT * FROM " + currentYearMonth + " " + "WHERE TRDATE=" + currentYearMonthDay + ";"
finalQuery = bytes(query, 'ascii')
#############################################################################################################################
#-------------------------------------------------------PRODUCTION------------------------------------------------------#
# Making a temporary Table Name and Query for Production Environment
tempTableName = 'ST'+currentYearMonth
nonByteQuery = "SELECT * FROM " + tempTableName + " " + "WHERE TRDATE=" + tempDate + ";"
tempQuery = bytes(nonByteQuery, 'ascii')
#-------------------------------------------------------PRODUCTION------------------------------------------------------#
# Generating record.csv file from command prompt (Before initiating this, C:\\Program Files (x86)\\FireBird\\FireBird_2_1\\bin should be in the environment variables)
p = subprocess.Popen('isql', shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
p.stdin.write(b'CONNECT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\TRANS.fdb";') #The italicized b is because its a Byte size code and we can't
p.stdin.write(b'OUTPUT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv";')
p.stdin.write(tempQuery)
p.stdin.write(b'OUTPUT;')
p.communicate()
p.terminate()
# Terminating the Command Prompt Window
# Reading the record file that is just generated above
tempdf = pd.read_csv('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv', sep='delimeter', engine='python', header=None, skipinitialspace=True)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)
#tempdf = tempdf[0].astype(str)
columns = ["TRDATE", "TRTIME", "TRCODE", "TRDESC", "CTRLTAG", "CTRLNAME", "CTRLIP", "CARDNO", "STAFFNO", "STAFFNAME", "DEPTNAME", "JOBNAME", "SHIFTNAME", "DEVTYPE", "DEVNAME", "DEVNO", "TRID", "ISCAP", "RCGROUP", "POLLTIME", "SENDSEQ", "RECSEQ", "IOBNO", "IOBNAME", "ZONENO", "ZONENAME", "POINTNO", "POINTNAME", "ISSNAPRET", "PROTRAG"]
header = tempdf.iloc[0]
linespace = tempdf.iloc[1]
header = str(header)
header = header[5:]
header = header[:-24]
linespace = str(linespace)
linespace = linespace[7:]
linespace = linespace[:-23]
tempdf = tempdf[~tempdf[0].str.contains(header)]
tempdf = tempdf[~tempdf[0].str.contains(linespace)]
tempdf = tempdf[0].str.replace(' ', ',')
df = tempdf.str.split(",", n=400, expand=True)
df = df[[0,1,7,8,9,10,31,41,42,43,52,53,54]]
df[100] = df[7].map(str) + ' ' + df[8].map(str) + ' ' + df[9].map(str) + ' ' + df[10].map(str)
df = df.drop([7,8,9,10], axis=1)
df[101] = df[31].map(str) + df[41].map(str)
df = df.drop([31,41], axis=1)
df[102] = df[43].map(str) + df[52].map(str) + df[53].map(str) + df[54].map(str)
df = df.drop([43,52,53,54], axis=1)
def newblock(column):
if column[42].startswith('VIS'):
return column[42]
else:
pass
df = df.assign(newblock=df.apply(newblock, axis=1))
df[42] = df[42].str.replace('VIS_\d\d\d\d\d\d\d\d\d\d', '')
df[105] = df[42].map(str) + df[101].map(str)
df = df.drop([42,101], axis=1)
df[106] = df[102].map(str) + df['newblock'].map(str)
df = df.drop(['newblock', 102], axis=1)
df[106] = df[106].str.replace('None', '')
df = df[[0,1,106,105,100]]
columns = ['date', 'timestamp', 'type', 'cardno', 'status']
df.columns = df.columns.map(str)
df.columns = columns
df = df.reset_index()
df = df.drop(['index'], axis=1)
#Making Visitor Counter
visitorDf = df[df['type'].str.startswith('VIS')]
#visitorDf = visitorDf[~visitorDf['status'].str.contains('Unknown')]
visitorIn1 = len(visitorDf[visitorDf['status'].str.contains('Unknown')])
VisitorIn1 = int(visitorIn1)
visitorDf = visitorDf.reset_index()
visitorDf = visitorDf.drop(('index'), axis=1)
visitorIn = len(visitorDf[visitorDf['status'].str.contains('Valid Card Entry')])
visitorOut = len(visitorDf[visitorDf['status'].str.contains('Valid Card Exit')])
visitorIn = int(visitorIn)
visitorOut = int(visitorOut)
totalVisitor = visitorIn1 + visitorIn - visitorOut
#Making Contractor Counter
contractorDf = df[df['type'].str.startswith('PC')]
#contractorDf = contractorDf[~contractorDf['status'].str.contains('Unknown')]
contractorIn1 = len(contractorDf[contractorDf['status'].str.contains('Unknown')])
contractorIn1 = int(contractorIn1)
contractorDf = contractorDf.reset_index()
contractorDf = contractorDf.drop(('index'), axis=1)
contractorIn = len(contractorDf[contractorDf['status'].str.contains('Valid Card Entry')])
contractorOut = len(contractorDf[contractorDf['status'].str.contains('Valid Card Exit')])
contractorIn = int(contractorIn)
contractorOut = int(contractorOut)
totalContractor = contractorIn1 + contractorIn - contractorOut
#Making Employee Counter
employeeDf = df[df['type'].str.contains('^\d', regex=True)]
#employeeDf = employeeDf[~employeeDf['status'].str.contains('Unknown')]
employeeIn1 = len(employeeDf[employeeDf['status'].str.contains('Unknown')])
employeeIn1 = int(employeeIn1)
employeeDf = employeeDf.reset_index()
employeeDf = employeeDf.drop(('index'), axis=1)
employeeIn = len(employeeDf[employeeDf['status'].str.contains('Valid Card Entry')])
employeeOut = len(employeeDf[employeeDf['status'].str.contains('Valid Card Exit')])
employeeIn = int(employeeIn)
employeeOut = int(employeeOut)
totalEmployee = employeeIn1 + employeeIn - employeeOut
os.remove('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv')
visitor = totalVisitor
employee = totalEmployee
contractor = totalContractor
if os.path.exists('C:\\Apache24\\htdocs\\counter\\index.html'):
os.remove('c:\\Apache24\\htdocs\\counter\\index.html')
else:
pass
f = open('C:\\Apache24\\htdocs\\counter\\index.html', 'w')
message = """
<html lang="en-US" class="hide-scroll">
<head>
<title>Emhart Counter</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z" crossorigin="anonymous">
<style>
body {{
background-color: lightblue;
}}
.verticalCenter {{
margin: 0;
top: 100%;
-ms-transform: translateY(25%);
transform: translateY(25%);
}}
</style>
</head>
<body>
<center>
<div class=“verticalCenter">
<h1 style=font-size:100px>VISITORS:        {visitor}</h1><br></br><br></br>
<h1 style=font-size:100px>EMPLOYEES:        {employee}</h1><br></br><br></br>
<h1 style=font-size:100px>CONTRACTORS:        {contractor}</h1><br></br><br></br><br></br><br></br>
<h3 style=font-size: 50px>THIS IS A TEST RUN<h3>
</div>
</center>
</body>
</html>"""
new_message = message.format(visitor=visitor, employee=employee, contractor=contractor)
f.write(new_message)
f.close()
sys.exit()
剩下的唯一问题是我如何在它在contractorDf中有相应的条目之前摆脱cardno / type的出口。我真的很感激在这件事上的任何帮助。
解决方案
对于您的示例,开始于和结束于将起作用。对于更复杂的正则表达式模式,请使用contains。
mask = df.date.str.startswith("**")
print(df[mask])
# or
mask = df.status.str.endswith("**")
print(df[mask])
输出:
date timestamp type cardno status
0 **20201006 55737 PC010 117016056 Valid_Card_Exit**
3 **20201006 64952 PC012 116054003 Valid_Card_Exit**
设置:
columns = ['date','timestamp','type','cardno','status']
data = [el.split(",") for el in ['**20201006,55737,PC010,117016056,Valid_Card_Exit**',
'20201006,55907,PC010,117016056,Valid_Card_Entry',
'20201006,64834,PC011,117016074,Valid_Card_Entry',
'**20201006,64952,PC012,116054003,Valid_Card_Exit**']]
df = pd.DataFrame(data, columns=columns)
推荐阅读
- javascript - 在点击位置对齐弹出框 React antd
- c# - 使用令牌 C# 发布
- docker - 如何在 MacOS 上不打开 Docker Desktop 的情况下启动应用程序 Docker?
- c# - 异步友好,跨进程读/写锁 - .NET
- html - 使用 CSS/HTML 圆角正方形的角并在其中插入线条
- neo4j - Neo4j cypher:返回连接到两个节点和仅两个节点的所有节点
- python - 使用 Trie 在标记化的句子中搜索短语
- c++ - 尝试使用指针遍历数组,但出现浮点指针错误/无法编译
- ios - 如何使用 XCode 12.4 从 Swift 3 迁移?
- postgresql - azure postgres pg_restore 无法在容器中找到转储文件