首页 > 解决方案 > 如何根据条件删除 Pandas 数据框中的重复项

问题描述

我正在制作一个小型计数软件,它基本上可以计算一个场所内的总人数。我从微控制器数据库(允许人们进出)获得的数据框存在人为错误,有时用户在进入之前退出。因此,数据框中存在一个条目在另一个后续条目之前有多个出口的情况。df 是这样的:

date     timestamp  type    cardno      status
**20201006  55737   PC010   117016056   Valid Card Exit**
20201006    55907   PC010   117016056   Valid Card Entry
20201006    60312   PC006   100024021   Valid Card Entry
20201006    61311   PC006   100024021   Valid Card Exit
20201006    61445   PC006   100024021   Valid Card Entry
20201006    61538   PC006   100024021   Valid Card Exit
20201006    61646   PC010   117016056   Valid Card Exit
20201006    61933   PC006   100024021   Valid Card Entry
20201006    61938   PC010   117016056   Valid Card Entry
20201006    62025   PC006   100024021   Valid Card Exit
20201006    62041   PC010   117016056   Valid Card Exit
20201006    62042   PC006   100024021   Valid Card Entry
20201006    62225   PC010   117016056   Valid Card Entry
20201006    62527   PC006   100024021   Valid Card Exit
20201006    63018   PC006   100024021   Valid Card Entry
20201006    64832   PC007   116057383   Valid Card Entry
20201006    64834   PC011   117016074   Valid Card Entry
**20201006  64952   PC012   116054003   Valid Card Exit**

带有 ** 的条目基本上是员工在进入之前点击退出(无论出于何种原因),这会打乱计数。我想摆脱数据框中的所有此类实例。我很难做到这一点。到目前为止,我制作的计数软件基本上是读取一个 firebird 数据库,然后从中制作不同的数据框,继续计算其形状,然后将输出作为简单的 HTML 显示在放置在场所内的大屏幕上。我上面描述的有问题的数据框在我在生产(测试)中运行的程序中称为“contractorDf”,如下所示:

import subprocess
from datetime import datetime
from datetime import date
import pandas as pd
import re
import os
import sys
   
#------------------------------------------------------PRODUCTION-----------------------------------------#
# Generating a Temporary Date for Production Environment
tempDate = date(2020, 10, 6)
tempDate = str(tempDate)
tempDate = tempDate.replace('-', '')
#------------------------------------------------------PRODUCTION----------------------------------------#
   
################################################################################################################################
# Getting Current Day (This will be used in real environment)
currentDay = datetime.now().day

if currentDay < 10:
    currentDay = str(currentDay)
    currentDay = '0'+ currentDay
else:
    currentDay = str(currentDay)


# Getting Current Year & Month
currentYear = datetime.now().year
currentMonth = datetime.now().month
currentYear = str(currentYear)
currentMonth = str(currentMonth)
currentYearMonth = currentYear+currentMonth
currentYearMonthDay = currentYearMonth+currentDay

# Getting Variable for After FROM
currentTableName = 'ST'+currentYearMonth

# Getting Final Query (Commented Right now because Testing)
query = "SELECT * FROM " + currentYearMonth + " " + "WHERE TRDATE=" + currentYearMonthDay + ";"
finalQuery = bytes(query, 'ascii')
#############################################################################################################################


#-------------------------------------------------------PRODUCTION------------------------------------------------------#
# Making a temporary Table Name and Query for Production Environment
tempTableName = 'ST'+currentYearMonth
nonByteQuery = "SELECT * FROM " + tempTableName + " " + "WHERE TRDATE=" + tempDate + ";"
tempQuery = bytes(nonByteQuery, 'ascii')
#-------------------------------------------------------PRODUCTION------------------------------------------------------#



# Generating record.csv file from command prompt (Before initiating this, C:\\Program Files (x86)\\FireBird\\FireBird_2_1\\bin should be in the environment variables)
p = subprocess.Popen('isql', shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
p.stdin.write(b'CONNECT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\TRANS.fdb";') #The italicized b is because its a Byte size code and we can't 
p.stdin.write(b'OUTPUT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv";')
p.stdin.write(tempQuery)
p.stdin.write(b'OUTPUT;')
p.communicate()
p.terminate()
# Terminating the Command Prompt Window



# Reading the record file that is just generated above
tempdf = pd.read_csv('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv', sep='delimeter', engine='python', header=None, skipinitialspace=True)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)

#tempdf = tempdf[0].astype(str)
columns = ["TRDATE", "TRTIME", "TRCODE", "TRDESC", "CTRLTAG", "CTRLNAME", "CTRLIP", "CARDNO", "STAFFNO", "STAFFNAME", "DEPTNAME", "JOBNAME", "SHIFTNAME", "DEVTYPE", "DEVNAME", "DEVNO", "TRID", "ISCAP", "RCGROUP", "POLLTIME", "SENDSEQ", "RECSEQ", "IOBNO", "IOBNAME", "ZONENO", "ZONENAME", "POINTNO", "POINTNAME", "ISSNAPRET", "PROTRAG"]
header = tempdf.iloc[0]
linespace = tempdf.iloc[1]
header = str(header)
header = header[5:]
header = header[:-24]
linespace = str(linespace)
linespace = linespace[7:]
linespace = linespace[:-23]

tempdf = tempdf[~tempdf[0].str.contains(header)]
tempdf = tempdf[~tempdf[0].str.contains(linespace)]
tempdf = tempdf[0].str.replace(' ', ',')
df = tempdf.str.split(",", n=400, expand=True)
df = df[[0,1,7,8,9,10,31,41,42,43,52,53,54]]
df[100] = df[7].map(str) + ' ' + df[8].map(str) + ' ' + df[9].map(str) + ' ' + df[10].map(str)
df = df.drop([7,8,9,10], axis=1)
df[101] = df[31].map(str) + df[41].map(str)
df = df.drop([31,41], axis=1)
df[102] = df[43].map(str) + df[52].map(str) + df[53].map(str) + df[54].map(str)
df = df.drop([43,52,53,54], axis=1)

def newblock(column):
    if column[42].startswith('VIS'):
        return column[42]
    else:
        pass


df = df.assign(newblock=df.apply(newblock, axis=1))

df[42] = df[42].str.replace('VIS_\d\d\d\d\d\d\d\d\d\d', '')

df[105] = df[42].map(str) + df[101].map(str)
df = df.drop([42,101], axis=1)
df[106] = df[102].map(str) + df['newblock'].map(str)
df = df.drop(['newblock', 102], axis=1)
df[106] = df[106].str.replace('None', '')
df = df[[0,1,106,105,100]]
columns = ['date', 'timestamp', 'type', 'cardno', 'status']
df.columns = df.columns.map(str)
df.columns = columns
df = df.reset_index()
df = df.drop(['index'], axis=1)




#Making Visitor Counter
visitorDf = df[df['type'].str.startswith('VIS')]
#visitorDf = visitorDf[~visitorDf['status'].str.contains('Unknown')]
visitorIn1 = len(visitorDf[visitorDf['status'].str.contains('Unknown')])
VisitorIn1 = int(visitorIn1)
visitorDf = visitorDf.reset_index()
visitorDf = visitorDf.drop(('index'), axis=1)
visitorIn = len(visitorDf[visitorDf['status'].str.contains('Valid Card Entry')])
visitorOut = len(visitorDf[visitorDf['status'].str.contains('Valid Card Exit')])
visitorIn = int(visitorIn)
visitorOut = int(visitorOut)
totalVisitor = visitorIn1 + visitorIn - visitorOut

#Making Contractor Counter
contractorDf = df[df['type'].str.startswith('PC')]
#contractorDf = contractorDf[~contractorDf['status'].str.contains('Unknown')]
contractorIn1 = len(contractorDf[contractorDf['status'].str.contains('Unknown')])
contractorIn1 = int(contractorIn1)
contractorDf = contractorDf.reset_index()
contractorDf = contractorDf.drop(('index'), axis=1)
contractorIn = len(contractorDf[contractorDf['status'].str.contains('Valid Card Entry')])
contractorOut = len(contractorDf[contractorDf['status'].str.contains('Valid Card Exit')])
contractorIn = int(contractorIn)
contractorOut = int(contractorOut)
totalContractor = contractorIn1 + contractorIn - contractorOut


#Making Employee Counter
employeeDf = df[df['type'].str.contains('^\d', regex=True)]
#employeeDf = employeeDf[~employeeDf['status'].str.contains('Unknown')]
employeeIn1 = len(employeeDf[employeeDf['status'].str.contains('Unknown')])
employeeIn1 = int(employeeIn1)
employeeDf = employeeDf.reset_index()
employeeDf = employeeDf.drop(('index'), axis=1)
employeeIn = len(employeeDf[employeeDf['status'].str.contains('Valid Card Entry')])
employeeOut = len(employeeDf[employeeDf['status'].str.contains('Valid Card Exit')])
employeeIn = int(employeeIn)
employeeOut = int(employeeOut)
totalEmployee = employeeIn1 + employeeIn - employeeOut


os.remove('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv')

visitor = totalVisitor
employee = totalEmployee
contractor = totalContractor

if os.path.exists('C:\\Apache24\\htdocs\\counter\\index.html'):
    os.remove('c:\\Apache24\\htdocs\\counter\\index.html')
else:
    pass

f = open('C:\\Apache24\\htdocs\\counter\\index.html', 'w')

message = """
<html lang="en-US" class="hide-scroll">
    <head>
        <title>Emhart Counter</title>
        <meta charset="utf-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z" crossorigin="anonymous">
        <style>
        body {{
            background-color: lightblue;
        }}

        .verticalCenter {{
            margin: 0;
            top: 100%;
            -ms-transform: translateY(25%);
            transform: translateY(25%);
        }}
        </style>
    </head>
    <body>
        <center>
            <div class=“verticalCenter">
                <h1 style=font-size:100px>VISITORS: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp; {visitor}</h1><br></br><br></br>
                <h1 style=font-size:100px>EMPLOYEES: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp; {employee}</h1><br></br><br></br>
                <h1 style=font-size:100px>CONTRACTORS: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp; {contractor}</h1><br></br><br></br><br></br><br></br>
                <h3 style=font-size: 50px>THIS IS A TEST RUN<h3>
            </div>
        </center>
    </body>
</html>"""


new_message = message.format(visitor=visitor, employee=employee, contractor=contractor)
f.write(new_message)
f.close()


sys.exit()

剩下的唯一问题是我如何在它在contractorDf中有相应的条目之前摆脱cardno / type的出口。我真的很感激在这件事上的任何帮助。

标签: pythonpandas

解决方案


对于您的示例,开始于和结束于将起作用。对于更复杂的正则表达式模式,请使用contains

mask = df.date.str.startswith("**")
print(df[mask])

# or

mask = df.status.str.endswith("**")
print(df[mask])

输出:

         date timestamp   type     cardno             status
0  **20201006     55737  PC010  117016056  Valid_Card_Exit**
3  **20201006     64952  PC012  116054003  Valid_Card_Exit**

设置:

columns = ['date','timestamp','type','cardno','status']
data = [el.split(",") for el in ['**20201006,55737,PC010,117016056,Valid_Card_Exit**',
'20201006,55907,PC010,117016056,Valid_Card_Entry',
'20201006,64834,PC011,117016074,Valid_Card_Entry',
'**20201006,64952,PC012,116054003,Valid_Card_Exit**']]
df = pd.DataFrame(data, columns=columns)

推荐阅读