mysql - Mysql /忽略多列上的重复项
问题描述
下午的人
继上一个问题之后,
我创建了一个小型 python 应用程序,它可以从网站上为我抓取招聘信息。
我将它们保存在一个小的 sql 数据库中
列是 Job_id / job_title / job company / job_salary / job_location / job_post_date
我计划每天运行一次我的脚本,并希望忽略重复的条目。
什么样的查询可以检查 2 列 IE Title 和 company 以确保它不会再次插入,发布的日期将始终每天增加 1
Python代码
import mysql.connector
import requests
from bs4 import BeautifulSoup
my_db = mysql.connector.connect(
host="192.168.1.1",
user='job_user',
password='job1',
database='job_db'
)
my_cursor = my_db.cursor()
radius = "10"
val1 = input("Enter Job: ")
val2 = input("Enter Location: ")
val3 = input("Enter Radius default:(10): ")
url = "https://www.indeed.co.uk/jobs?q={}&l={}&radius={}".format(val1, val2, val3)
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
result1 = soup.find_all(class_="jobsearch-SerpJobCard")
for results in result1:
job = results.find('a', attrs={'data-tn-element': 'jobTitle'})
company = results.find('span', attrs={'class': 'company'})
location = results.find('span', attrs={'class': 'location accessible-contrast-color-location'})
salary = results.find('span', attrs={'class': 'salaryText'})
date_pos = results.find('span', attrs={'class': 'date'})
i1 = job.text.strip()
i2 = company.text.strip()
if location is not None:
i3 = location.text.strip()
else:
i3 = "N/A"
if salary is not None:
i4 = salary.text.strip()
else:
i4 = "N/A"
i5 = date_pos.text.strip()[:1]
print(i1)
print(i2)
print(i3)
print(i4)
print("\n")
sql = "INSERT INTO job_tbl (job_title, job_company, job_salary, job_location, job_posted) \
VALUES (%s, %s, %s, %s, %s)"
val = (i1, i2, i3, i4, i5)
my_cursor.execute(sql, val)
my_db.commit()
SQL 查询
+--------+-------------------------------------------------------------------+-------------------------------+----------------+----------------------------+------------+
| job_id | job_title | job_company | job_salary | job_location | job_posted |
+--------+-------------------------------------------------------------------+-------------------------------+----------------+----------------------------+------------+
| 1 | IT Technician | Strathallan School | N/A | £19,000 - £23,000 a year | 3 |
| 2 | English into Romanian IT/Technical Translator (relocation to... | Alpha CRC Ltd. | N/A | £22,000 - £25,000 a year | 7 |
| 3 | IT/Trainee IT Support Analyst | CJ Lang & Son Limited | Dundee DD4 8JU | N/A | 3 |
| 4 | IT Technical Support Apprentice | GP Strategies Training Ltd | Dundee | £10,000 - £12,000 a year | 1 |
| 5 | IT Operations Manager - IRC84524 | Scottish Government | Dundee DD1 | £48,930 - £61,006 a year | 3 |
| 6 | Temporary IT Buyer | brightsolid | Dundee | N/A | 7 |
| 7 | IT Site Support Analyst | Thermo Fisher Scientific | Perth | N/A | 6 |
| 8 | Network and System Administrator | Solutions Driven | Forfar | £30,000 - £35,000 a year | 3 |
| 9 | IT Service Desk Team Leader | Cross Resourcing | Dundee | N/A | 3 |
| 10 | Senior Network Engineer | Raytheon Intelligence & Space | Glenrothes | N/A | 3 |
| 11 | Solutions Architect | NCR | Dundee | N/A | 3 |
| 12 | Technical Support Specialist | The Army | N/A | £15,985 - £20,400 a year | 3 |
| 13 | Pre-Sales Solutions Architect – 12 Month Graduate Internship... | DELL | N/A | N/A | 3 |
+--------+-------------------------------------------------------------------+-------------------------------+----------------+----------------------------+------------+
13 rows in set (0.002 sec)
如果我再次运行相同的应用程序,它将添加相同的结果,我想要的是匹配一个 Title & Company 并检查它是否已经被添加
解决方案
您通常会对该列元组设置唯一约束,并update ... on duplicate key
在插入时使用,因此不会插入重复项,而是更新当前行的日期。
所以像:
create table mytable (
id int primary key auto_increment,
title varchar(50), -- ajust the size as needed
company varchar(50),
salary int,
location varchar(50),
post_date datetime
default current_timestamp, -- not mandatory, but maybe helpful?
unique (title, company)
);
然后:
insert into mytable (title, company, salary, location)
values (?, ?, ?, ?)
on duplicate key update post_date = current_timestamp;
推荐阅读
- lua - 对表数组进行排序
- java - 在 .jar 包中找不到 Java / Mvn 本地资源
- javascript - 用 javascript 编写的数字时钟代码在 HTML 中显示错误的时间和偏移量
- android - 如何为我的整个颤振应用程序提供线性渐变?
- java - 没有合格的 bean - FxWeaver 和 Spring Boot
- regex - 正则表达式位于路径中间的 Nginx 位置
- angular - 使用 Jasmine 和 Karma 进行角度单元测试时出错
- javascript - nuxtjs + express,为什么返回html?
- azure - 通过 Azure API 管理下载 Azure Blob 文件失败并出现错误 - BackendConnectionFailure: at transfer-response
- python - NumPy 获取多维数组中的索引