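python - Remove duplicate entries from a JSON file - BeautifulSoup
Problem description
I am running a script that scrapes a website for textbook information, and the script itself works. However, when it writes to the JSON file, the output contains duplicate entries. I am trying to figure out how to remove the duplicates from the JSON file. Here is my code: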
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = ['https://open.bccampus.ca/find-open-textbooks/',
        'https://open.bccampus.ca/find-open-textbooks/?start=10']

data = []

# opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # grabs info for each textbook
    containers = page_soup.findAll("h4")
    for container in containers:
        item = {}
        item['type'] = "Textbook"
        item['title'] = container.parent.a.text
        item['author'] = container.nextSibling.findNextSibling(text=True)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + container.parent.a["href"]
        item['source'] = "BC Campus"
        data.append(item)  # add the item to the list

with open("./json/bc.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
Here is a sample of the JSON output:
{
    "type": "Textbook",
    "title": "Exploring Movie Construction and Production",
    "author": " John Reich, SUNY Genesee Community College",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
    "source": "BC Campus"
}, {
    "type": "Textbook",
    "title": "Exploring Movie Construction and Production",
    "author": " John Reich, SUNY Genesee Community College",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
    "source": "BC Campus"
}, {
    "type": "Textbook",
    "title": "Project Management",
    "author": " Adrienne Watt",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
    "source": "BC Campus"
}, {
    "type": "Textbook",
    "title": "Project Management",
    "author": " Adrienne Watt",
    "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
    "source": "BC Campus"
}
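Solution
Figured it out. Here is the solution for anyone else who runs into this problem: copy each item into a new list only if an identical item is not already there, then write that list to the file.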
# keep only the first occurrence of each item
textbook_list = []
for item in data:
    if item not in textbook_list:
        textbook_list.append(item)

with open("./json/bc.json", "w") as writeJSON:
    json.dump(textbook_list, writeJSON, ensure_ascii=False)
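The list-membership check above compares each item against every item kept so far, which is fine for a couple of pages but slows down as the list grows. A possible variation, offered only as a sketch: track the values already seen in a set and treat two entries as duplicates when their 'link' matches. The helper name dedupe_by_key, the choice of 'link' as the key, and the shortened sample links are assumptions made for illustration, not part of the original script.

```python
def dedupe_by_key(items, key="link"):
    """Return items without duplicates, keeping the first occurrence of each key value."""
    seen = set()      # key values already encountered
    unique = []       # first occurrence of each item, in original order
    for item in items:
        marker = item[key]
        if marker not in seen:
            seen.add(marker)
            unique.append(item)
    return unique


# Made-up sample data with placeholder links, shaped like the scraped items.
sample = [
    {"title": "Exploring Movie Construction and Production", "link": "?uuid=a"},
    {"title": "Exploring Movie Construction and Production", "link": "?uuid=a"},
    {"title": "Project Management", "link": "?uuid=b"},
    {"title": "Project Management", "link": "?uuid=b"},
]

unique_items = dedupe_by_key(sample)
print(len(unique_items))  # 2
```

In the script, textbook_list = dedupe_by_key(data) would take the place of the loop before the json.dump call. Note the difference in behaviour: this version considers only the chosen key, whereas the original loop drops an item only when every field matches.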