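html - How to save the data from every page to a CSV file
Problem description
I'm working on a scraping project and am trying to scrape information from 13 pages. The pages all have the same structure; the only thing that changes is the URL.
I can scrape each page with a for loop, and I can see the information from every page in the terminal. But when I save it to a CSV file, only the information from the last page, page 13, is saved.
I'm sure I'm missing something, but I can't figure out what. Thanks!
I'm using Python 3.7 and BeautifulSoup for the scraping.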
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# Page numbers 1 through 13, as strings so they can be appended to the URL
pages = [str(i) for i in range(1, 14)]

for page in pages:
    my_url = 'Myurl/=' + page

    # Download and parse the page
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")

    # Every table with class "hello" holds one record
    containers = page_soup.findAll("table", {"class": "hello"})
    container = containers[0]

    filename = "Full.csv"
    f = open(filename, "w")
    headers = "Aa, Ab, Ac, Ad, Ba, Bb, Bc, Bd\n"
    f.write(headers)

    for container in containers:
        td_tags = container.find_all('td')
        A = td_tags[0]
        B = td_tags[2]

        Aa = A.a.text
        Ab = A.span.text
        Ac = A.find('span', attrs={'class': 'boxes'}).text.strip()
        Ad = td_tags[1].text
        Ba = B.a.text
        Bb = B.span.text
        Bc = B.find('span', attrs={'class': 'boxes'}).text.strip()
        Bd = td_tags[3].text

        print("Aa:" + Aa)
        print("Ab:" + Ab)
        print("Ac:" + Ac)
        print("Ad:" + Ad)
        print("Ba:" + Ba)
        print("Bb:" + Bb)
        print("Bc:" + Bc)
        print("Bd:" + Bd)

        # Commas inside Ac and Bc are replaced so they do not split the CSV row
        f.write(Aa + "," + Ab + "," + Ac.replace(",", "|") + "," + Ad + "," + Ba + "," + Bb + "," + Bc.replace(",", "|") + "," + Bd + "\n")

    f.close()
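Edit: Also, if anyone has a good idea for how to identify and record the page number for each container, that would be very helpful too. Thanks again!
Solution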
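Opening the file with mode "w" truncates it, and because the open call sits inside the page loop, every page overwrites the one before it, leaving only page 13. Open the file in append mode so each write is added to the end instead: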
with open(filename, "a") as myfile:
    myfile.write(Aa + "," + Ab + "," + Ac.replace(",", "|") + "," + Ad + "," + Ba + "," + Bb + "," + Bc.replace(",", "|") + "," + Bd + "\n")
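Append mode has two side effects worth knowing: if the header write stays inside the page loop, the header line is repeated once per page, and re-running the script keeps adding rows to the old file. An alternative, sketched below, is to open the file once before the loop and let the standard-library `csv` module handle quoting, so fields containing commas no longer need the `replace(",", "|")` workaround. The function name `save_pages`, the `HEADERS` constant, and the extra `Page` column are assumptions made for illustration (the column addresses the page-number request in the question's edit); they are not part of the original code.

```python
import csv

# Column names from the question, plus a hypothetical "Page" column
# that records which page each row was scraped from.
HEADERS = ["Page", "Aa", "Ab", "Ac", "Ad", "Ba", "Bb", "Bc", "Bd"]


def save_pages(filename, pages):
    """Write the rows of every page to a single CSV file.

    `pages` is any iterable of (page_number, rows) pairs, where each row
    is a sequence of the eight field values in the order Aa .. Bd.
    Returns the number of data rows written.
    """
    count = 0
    # The file is opened exactly once, outside the page loop, so mode "w"
    # starts a fresh file per run without discarding earlier pages.
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADERS)  # header is written a single time
        for page_number, rows in pages:
            for row in rows:
                # csv.writer quotes fields that contain commas.
                writer.writerow([page_number, *row])
                count += 1
    return count
```

In the scraping script, this could be fed by a generator that fetches each of the 13 URLs and yields `(page, rows)`, where `rows` is the list of `[Aa, Ab, Ac, Ad, Ba, Bb, Bc, Bd]` values extracted from that page's containers. Every row then carries its page number in the first column.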