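python - Web scraping in Python with BeautifulSoup
Problem description
I'm new to web scraping, and I'm stuck scraping a web page that contains some quotes I want to extract.
Could you check my code, which is meant to copy the scraped data into a CSV file?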
import requests
from bs4 import BeautifulSoup
import csv

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')

quotes = []  # a list to store quotes
table = soup.find('div', attrs={'id': 'container'})
for row in table.findAll('div', attrs={'class': 'quote'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.h6.text
    quote['author'] = row.p.text
    quotes.append(quote)

filename = 'inspirational_quotes.csv'
with open(filename, 'wb') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
I get an error at the "findAll" call:
    for row in table.findAll('div', attrs = {'class':'quote'}):
AttributeError: 'NoneType' object has no attribute 'findAll'
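For context, the traceback means that `soup.find('div', attrs={'id': 'container'})` returned `None`: `find` returns `None` when nothing on the page matches, so the later `table.findAll(...)` call fails. Below is a minimal sketch of guarding against that case. The HTML string and the `find_quote_rows` helper are made up for illustration and are not the real page markup; the sketch uses the built-in `html.parser` so it runs without html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page differs.
SAMPLE_HTML = """
<div id="all_quotes">
  <div class="quote"><h5>Hope</h5></div>
  <div class="quote"><h5>Courage</h5></div>
</div>
"""


def find_quote_rows(html, container_id):
    """Return the quote divs inside the container, or [] if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"id": container_id})
    if container is None:
        # find() returns None when nothing matches; calling a method on
        # that None is what raises the AttributeError in the traceback.
        return []
    return container.find_all("div", attrs={"class": "quote"})


missing = find_quote_rows(SAMPLE_HTML, "container")   # no such id in the sample
present = find_quote_rows(SAMPLE_HTML, "all_quotes")  # id exists in the sample
print(len(missing), len(present))
```

Checking the result of `find` before using it turns a confusing `AttributeError` into an explicit "container not found" case, which is usually a sign that the selector no longer matches the page.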
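Solution
The site's HTML is different from what your script assumes. I've corrected the first three fields; I think you can do the rest yourself. The following should work for you.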
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.passiton.com/inspirational-quotes?page={}"

quotes = []
page = 1
while True:
    r = requests.get(URL.format(page))
    print(r.url)
    soup = BeautifulSoup(r.content, 'html5lib')
    if not soup.select_one("#all_quotes .text-center > a"):
        break
    for row in soup.select("#all_quotes .text-center"):
        quote = {}
        try:
            quote['quote'] = row.select_one('a img.shadow').get("alt")
        except AttributeError:
            quote['quote'] = ""
        try:
            quote['url'] = row.select_one('a').get('href')
        except AttributeError:
            quote['url'] = ""
        try:
            quote['img'] = row.select_one('a img.shadow').get('src')
        except AttributeError:
            quote['img'] = ""
        quotes.append(quote)
    page += 1

with open('inspirational_quotes.csv', 'w', newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, ['quote', 'url', 'img'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
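One difference from the question's code is worth pointing out: the file here is opened in text mode (`'w'` with `newline=""`) rather than `'wb'`. In Python 3 the csv module writes strings, so a binary file object raises a `TypeError`. The sketch below illustrates this with made-up rows and in-memory buffers standing in for real files; `rows_to_csv` is a hypothetical helper name, not part of the code above.

```python
import csv
import io

FIELDS = ["quote", "url", "img"]


def rows_to_csv(rows, fieldnames=FIELDS):
    """Serialize a list of dicts to CSV text using a text-mode buffer."""
    buffer = io.StringIO(newline="")  # text mode, no newline translation
    writer = csv.DictWriter(buffer, fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows for illustration; real values come from the scraper.
sample_rows = [
    {"quote": "Be kind, always.", "url": "/q/1", "img": "/img/1.jpg"},
    {"quote": "", "url": "/q/2", "img": ""},
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)

# A binary buffer behaves like a file opened with 'wb' in Python 3:
# csv tries to write str to it and fails.
try:
    csv.DictWriter(io.BytesIO(), FIELDS).writeheader()
    binary_ok = True
except TypeError:
    binary_ok = False
print("binary mode works:", binary_ok)
```

Passing `newline=""` when opening a real file lets the csv module control line endings itself, which avoids blank lines between rows on Windows.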