Web Scraping in Python Using BeautifulSoup

Problem Description

I'm new to web scraping, and I'm stuck on a page containing some quotes I want to extract.

Could you take a look at my code, which writes the scraped data to a CSV file?

Here it is:

import requests
from bs4 import BeautifulSoup
import csv

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

quotes=[] # a list to store quotes

table = soup.find('div', attrs = {'id':'container'})

for row in table.findAll('div', attrs = {'class':'quote'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.h6.text
    quote['author'] = row.p.text
    quotes.append(quote)

filename = 'inspirational_quotes.csv'
with open(filename, 'wb') as f:
    w = csv.DictWriter(f,['theme','url','img','lines','author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)

我在"findAll"函数中遇到错误。

for row in table.findAll('div', attrs = {'class':'quote'}):    
AttributeError: 'NoneType' object has no attribute 'findAll'

Tags: python, web-scraping, beautifulsoup

Solution


The site's HTML is different from what you defined in your script. I have corrected the first three fields; I think you can do the rest. The following should work for you.
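You can confirm the mismatch yourself with two quick lookups. This is just a diagnostic sketch; the "#all_quotes .text-center" selector is taken from the corrected script below, and "#container" is the id your original script assumed:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.passiton.com/inspirational-quotes")
soup = BeautifulSoup(r.content, 'html5lib')

# The id your script searches for is not on the page, so find() returns
# None and the later findAll() call raises the AttributeError you saw.
print(soup.find('div', attrs={'id': 'container'}))   # None
# The quote cells actually live under this container:
print(len(soup.select("#all_quotes .text-center")))  # non-zero count

With that confirmed, here is the corrected script: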

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.passiton.com/inspirational-quotes?page={}"

quotes = []
page = 1

while True:
    r = requests.get(URL.format(page))
    print(r.url)
    soup = BeautifulSoup(r.content, 'html5lib')

    # Stop once a page no longer contains any quote links,
    # i.e. we have paged past the last page of results.
    if not soup.select_one("#all_quotes .text-center > a"):
        break

    for row in soup.select("#all_quotes .text-center"):
        quote = {}
        try:
            # The quote text is stored in the image's alt attribute.
            quote['quote'] = row.select_one('a img.shadow').get("alt")
        except AttributeError:
            quote['quote'] = ""
        try:
            quote['url'] = row.select_one('a').get('href')
        except AttributeError:
            quote['url'] = ""
        try:
            quote['img'] = row.select_one('a img.shadow').get('src')
        except AttributeError:
            quote['img'] = ""
        quotes.append(quote)

    page += 1

with open('inspirational_quotes.csv', 'w', newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f,['quote','url','img'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
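
Opening the file with 'w', newline="" and an explicit encoding also fixes the open(filename, 'wb') call from your original script: in Python 3 the csv module expects a text-mode file object. Once the script has finished, one quick way to sanity-check the output is to read the CSV back in; this throwaway check is not part of the scraper itself:

import csv

with open('inspirational_quotes.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

print(len(rows), "quotes scraped")
if rows:
    print(rows[0])  # first record: {'quote': ..., 'url': ..., 'img': ...}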
