Pages at the scraped result links won't open

Problem description

This is my code for scraping Google search results.

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']

    start_urls = [
        f'https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []

        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]
            item['link'] = links[idx].lstrip("/url?q=")
            items.append(item)
            df = pd.DataFrame(items, columns=['title', 'link'])
            writer = pd.ExcelWriter('test1.xlsx', engine='xlsxwriter')
            df.to_excel(writer, sheet_name='test1.xlsx')
            writer.save()
        return items

It gives me nine title/link items from the results page for this query:

https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0

But when I open the Excel file (test1.xlsx), none of the links open correctly. I have added the following to `settings.py`:

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"

ROBOTSTXT_OBEY = False

Tags: python, scrapy

Solution


If you look closely at the extracted URLs, they all carry `sa`, `ved` and `usg` query parameters. These are not part of the target site's URL; they are Google search result tracking parameters.
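As a side note, the original `lstrip("/url?q=")` call would not have produced clean URLs even without those parameters: `str.lstrip` removes any leading characters drawn from the given *set*, not a literal prefix. A quick illustration (the URLs here are made up):

```python
# A href of the shape Google emits on the results page (made-up example)
link = "/url?q=https://example.com/page&sa=U&ved=2ahUKE&usg=AOvVaw0"

# lstrip strips the prefix characters, but the tracking parameters
# after the target URL are still attached:
print(link.lstrip("/url?q="))
# -> https://example.com/page&sa=U&ved=2ahUKE&usg=AOvVaw0

# Worse, lstrip treats its argument as a character set, so a target
# whose hostname happens to start with those characters gets eaten too:
print("/url?q=urlx.example".lstrip("/url?q="))
# -> x.example
```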

To get only the target URL, parse the link with the standard-library `urllib` and extract just the `q` query parameter.

from urllib.parse import urlparse, parse_qs

parsed_url = urlparse(url)
query_params = parse_qs(parsed_url.query)
target_url = query_params["q"][0]
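For instance, applied to a made-up href of the shape Google returns (`/url?q=...&sa=...&ved=...&usg=...`), the snippet above recovers just the target URL:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href as extracted from a Google results page
link = "/url?q=https://www.example.com/iphone12&sa=U&ved=2ahUKE&usg=AOvVaw"

parsed = urlparse(link)
params = parse_qs(parsed.query)  # {'q': [...], 'sa': [...], 'ved': [...], 'usg': [...]}
print(params["q"][0])
# -> https://www.example.com/iphone12
```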

Full working code:

import scrapy
import pandas as pd
from urllib.parse import urlparse, parse_qs

# GoogleScraperItem is defined in the project's items module


class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']

    start_urls = [
        'https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []

        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]

            # Parse the item url and keep only the q query parameter
            parsed_url = urlparse(links[idx])
            query_params = parse_qs(parsed_url.query)
            item['link'] = query_params["q"][0]

            items.append(item)

        # Write the spreadsheet once, after collecting all items
        # (ExcelWriter.save() was removed in pandas 2.0; use the
        # context manager instead)
        df = pd.DataFrame(items, columns=['title', 'link'])
        with pd.ExcelWriter('test1.xlsx', engine='xlsxwriter') as writer:
            df.to_excel(writer, sheet_name='test1')

        return items
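One optional hardening step, beyond the original answer: some anchors on a results page (for example internal Google navigation links) carry no `q` parameter at all, in which case `query_params["q"]` raises a `KeyError`. A small sketch with a fallback to the raw href (the helper name `extract_target_url` is illustrative, not from the code above):

```python
from urllib.parse import urlparse, parse_qs

def extract_target_url(href):
    """Return the q= target of a Google "/url?q=..." redirect href,
    or the href unchanged if it has no q parameter."""
    params = parse_qs(urlparse(href).query)
    return params.get("q", [href])[0]

print(extract_target_url("/url?q=https://example.com/a&sa=U"))
# -> https://example.com/a
print(extract_target_url("/search?hl=en&start=10"))
# -> /search?hl=en&start=10  (no q parameter, href kept as-is)
```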
