web-scraping - Scraping all the questions a person has answered on Quora with Beautiful Soup

Problem description

How can I write Beautiful Soup code to scrape all the questions that a specific user has answered?

Input:
The author's URL
Example: https://www.quora.com/profile/AUTHOR/answers

Output:
Column 1: the question the author answered
Example: "Lorem Ipsum question"

Column 2: the URL of the answered question
Example: https://www.quora.com/lorem-ipsum-question
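
The two-column output described above could be written as CSV with Python's standard csv module. This is only a sketch: the function name rows_to_csv, the header labels, and the sample row are illustrative assumptions, not part of the question.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (question, url) pairs to CSV text with a header row.

    `rows_to_csv` and the header labels are hypothetical names for this sketch.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["question", "url"])
    for question, url in rows:
        # The csv module quotes fields that contain commas or quotes.
        writer.writerow([question, url])
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [("Lorem Ipsum question", "https://www.quora.com/lorem-ipsum-question")]
    print(rows_to_csv(sample))
```

Building the text in memory keeps the sketch easy to check; writing to a file would only need an open file object in place of the StringIO buffer.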

Tags: web-scraping, beautifulsoup, quora

I think the easiest way is to use Selenium:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# Selenium 4 style: the geckodriver path is passed through a Service object.
driver = webdriver.Firefox(service=Service(executable_path='c:/program/geckodriver.exe'))

url = 'https://www.quora.com/profile/Nana-Bello-Shehu/answers'
driver.get(url)

# Seconds to wait after each scroll so newly loaded answers can render.
SCROLL_TIME = 2

last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll to the bottom to trigger loading of more answers.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_TIME)

    # Stop once the page height no longer grows.
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

qbox = driver.find_elements(By.CSS_SELECTOR, '.qu-pb--medium')
for qb in qbox:
    print(qb.find_element(By.CSS_SELECTOR, 'span.qu-userSelect--text').text)
    # get_attribute('href') already returns an absolute URL, so no prefix is added.
    # Note: in the output below this selector matched the author's profile link,
    # not the question link; adjust the selector if the question URL is needed.
    print(qb.find_element(By.CSS_SELECTOR, 'a.q-box.qu-cursor--pointer.qu-hover--textDecoration--underline').get_attribute('href'))
    print('\n')

driver.quit()

Output:

Do pictures speak louder than words?
https://www.quora.com/profile/Nana-Bello-Shehu


Does true love exist?
https://www.quora.com/profile/Nana-Bello-Shehu


What picture made your blood boil?
https://www.quora.com/profile/Nana-Bello-Shehu


What are the before and after pics of people who are drug addicts for several years?
https://www.quora.com/profile/Nana-Bello-Shehu


What was the funniest thing you saw/heard today?
https://www.quora.com/profile/Nana-Bello-Shehu


Are there any truly selfless acts, motives, or people?
https://www.quora.com/profile/Nana-Bello-Shehu

And so on...

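This script scrolls to the end of the page and collects every question. You can try a lower SCROLL_TIME to make the script faster, but with a shorter wait the script sometimes stops before the page has actually finished loading.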
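
One way to reduce the risk of stopping early with a short SCROLL_TIME is to require the page height to stay unchanged for several consecutive checks before giving up. The sketch below assumes this retry idea, which is not part of the original answer; scroll_to_end, max_stable_checks, and the FakeDriver stand-in are hypothetical names. It relies only on the driver's execute_script method, as used in the script above.

```python
import time


def scroll_to_end(driver, scroll_time=0.5, max_stable_checks=3):
    """Scroll until the page height stays unchanged for several checks in a row.

    Returns the number of scroll steps performed. `driver` is any object with
    an execute_script method, for example a Selenium WebDriver.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    stable_checks = 0
    steps = 0
    while stable_checks < max_stable_checks:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_time)
        steps += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            stable_checks += 1  # no growth yet; loading may be slow, so check again
        else:
            stable_checks = 0   # the page grew, so reset the counter
            last_height = new_height
    return steps


class FakeDriver:
    """Hypothetical stand-in for a WebDriver, used only to exercise the loop."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._index]
        # A scroll request advances to the next recorded height, if any.
        self._index = min(self._index + 1, len(self._heights) - 1)
        return None
```

With max_stable_checks=1 the helper behaves like the original loop; larger values trade a few extra waits at the end of the page for more tolerance of slow loading.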

Notes:

You need Selenium.
You need Firefox.
You need geckodriver. The script currently loads it from c:/program/geckodriver.exe, so if you put geckodriver at a different path you need to change executable_path.
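
Instead of editing a hard-coded path, the geckodriver location could be resolved at startup. The sketch below is an assumption, not part of the original answer: find_geckodriver is a hypothetical helper that prefers an explicit path and otherwise falls back to searching PATH with shutil.which.

```python
import os
import shutil


def find_geckodriver(explicit_path=None):
    """Return a usable geckodriver path, or None if none can be found.

    `find_geckodriver` is a hypothetical helper: it checks the explicit path
    first, then searches the directories listed in PATH.
    """
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    return shutil.which("geckodriver")


if __name__ == "__main__":
    path = find_geckodriver("c:/program/geckodriver.exe")
    if path is None:
        print("geckodriver not found; install it or pass its location explicitly")
    else:
        print("using geckodriver at", path)
```

The returned path could then be passed as executable_path when the driver is created.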
