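python - Unable to scrape a GraphQL page using requests
Question
I'm trying to scrape company names and their corresponding links from a webpage using the requests module.
Although the content is dynamic, I can see that it is available in window.props.
So I thought I'd dig out that portion and process it with json, but I see \u0022 sequences around the values instead of quotes ("). This is what I mean: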
{\u0022firms\u0022: [{\u0022index\u0022: 1, \u0022slug\u0022: \u0022zjjz\u002Datelier\u0022, \u0022name\u0022:
I've tried:
import re
import json
import requests
from bs4 import BeautifulSoup

link = 'https://architizer.com/firms/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    r = s.get(link)
    items = re.findall(r'window.props[^"]+(.*?);',r.text)[0].strip('"').replace('\u0022', '\'')
    print(items)
How can I get the names and links of the different companies from that webpage, traversing multiple pages, using requests?
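A note on the \u0022 sequences: in Python source, '\u0022' is already the " character, so .replace('\u0022', ...) never matches the six-character escape (backslash, u, 0, 0, 2, 2) that sits in the page text. The following is a minimal sketch of one way to decode such text, assuming the escaped blob contains no raw double quotes. The helper name decode_js_escaped_json and the "Example Firm" value are made up for illustration; only the slug comes from the sample above.

```python
import json

# Escaped text shaped like the sample above. The "name" value is a
# made-up placeholder, not data from the real page.
raw = (
    r'{\u0022firms\u0022: [{\u0022index\u0022: 1, '
    r'\u0022slug\u0022: \u0022zjjz\u002Datelier\u0022, '
    r'\u0022name\u0022: \u0022Example Firm\u0022}]}'
)


def decode_js_escaped_json(escaped: str) -> dict:
    """Decode JSON whose quotes were written as \\uXXXX escapes.

    Assumes `escaped` contains no raw double quotes.
    """
    # Step 1: wrap the text in quotes and parse it as a JSON string
    # literal, which turns every \uXXXX escape into its character.
    unescaped = json.loads(f'"{escaped}"')
    # Step 2: the result is ordinary JSON text, so parse it again.
    return json.loads(unescaped)


props = decode_js_escaped_json(raw)
for firm in props["firms"]:
    print(firm["name"], firm["slug"])
```

This only covers the data embedded in the first page load; later pages still have to be requested separately.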
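Answer
Well, that was interesting.
You're dealing with a page powered by GraphQL, so you have to mimic the request correctly.
Also, they expect you to send a Referer header and a CSRF token. The token can easily be extracted from the HTML of the initial request and reused in subsequent requests.
Here's my take on it: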
import time
import requests
from bs4 import BeautifulSoup
link = 'https://architizer.com/firms/'
query = """{ allFirmsWithProjects( first: 6, after: "6", firmType: "Architecture / Design Firm", firmName: "All Firm Names", projectType: "All Project Types", projectLocation: "All Project Locations", firmLocation: "All Firm Locations", orderBy: "recently-featured", affiliationSlug: "", ) { firms: edges { cursor node { index id: firmId slug: firmSlug name: firmName projectsCount: firmProjectsCount lastProjectDate: firmLastProjectDate media: firmLogoUrl projects { edges { node { slug: slug media: heroUrl mediaId: heroId isHiddenFromListings } } } } } pageInfo { hasNextPage endCursor } totalCount } }"""
def query_graphql(page_number: int = 6) -> dict:
    # Swap the cursor into the query; uses the module-level session `s` created below.
    q = query.replace('after: "6"', f'after: "{page_number}"')
    return s.post(
        "https://architizer.com/api/v3.0/graphql",
        json={"query": q},
    ).json()


def has_next_page(graphql_response: dict) -> bool:
    return graphql_response["data"]["allFirmsWithProjects"]["pageInfo"]["hasNextPage"]


def get_next_page(graphql_response: dict) -> int:
    return graphql_response["data"]["allFirmsWithProjects"]["pageInfo"]["endCursor"]


def get_firms_data(graphql_response: dict) -> list:
    return graphql_response["data"]["allFirmsWithProjects"]["firms"]


def parse_firms_data(firms: list) -> str:
    return "\n".join(firm["node"]["name"] for firm in firms)


def wait_a_bit(wait_for: float = 1.5):
    # Pause between requests to go easy on the server.
    time.sleep(wait_for)


with requests.Session() as s:
    s.headers["user-agent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
    s.headers["referer"] = "https://architizer.com/firms/"
    # Pull the CSRF token out of the initial HTML and reuse it on every request.
    csrf_token = BeautifulSoup(
        s.get(link).text, "html.parser"
    ).find("input", {"name": "csrfmiddlewaretoken"})["value"]
    s.headers.update({"x-csrftoken": csrf_token})

    response = query_graphql()
    while True:
        # Print before checking for a next page so the last page is not skipped.
        print(parse_firms_data(get_firms_data(response)))
        if not has_next_page(response):
            break
        wait_a_bit()
        response = query_graphql(get_next_page(response))
For the sake of the example, this should output the firm names:
Brooks + Scarpa Architects
Studio Saxe
NiMa Design
Best Practice Architecture
Gensler
Inca Hernandez
kaa studio
Taller Sintesis
Coryn Kempster and Julia Jamrozik
Franklin Azzi Architecture
Wittman Estes
Masfernandez Arquitectos
MATIAS LOPEZ LLOVET
SRG Partnership, Inc.
GANA Arquitectura
Meyer & Associates Architects, Urban Designers
Steyn Studio
BGLA architecture | urban design
and so on ...
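The question also asked for links, and the query already returns each firm's slug. Here is a minimal sketch of pairing names with URLs, assuming a firm's page lives at the listing URL followed by its slug. That URL pattern is a guess and is not confirmed anywhere above. The helper name firm_names_and_links and the sample edge are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://architizer.com/firms/"  # listing URL from the question


def firm_names_and_links(firms: list) -> list:
    """Return (name, url) pairs for a list of `firms` edges."""
    pairs = []
    for firm in firms:
        node = firm["node"]
        # Assumption: firm page URL = listing URL + slug + "/".
        pairs.append((node["name"], urljoin(BASE_URL, node["slug"] + "/")))
    return pairs


# Hypothetical edge using the `name` and `slug` aliases from the query.
sample = [{"node": {"name": "Example Firm", "slug": "zjjz-atelier"}}]
for name, url in firm_names_and_links(sample):
    print(name, url)
```

In the script above, the list returned by get_firms_data(response) could be passed to this helper in place of parse_firms_data.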