python - Extracting data from a specific page with Python BeautifulSoup
Question
I'm very new to Python and BeautifulSoup. I wrote the code below to try to call up the website (https://www.fangraphs.com/depthcharts.aspx?position=Team), scrape the data in the table, and export it to a CSV file. I was able to write code to pull data from other tables on the site, but not this particular one. It keeps returning: AttributeError: 'NoneType' object has no attribute 'find'. I've been racking my brain trying to figure out what I'm doing wrong. Do I have the wrong "class" name? Again, I'm very new and trying to teach myself; I've been learning through trial and error and by reverse-engineering other people's code. This one has me stumped. Any guidance?
import requests
import csv
import datetime
from bs4 import BeautifulSoup
# static urls
season = datetime.datetime.now().year
URL = "https://www.fangraphs.com/depthcharts.aspx?position=Team".format(season=season)
# request the data
batting_html = requests.get(URL).text
def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", {"class": "tablesoreder, depth_chart tablesorter tablesorter-default"})

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find("tbody").find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')
Answer
The traceback looks like:
AttributeError Traceback (most recent call last)
<ipython-input-4-ee944e08f675> in <module>()
41 writer.writerows(rows)
42
---> 43 parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')
<ipython-input-4-ee944e08f675> in parse_array_from_fangraphs_html(input_html, out_file_name)
20
21 # get headers
---> 22 headers_html = table.find("thead").find_all("th")
23 headers = []
24 for header in headers_html:
AttributeError: 'NoneType' object has no attribute 'find'
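This error pattern is worth recognizing: find() returns None when nothing matches the filter, and chaining another .find() onto that None is what raises the AttributeError. A minimal sketch (with a made-up class name for illustration):

```python
from bs4 import BeautifulSoup

html = "<table class='stats'><thead><tr><th>Team</th></tr></thead></table>"
soup = BeautifulSoup(html, "html.parser")

# No table has this class, so find() returns None...
table = soup.find("table", {"class": "no-such-class"})
print(table)  # None

# ...and calling .find() on None raises:
# AttributeError: 'NoneType' object has no attribute 'find'
```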
So yes, the problem is in the
table = soup.find("table", {"class": "tablesoreder, depth_chart tablesorter tablesorter-default"})
statement. You could modify it to split the class attribute on spaces, as other users suggested. But then you would fail again, because the parsed table has no tbody.
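The reason splitting matters: class is a multi-valued attribute, and a plain string filter matches a single class token (or the exact full attribute string), not an arbitrary subset of the tokens. A small sketch with hypothetical markup mimicking the Fangraphs table's class attribute (note the stray comma in "tablesoreder," is part of the page's own markup):

```python
from bs4 import BeautifulSoup

html = '<table class="tablesoreder, depth_chart tablesorter tablesorter-default"><tr><td>x</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Two tokens as one string match neither a single token
# nor the exact full attribute value:
print(soup.find("table", {"class": "depth_chart tablesorter"}))  # None

# Matching on one individual token works:
print(soup.find("table", class_="depth_chart") is not None)  # True

# So does a CSS selector:
print(soup.select_one("table.depth_chart") is not None)  # True
```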
The fixed script looks like:
import requests
import csv
import datetime
from bs4 import BeautifulSoup
# static urls
season = datetime.datetime.now().year
URL = "https://www.fangraphs.com/depthcharts.aspx?position=Team".format(season=season)
# request the data
batting_html = requests.get(URL).text
def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", class_=["tablesoreder,", "depth_chart", "tablesorter", "tablesorter-default"])

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')
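The same parsing logic can be checked offline without hitting the live site. A sketch using a hypothetical miniature table (not the real Fangraphs markup) and an in-memory buffer instead of a file; it also skips rows with no td cells, since find_all("tr") picks up the header row too:

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the real Fangraphs markup
sample_html = """
<table class="tablesoreder, depth_chart tablesorter tablesorter-default">
  <thead><tr><th>Team</th><th>WAR</th></tr></thead>
  <tr><td>Yankees</td><td>50.1</td></tr>
  <tr><td>Dodgers</td><td>52.3</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="depth_chart")

headers = [th.text for th in table.find("thead").find_all("th")]
# skip rows with no <td> cells (the header row) to avoid blank CSV lines
rows = [
    [td.text for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]

# write to an in-memory buffer instead of a file for the demo
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(headers)
writer.writerows(rows)
print(buf.getvalue())
```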