python - I am scraping Html table they show me the error 'AttributeError: 'NoneType' object has no attribute 'select''
问题描述
I am scraping Html table they show me the error 'AttributeError: 'NoneType' object has no attribute 'select' try to solve it
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}
r = requests.get("https://capitalonebank2.bluematrix.com/sellside/Disclosures.action")
soup = BeautifulSoup(r.content, "lxml")
table = soup.find('table',attrs={'style':"border"})
all_data = []
for row in table.select("tr:has(td)"):
tds = [td.get_text(strip=True) for td in row.select("td")]
all_data.append(tds)
df = pd.DataFrame(all_data, columns=header)
print(df)
解决方案
您试图抓取的网站似乎阻止了requests
图书馆发送的请求。为了解决这个问题,我使用Selenium
了自动浏览网站的库。下面的代码收集了表中给出的标题。
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
browser = webdriver.Chrome()
browser.get("https://capitalonebank2.bluematrix.com/sellside/Disclosures.action")
soup = BeautifulSoup(browser.page_source, "lxml")
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"}
all_data = [i.text.strip() for i in soup.select("option")]
df = pd.DataFrame(all_data, columns=["Titles"])
print(df)
输出:
Titles
0 Agree Realty Corporation (ADC)
1 American Campus Communities, Inc. (ACC)
2 Antero Midstream Corporation (AM)
3 Antero Resources Corporation (AR)
4 Apache Corp. (APA)
.. ...
126 W. P. Carey Inc. (WPC)
127 Washington Real Estate Investment Trust (WRE)
128 Welltower Inc. (WELL)
129 Western Midstream Partners, LP (WES)
130 Whiting Petroleum Corporation (WLL)
如果您之前没有使用Selenium
过,请不要忘记安装chromedriver.exe
并将其添加到 PATH 环境变量中。您也可以手动将驱动程序的位置提供给构造函数。
更新代码以提取额外信息
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
browser = webdriver.Chrome()
browser.get("https://capitalonebank2.bluematrix.com/sellside/Disclosures.action")
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"}
for title in browser.find_elements_by_css_selector('option'):
title.click()
time.sleep(1)
browser.switch_to.frame(browser.find_elements_by_css_selector("iframe")[1])
table = browser.find_element_by_css_selector("table table")
soup = BeautifulSoup(table.get_attribute("innerHTML"), "lxml")
all_data = []
ratings = {"BUY":[], "HOLD":[], "SELL":[]}
lists_ = []
for row in soup.select("tr")[-4:-1]:
info_list = row.select("td")
count = info_list[1].text
percent = info_list[2].text
IBServ_count = info_list[4].text
IBServ_percent = info_list[5].text
lists_.append([count, percent, IBServ_count, IBServ_percent])
ratings["BUY"] = lists_[0]
ratings["HOLD"] = lists_[1]
ratings["SELL"] = lists_[2]
print(ratings)
browser.switch_to.default_content()
推荐阅读
- typescript - 使用开始时间和持续时间值获取结束时间
- java - 是否有一种通用的方法来反序列化杰克逊中的单值值对象(没有自定义反序列化器)
- python - 使用 keras.preprocessing.tokenizer 或 nltk.tokenize 哪个更好
- javascript - 世博互动推送通知
- javascript - 将“必需”添加到选择框
- iis - mapbox 文件 pbf 阻止了 IIS 服务器
- c# - 如何将 REST API 中的数据导入 SQL Server?
- c++ - OpenCL C++ Bindings:如何为 enqueueWriteBuffer 竞争实现回调
- ios - 如何获取已通过 snapkit 设置的 UIView 的框架
- javascript - 如何停止使用 jsp 或 php 为两个标签选择重复选项?