python - Scraping a data table from a website using names
Problem description
While scraping a website I've run into a peculiar situation. I'm searching for hundreds of names through the site's search bar and then scraping a table for each one. However, some names in my list are unique and spelled differently than they are on the website. I looked up a few of these names on the site manually, and the search still took me straight to the individual pages. Other times, when several people share the same or a similar name, the search lands on a list of names (in that case I want the person who played in the NBA; I've thought this through and figured it was worth mentioning). How can I still get to these players' individual pages without having to rerun the script and chase errors each time to find out which player's name is spelled slightly differently? To repeat: even when a spelling differs slightly, or when the search returns a list of names (I need the one in the NBA), a name in my array should take me to a single player page. Some examples are Georgios Papagiannis (listed on the site as George Papagiannis), Ognjen Kuzmic (listed as Ognen Kuzmic), and Nene (listed as Maybyner Nene, but the search leads to a list — https://basketball.realgm.com/search?q=nene). This seems hard, but I feel it should be possible. Also, it looks like instead of writing all the scraped data to the CSV, each player's data gets overwritten by the next one. Thanks a million.
The error I get:
AttributeError: 'NoneType' object has no attribute 'text'
import requests
from bs4 import BeautifulSoup
import pandas as pd

playernames = ['Carlos Delfino', 'Nene', 'Yao Ming', 'Marcus Vinicius', 'Raul Neto', 'Timothe Luwawu-Cabarrot']

result = pd.DataFrame()
for name in playernames:
    fname = name.split(" ")[0]
    lname = name.split(" ")[1]
    url = "https://basketball.realgm.com/search?q={}+{}".format(fname, lname)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    if soup.find('a', text=name).text == name:
        url = "https://basketball.realgm.com" + soup.find('a', text=name)['href']
        print(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')

    try:
        table1 = soup.find('h2', text='International Regular Season Stats - Per Game').findNext('table')
        table2 = soup.find('h2', text='International Regular Season Stats - Advanced Stats').findNext('table')
        df1 = pd.read_html(str(table1))[0]
        df2 = pd.read_html(str(table2))[0]
        commonCols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=commonCols)
        df['Player'] = name
        print(df)
    except:
        print('No international table for %s.' % name)
        df = pd.DataFrame([name], columns=['Player'])

result = result.append(df, sort=False).reset_index(drop=True)
cols = list(result.columns)
cols = [cols[-1]] + cols[:-1]
result = result[cols]
result.to_csv('international players.csv', index=False)
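The AttributeError comes from soup.find('a', text=name) returning None whenever no link's text matches the name exactly, so chaining .text fails. A minimal sketch of guarding against that, using a hypothetical HTML snippet rather than a live page:

```python
from bs4 import BeautifulSoup

# Hypothetical search-result markup with one exact-text player link
html = '<html><body><a href="/player/yao-ming/Summary/28">Yao Ming</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns None when nothing matches, so check before using .text or ['href']
link = soup.find('a', string='Nene')  # no exact match in this snippet
if link is None:
    print('No exact match - fall back to parsing the search-results list')
else:
    print(link['href'])
```

Checking for None (or using a try/except around the lookup) lets the script keep going for names whose spelling differs from the site.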
Solution
I used a loop over the NBA players with similar names. You can get the NBA players from the search-results table with the CSS selector below:
.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]
What the CSS selector means: find the table with the tablesaw class, find its tr children that contain an a whose href contains /nba/teams/, and within those rows select the a tags whose href contains /player/.
I added Search Player Name and Real Player Name columns so you can see how each player was found. These columns are placed first and second using insert (see the comments in the code).
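On a small self-contained snippet (hypothetical markup mimicking the search-results table, not copied from the site) you can see the selector keep only rows that contain an NBA team link:

```python
from bs4 import BeautifulSoup

# Hypothetical search-results rows: the first has an NBA team link, the second does not
html = '''
<table class="tablesaw">
  <tr>
    <td><a href="/player/Nene/Summary/111">Maybyner Nene</a></td>
    <td><a href="/nba/teams/Houston-Rockets/11/Home">Houston Rockets</a></td>
  </tr>
  <tr>
    <td><a href="/player/Some-Guy/Summary/222">Some Guy</a></td>
    <td>College Team</td>
  </tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# Only the row containing an /nba/teams/ link yields its /player/ link
for a in soup.select('.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]'):
    print(a.text, a['href'])
```

Note that the :has() pseudo-class requires the soupsieve backend, which ships with BeautifulSoup 4.7+.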
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://basketball.realgm.com'
player_names = ['Carlos Delfino', 'Nene', 'Yao Ming', 'Marcus Vinicius', 'Raul Neto', 'Timothe Luwawu-Cabarrot']
result = pd.DataFrame()

def get_player_stats(search_name=None, real_name=None, player_soup=None):
    table_per_game = player_soup.find('h2', text='International Regular Season Stats - Per Game')
    table_advanced_stats = player_soup.find('h2', text='International Regular Season Stats - Advanced Stats')

    if table_per_game and table_advanced_stats:
        print('International table for %s.' % search_name)
        df1 = pd.read_html(str(table_per_game.findNext('table')))[0]
        df2 = pd.read_html(str(table_advanced_stats.findNext('table')))[0]
        common_cols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=common_cols)

        # insert name columns at the first positions
        df.insert(0, 'Search Player Name', search_name)
        df.insert(1, 'Real Player Name', real_name)
    else:
        print('No international table for %s.' % search_name)
        df = pd.DataFrame([[search_name, real_name]], columns=['Search Player Name', 'Real Player Name'])

    return df

for name in player_names:
    url = f'{base_url}/search?q={name.replace(" ", "+")}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    if url == response.url:
        # The search did not redirect, so we are on a results list: get all NBA players
        for player in soup.select('.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]'):
            response = requests.get(base_url + player['href'])
            player_soup = BeautifulSoup(response.content, 'lxml')
            player_data = get_player_stats(search_name=player.text, real_name=name, player_soup=player_soup)
            result = result.append(player_data, sort=False).reset_index(drop=True)
    else:
        # The search redirected straight to a single player page
        player_data = get_player_stats(search_name=name, real_name=name, player_soup=soup)
        result = result.append(player_data, sort=False).reset_index(drop=True)

result.to_csv('international players.csv', index=False)
# Append to an existing file instead:
# result.to_csv('international players.csv', index=False, mode='a')
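One caveat about the commented-out append mode: to_csv writes a header row on every call, so repeated runs would interleave headers with data. A small sketch (the helper name append_result is hypothetical) that writes the header only on the first run:

```python
import os
import pandas as pd

def append_result(df, path='international players.csv'):
    # Write the header only when the file does not exist yet;
    # later runs append data rows without repeating the header
    write_header = not os.path.exists(path)
    df.to_csv(path, index=False, mode='a', header=write_header)

# hypothetical usage with the two name columns from the solution
append_result(pd.DataFrame([['Nene', 'Maybyner Nene']],
                           columns=['Search Player Name', 'Real Player Name']))
```

This keeps the file readable by pd.read_csv no matter how many times the script is rerun.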