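python - I'm trying to scrape college football rosters into an Excel file and need help organizing the data
Question
I'm trying to build a Python program that scrapes NCAA football rosters into an Excel file, but I can't figure out how to organize the data the way I want.
Right now I can scrape all the text for every player, including name, height and weight, hometown, and so on, but it all comes out as one big block. I want the names in one column, height and weight in another, and so on. Since the data isn't in a table, I can't find any information on how to do this.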
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.select import Select
from tkinter import *

window = Tk()
window.title("Roster Scraper v1.0")
window.configure(background="light grey")
window.geometry('300x250')

TeamRoster = Label(window, text="Roster URL: ", font=("Arial"), fg="gray17")
TeamRoster.grid(column=0, row=0, sticky='e')
TeamRoster.configure(background="light grey")

URLEntry = Entry(window, width=20)
URLEntry.configure(background="light grey")
URLEntry.grid(column=1, row=0)

def ScrapeScript():
    DesiredRoster = (URLEntry.get())
    driver = webdriver.Firefox()
    driver.get(DesiredRoster)
    PlayerCard = driver.find_element_by_class_name('sidearm-roster-players').text
    print(PlayerCard)

SearchButton = Button(window, text="Scrape", command=ScrapeScript)
SearchButton.grid(column=1, row=3)
SearchButton.configure(background = "light grey")

window.mainloop()
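The site I'm trying to scrape is Alabama's team site: https://rolltide.com/roster.aspx?roster=226&path=football
Many college teams use this exact style of site, so not having to enter all this data by hand would be really helpful. Any help would be greatly appreciated.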
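Answer
You need more specific rules that pick out individual pieces of data from each row instead of grabbing the whole list as one string.
First, you can use find_elements_by_class_name (with an s in the word elements) to get all elements with the class sidearm-roster-player-name, and do the same separately for the classes sidearm-roster-player-position, sidearm-roster-player-class-hometown, and so on. Note that Selenium 4.3 and later removed the find_element_by_* and find_elements_by_* helpers; there the equivalent call is driver.find_elements(By.CLASS_NAME, '...'), with By imported from selenium.webdriver.common.by.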
all_names = driver.find_elements_by_class_name('sidearm-roster-player-name')
all_positions = driver.find_elements_by_class_name('sidearm-roster-player-position')
all_hometowns = driver.find_elements_by_class_name('sidearm-roster-player-class-hometown')
Then you can use zip() to build tuples (name, position, hometown, etc.):
for name, position, hometown in zip(all_names, all_positions, all_hometowns):
    print(name.text, "|", position.text, "|", hometown.text)
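
One thing worth knowing about this approach is that zip() pairs items purely by position and stops at the shortest list. The sketch below shows that behavior with plain strings; the sample values and the pair_columns helper are made up for illustration and stand in for the .text of scraped elements.

```python
# Hypothetical sample values standing in for the .text of scraped elements.
names = ["Player One", "Player Two", "Player Three"]
positions = ["QB", "WR", "LB"]
hometowns = ["Mobile, Ala.", "Dallas, Texas"]  # one value missing on purpose


def pair_columns(*columns):
    """Combine parallel lists into rows, one tuple per player."""
    return list(zip(*columns))


rows = pair_columns(names, positions, hometowns)
for row in rows:
    print(" | ".join(row))

# zip() stops at the shortest list, so the third player is silently dropped.
print(len(rows))
```

If any player on the page lacks one of the fields, the lists can drift out of step in the same way, which is one reason to prefer scraping row by row.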

Complete working example:

from selenium import webdriver

url = 'https://rolltide.com/roster.aspx?roster=226&path=football'

driver = webdriver.Firefox()
driver.get(url)

all_names = driver.find_elements_by_class_name('sidearm-roster-player-name')
all_positions = driver.find_elements_by_class_name('sidearm-roster-player-position')
all_hometowns = driver.find_elements_by_class_name('sidearm-roster-player-class-hometown')

for name, position, hometown in zip(all_names, all_positions, all_hometowns):
    print(name.text, "|", position.text, "|", hometown.text)

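For more detailed scraping you can use more complex rules with XPath (find_elements_by_xpath).
You can even scrape all the rows first and then use a for-loop to scrape the elements within each row separately.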
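
The row-first idea can be tried offline before pointing a browser at the site. The following is only a rough sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for Selenium, the markup and player values are invented and much simpler than the real page, and the parse_rows helper is hypothetical. ElementTree also supports only a subset of XPath. The point it illustrates is that a path starting with ".//" searches inside the current row rather than the whole document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the real page is far larger.
SAMPLE = """
<ul class="sidearm-roster-players">
  <li>
    <div class="sidearm-roster-player-name"><span>1</span><p>Player One</p></div>
    <div class="sidearm-roster-player-position"><span>QB</span></div>
  </li>
  <li>
    <div class="sidearm-roster-player-name"><span>2</span><p>Player Two</p></div>
    <div class="sidearm-roster-player-position"><span>WR</span></div>
  </li>
</ul>
"""


def parse_rows(markup):
    """Find every <li> row, then look up each field relative to that row."""
    root = ET.fromstring(markup)
    rows = []
    for li in root.findall(".//li"):
        # The leading dot keeps each search inside the current <li>.
        number = li.find(".//div[@class='sidearm-roster-player-name']/span").text
        name = li.find(".//div[@class='sidearm-roster-player-name']/p").text
        position = li.find(".//div[@class='sidearm-roster-player-position']/span").text
        rows.append((number, name, position))
    return rows


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

Because every lookup is scoped to one row, a field can never be paired with the wrong player, which is the main advantage over zipping separate lists.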
from selenium import webdriver
import csv

url = 'https://rolltide.com/roster.aspx?roster=226&path=football'

driver = webdriver.Firefox()
driver.get(url)

all_rows = driver.find_elements_by_xpath('//ul[@class="sidearm-roster-players"]//li')

# newline='' keeps the csv module from writing blank lines between rows on Windows;
# the with-block closes the file even if scraping a row raises an error
with open('output.csv', 'w', newline='') as fh:
    csvwriter = csv.writer(fh)

    # write headers
    csvwriter.writerow(["Number", "Name", "Position", "Height", "Weight", "Hometown", "Highschool", "Academic year"])

    for row in all_rows: #[:10]:
        # the leading dot in './/' makes each XPath relative to the current row
        number = row.find_element_by_xpath('.//div[@class="sidearm-roster-player-name"]//span').text
        print('number:', number)

        name = row.find_element_by_xpath('.//div[@class="sidearm-roster-player-name"]//p').text
        #print('name:', name)

        position = row.find_element_by_xpath('.//div[@class="sidearm-roster-player-position"]/span').text
        #print('position:', position)

        height = row.find_element_by_class_name('sidearm-roster-player-height').text
        #print('height:', height)

        weight = row.find_element_by_class_name('sidearm-roster-player-weight').text
        #print('weight:', weight)

        # it seems some classes have two elements in a row - the first is probably always empty, but I join all elements
        hometown = row.find_elements_by_class_name('sidearm-roster-player-hometown')
        hometown = ''.join(x.text for x in hometown)
        #print('hometown:', hometown)

        highschool = row.find_elements_by_class_name('sidearm-roster-player-highschool')
        highschool = ''.join(x.text for x in highschool)
        #print('highschool:', highschool)

        academic_year = row.find_elements_by_class_name('sidearm-roster-player-academic-year')
        academic_year = ''.join(x.text for x in academic_year)
        #print('academic_year:', academic_year)

        #print('---')
        csvwriter.writerow([number, name, position, height, weight, hometown, highschool, academic_year])
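
Since the goal is a file that opens cleanly in Excel, here is a small sketch of the CSV step on its own, with no browser involved. It is an illustration under assumptions rather than part of the answer above: the player rows are invented, the FIELDS list is shortened, and write_roster and read_roster are hypothetical helper names. It uses csv.DictWriter so each value is tied to a named column, and the utf-8-sig encoding, which adds a byte-order mark that helps Excel detect UTF-8.

```python
import csv
import os
import tempfile

FIELDS = ["Number", "Name", "Position", "Height", "Weight"]

# Hypothetical rows standing in for values scraped from the roster page.
players = [
    {"Number": "1", "Name": "Player One", "Position": "QB", "Height": "6-2", "Weight": "210"},
    {"Number": "2", "Name": "Player Two", "Position": "WR", "Height": "6-0", "Weight": "195"},
]


def write_roster(path, rows):
    """Write a header row followed by one line per player."""
    # newline='' stops the csv module from emitting blank lines on Windows.
    with open(path, "w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_roster(path):
    """Read the file back as a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8-sig") as fh:
        return list(csv.DictReader(fh))


path = os.path.join(tempfile.mkdtemp(), "roster.csv")
write_roster(path, players)
round_trip = read_roster(path)
print(round_trip[0]["Name"])
```

Reading the file back is a quick way to confirm the columns line up before opening it in Excel. Be aware that Excel may auto-format values such as 6-2 as dates when it opens a CSV, so the height column may need to be imported as text.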