Scraping college football rosters into an Excel file: need help organizing the data

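Problem description

I am trying to use Python to build a program that scrapes NCAA football rosters into an Excel file, but I can't figure out how to organize the data the way I want.

Currently I can scrape all the text for every player I want, including name, height and weight, hometown, and so on, but it all comes out as one big block. I want the names in one column, height and weight in another, and so on. I can't find any information on how to do this when the data isn't in a table.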


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.select import Select
from tkinter import *

window = Tk()
window.title("Roster Scraper v1.0")
window.configure(background="light grey")
window.geometry('300x250')

TeamRoster = Label(window, text="Roster URL: ", font=("Arial"), fg="gray17")
TeamRoster.grid(column=0, row=0, sticky='e')
TeamRoster.configure(background="light grey")
URLEntry = Entry(window, width=20)
URLEntry.configure(background="light grey")
URLEntry.grid(column=1, row=0)

def ScrapeScript():

    DesiredRoster = (URLEntry.get())

    driver = webdriver.Firefox()

    driver.get(DesiredRoster)

    PlayerCard = driver.find_element_by_class_name('sidearm-roster-players').text
    print(PlayerCard)


SearchButton = Button(window, text="Scrape", command=ScrapeScript)
SearchButton.grid(column=1, row=3)
SearchButton.configure(background = "light grey")

window.mainloop()

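The site I am trying to scrape is Alabama's team site: https://rolltide.com/roster.aspx?roster=226&path=football

Many college teams use this exact style of website, so not having to enter all this data by hand would be really helpful. Any help would be greatly appreciated.

Tags: python, excel, selenium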

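Solution


You need more specific rules that scrape only one part of the data in each row.

First, you can use find_elements_by_class_name (with an s in the word elements) to get all elements with the class sidearm-roster-player-name, and do the same for the classes sidearm-roster-player-position, sidearm-roster-player-class-hometown, and so on. (Selenium 4.3 removed the find_element_by_* and find_elements_by_* helpers; on current versions the equivalent call is find_elements(By.CLASS_NAME, '...').)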

all_names = driver.find_elements_by_class_name('sidearm-roster-player-name')
all_positions = driver.find_elements_by_class_name('sidearm-roster-player-position')
all_hometowns = driver.find_elements_by_class_name('sidearm-roster-player-class-hometown')

Then you can use zip() to group them into tuples (name, position, hometown, etc.):

for name, position, hometown in zip(all_names, all_positions, all_hometowns):
    print(name.text, "|", position.text, "|", hometown.text)

from selenium import webdriver

url = 'https://rolltide.com/roster.aspx?roster=226&path=football'

driver = webdriver.Firefox()
driver.get(url)

all_names = driver.find_elements_by_class_name('sidearm-roster-player-name')
all_positions = driver.find_elements_by_class_name('sidearm-roster-player-position')
all_hometowns = driver.find_elements_by_class_name('sidearm-roster-player-class-hometown')

for name, position, hometown in zip(all_names, all_positions, all_hometowns):
    print(name.text, "|", position.text, "|", hometown.text)
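
One caveat with the zip() approach is worth illustrating: zip() pairs items purely by index and stops at the shortest list, so if a single player card lacks one of the elements, the remaining values no longer line up with the right player. The sketch below uses made-up strings in place of scraped elements (all names and values are hypothetical), and it is one reason the row-by-row approach may be the safer choice.

```python
from itertools import zip_longest

# Hypothetical scraped text. Assume "Player B" has no position element on the
# page, so the positions list is one item shorter than the other two lists.
names = ["Player A", "Player B", "Player C"]
positions = ["QB", "WR"]  # "WR" actually belongs to Player C
hometowns = ["Town A", "Town B", "Town C"]

# zip() pairs by index and stops at the shortest list:
# Player B wrongly receives "WR" and Player C is dropped entirely.
rows = list(zip(names, positions, hometowns))

# zip_longest() keeps all three players, but it pads at the end, so it still
# cannot tell which player was missing the field.
padded = list(zip_longest(names, positions, hometowns, fillvalue=""))

for row in rows:
    print(" | ".join(row))
```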

For more detailed scraping you can use more specific rules, for example XPath (find_elements_by_xpath).
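
To show what such XPath rules select without needing a browser, here is a minimal sketch that runs the same kind of expressions against a small, made-up fragment using Python's built-in xml.etree.ElementTree. This is a stand-in, not Selenium: ElementTree supports only a subset of XPath, and the markup below is a simplified assumption about the page, not a copy of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page has more nesting and attributes.
SAMPLE = """
<ul class="sidearm-roster-players">
  <li>
    <div class="sidearm-roster-player-name">
      <span>1</span>
      <div class="inner"><p>Player A</p></div>
    </div>
    <div class="sidearm-roster-player-position"><span>QB</span></div>
  </li>
  <li>
    <div class="sidearm-roster-player-name">
      <span>2</span>
      <div class="inner"><p>Player B</p></div>
    </div>
    <div class="sidearm-roster-player-position"><span>WR</span></div>
  </li>
</ul>
""".strip()

root = ET.fromstring(SAMPLE)

# [@class="..."] keeps only divs whose class attribute equals that value;
# the following //p matches a <p> at any depth inside such a div.
names = [p.text for p in root.findall('.//div[@class="sidearm-roster-player-name"]//p')]

# A single / matches direct children only.
positions = [s.text for s in root.findall('.//div[@class="sidearm-roster-player-position"]/span')]

print(names, positions)
```

The leading dot makes the path relative to the element it is called on, which is what lets the same expression be reused on one row at a time.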

You can even get all the rows first and then use a for loop to scrape the elements in each row separately.


from selenium import webdriver
import csv

url = 'https://rolltide.com/roster.aspx?roster=226&path=football'

driver = webdriver.Firefox()
driver.get(url)

all_rows = driver.find_elements_by_xpath('//ul[@class="sidearm-roster-players"]//li')

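# newline='' stops the csv module from writing blank rows on Windows
fh = open('output.csv', 'w', newline='', encoding='utf-8')
csvwriter = csv.writer(fh)
# write the header row
csvwriter.writerow(["Number", "Name", "Position", "Height", "Weight", "Hometown", "Highschool", "Academic year"])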

for row in all_rows:  # use all_rows[:10] to test on the first few rows
    number = row.find_element_by_xpath('.//div[@class="sidearm-roster-player-name"]//span').text
    print('number:', number)

    name = row.find_element_by_xpath('.//div[@class="sidearm-roster-player-name"]//p').text
    #print('name:', name)

    position = row.find_element_by_xpath('.//div[@class="sidearm-roster-player-position"]/span').text
    #print('position:', position)

    height = row.find_element_by_class_name('sidearm-roster-player-height').text
    #print('height:', height)

    weight = row.find_element_by_class_name('sidearm-roster-player-weight').text
    #print('weight:', weight)

    # Some of these classes seem to appear twice in a row (the first is probably always empty), so join the text of all matches.

    hometown = row.find_elements_by_class_name('sidearm-roster-player-hometown')
    hometown = ''.join(x.text for x in hometown)
    #print('hometown:', hometown)

    highschool = row.find_elements_by_class_name('sidearm-roster-player-highschool')
    highschool = ''.join(x.text for x in highschool)
    #print('highschool:', highschool)

    academic_year = row.find_elements_by_class_name('sidearm-roster-player-academic-year')
    academic_year = ''.join(x.text for x in academic_year)
    #print('academic_year:', academic_year)

    #print('---')
    csvwriter.writerow([number, name, position, height, weight, hometown, highschool, academic_year])

fh.close()
driver.quit()  # close the browser when done
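
Since the goal is an Excel file, it may help to note that Excel opens CSV files directly. Below is a minimal sketch of the writing step on its own, under a few assumptions: the helper name write_roster, the HEADERS list, the sample rows, and the output file name are all hypothetical, and the rows stand in for values scraped as shown above.

```python
import csv
import os
import tempfile

HEADERS = ["Number", "Name", "Position", "Hometown"]


def write_roster(path, rows):
    """Write roster rows to a CSV file that Excel can open directly."""
    # newline="" lets the csv module control line endings (no blank rows on
    # Windows); utf-8-sig adds a byte-order mark so Excel detects UTF-8.
    with open(path, "w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.writer(fh)
        writer.writerow(HEADERS)
        writer.writerows(rows)


# Hypothetical rows standing in for the scraped values.
sample_rows = [
    ["1", "Player A", "QB", "Town A, Ala."],
    ["2", "Player B", "WR", "Town B, Ga."],
]

out_path = os.path.join(tempfile.gettempdir(), "roster_example.csv")
write_roster(out_path, sample_rows)
print("wrote", out_path)
```

Using a with block closes the file even if a row fails partway through, and the csv module quotes values that contain commas (such as the hometowns here) so they stay in one column.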
