How do I scrape text from the table on this page?

Problem description

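I am trying to use bs4 and selenium to scrape the words and their meanings from the word list on this page, but I am not sure how to loop through the <tr> and <td> tags after getting the table HTML from the bs4 find_all method: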

from selenium import webdriver
from bs4 import BeautifulSoup

root = "https://www.graduateshotline.com/gre-word-list.html"

# Start a browser session (assumes Chrome is installed locally)
driver = webdriver.Chrome()
driver.get(root)
content = driver.page_source
soup = BeautifulSoup(content,'html.parser')
table = soup.find_all('table',attrs={'class': 'tablex border1'})[0]

The table variable now holds the HTML of the whole table. Here is a snippet from its beginning and end:

<table class="tablex border1"> <tbody><tr><td><a href="https://gre.graduateshotline.com/a.pl?word=introspection" target="_blank">introspection</a></td>
<td>examining one's own thoughts and feelings</td></tr>
<tr><td><a href="https://gre.graduateshotline.com/a.pl?word=philanthropist" target="_blank">philanthropist</a></td>
.
.
.
<tr><td><a href="https://gre.graduateshotline.com/a.pl?word=refine" target="_blank">refine</a></td>
<td>make or become pure cultural </td></tr>
</tbody></table>

I am not sure how to get the words and their meanings out of it. Any ideas?

Tags: python, selenium, web-scraping, beautifulsoup

Solution


You can collect the table data you need with requests and pandas, like this:

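from io import StringIO

import pandas as pd
import requests

link = 'https://www.graduateshotline.com/gre-word-list.html'
# Send a browser-like User-Agent along with the request
r = requests.get(link, headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'})
# read_html returns a list of DataFrames, one per <table> found in the page
table_data = pd.read_html(StringIO(r.text))
print(table_data)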
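
If you would rather stay with bs4, as in the question, a minimal sketch of looping over the <tr> and <td> tags might look like the following. It assumes every row holds the word in its first cell and the meaning in its second, as in the snippet from the question. The function name extract_word_meanings and the sample_html string are illustrative only; on the live page you would pass driver.page_source or r.text instead.

```python
from bs4 import BeautifulSoup

# Sample HTML built from the complete rows shown in the question (illustrative only).
sample_html = """
<table class="tablex border1"> <tbody><tr><td><a href="https://gre.graduateshotline.com/a.pl?word=introspection" target="_blank">introspection</a></td>
<td>examining one's own thoughts and feelings</td></tr>
<tr><td><a href="https://gre.graduateshotline.com/a.pl?word=refine" target="_blank">refine</a></td>
<td>make or become pure cultural </td></tr>
</tbody></table>
"""


def extract_word_meanings(html_text):
    """Return a list of (word, meaning) tuples from the word-list table.

    Hypothetical helper: assumes the first <td> of each row is the word
    and the second <td> is its meaning.
    """
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table", attrs={"class": "tablex border1"})
    if table is None:
        return []

    pairs = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            # Skip rows that do not have both a word and a meaning cell.
            continue
        word = cells[0].get_text(strip=True)
        meaning = cells[1].get_text(strip=True)
        pairs.append((word, meaning))
    return pairs


if __name__ == "__main__":
    for word, meaning in extract_word_meanings(sample_html):
        print(f"{word}: {meaning}")
```

A list of tuples is used here rather than a dict so that a word appearing twice in the table is not silently overwritten.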
