Web scraping a JavaScript-based table

Problem description

I am trying to scrape the contents of a table, including phone numbers, but I cannot extract all of the data.

Here is my code:

import urllib.request
from bs4 import BeautifulSoup


def make_soup(url):
    # Download the page and parse the raw HTML.
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata


soup = make_soup("https://www.camicb.org/find-a-cmca")

# Print the text of every table cell.
for record in soup.find_all('tr'):
    for data in record.find_all('td'):
        print(data.string)

Tags: python, web-scraping, beautifulsoup

Solution


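If you use Selenium, you can get the data: it drives a real browser, so the JavaScript that builds the table actually runs, which a plain urllib request never does. You can then click through all the pages on the site and scrape each one. First, install Selenium.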

sudo pip3 install selenium 

(On Windows you don't need sudo, and you may need pip instead of pip3.)

Then get the driver from https://sites.google.com/a/chromium.org/chromedriver/downloads (depending on your operating system, you may need to specify the driver's location).
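If ChromeDriver is not on your PATH, its location can be passed explicitly. The following is a minimal sketch, assuming Selenium 4, where the path goes through a `Service` object. The `CHROMEDRIVER_PATH` environment variable and both helper names are made up for illustration and are not part of Selenium.

```python
import os
import shutil


def resolve_driver_path():
    """Return a chromedriver path, or None if none can be found.

    CHROMEDRIVER_PATH is a hypothetical environment variable used only in
    this sketch; if it is unset, the PATH is searched instead.
    """
    return os.environ.get("CHROMEDRIVER_PATH") or shutil.which("chromedriver")


def make_driver():
    """Start Chrome, pointing Selenium at an explicit driver when one is found."""
    # Imported here so the path lookup above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    path = resolve_driver_path()
    if path is None:
        # No explicit path available: fall back to Selenium's own driver lookup.
        return webdriver.Chrome()
    return webdriver.Chrome(service=Service(executable_path=path))
```

With this in place, `driver = make_driver()` could stand in for a bare `webdriver.Chrome()` call.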

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def parse(html):
    # Print the full name and phone number of every listed professional.
    soup = BeautifulSoup(html, "html.parser")
    for detail in soup.find_all('div', {'class': 'professional-details'}):
        print(detail.find('div', {'class': 'fullname'}).get_text(strip=True),
              detail.find('div', {'class': 'phone'}).get_text(strip=True))


# Request the first page.
driver = webdriver.Chrome()
url = "https://www.camicb.org/find-a-cmca"
driver.get(url)

while True:
    try:
        # Give the JavaScript time to render, then parse the current page.
        time.sleep(3)
        parse(driver.page_source)
        # Find the next-page button and click it.
        # find_element(By.ID, ...) replaces find_element_by_id, which newer
        # Selenium 4 releases no longer provide.
        driver.find_element(By.ID, 'A6').click()
    except NoSuchElementException:
        # No next-page button: the last page has been reached.
        break
driver.quit()
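
The fixed `time.sleep(3)` in the loop either wastes time on fast loads or is too short on slow ones. One alternative is to poll until the rows have appeared. The sketch below is a generic, standard-library-only helper (the name `wait_until` is made up here, not a Selenium function). Selenium also ships `WebDriverWait` in `selenium.webdriver.support.ui`, which applies the same polling idea to a driver.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

In the scraping loop, the sleep could then become something like `wait_until(lambda: driver.find_elements(By.CSS_SELECTOR, 'div.professional-details'))`, assuming `By` is imported from `selenium.webdriver.common.by`. Note that this only checks that some rows exist, not that the next page has replaced the previous one, so a short pause after the click may still be needed.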

Output:

Ms. Sandy                             Aaron (602) 692-5494
Mrs. Zafera                             Aaron (425) 283-5858  (103)
Mr. Rick                             Abair 
Mr. Karmel Ahmed                            Abbas 971-0504805551
...
Ms. Kimberely Ann                            van Heel (949) 285-0111
Mr. Willem Schalk                            van Schalkwyk (617) 777-3761
Mr. Gary                             van der Laan (407) 781-5769
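
The wide gaps in the names above are whitespace inside the `fullname` text: `get_text(strip=True)` trims only the ends of each piece of text, not the runs in the middle. Below is a minimal sketch of collapsing those gaps and writing the rows as CSV instead of printing them. The helper names and the two-column layout are assumptions for illustration, and the sample rows are taken from the output above.

```python
import csv
import io


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    return " ".join(text.split())


def rows_to_csv(rows):
    """Serialize (name, phone) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "phone"])
    for name, phone in rows:
        writer.writerow([collapse_whitespace(name), collapse_whitespace(phone)])
    return buffer.getvalue()


sample = [
    ("Ms. Sandy                             Aaron", "(602) 692-5494"),
    ("Mr. Rick                             Abair", ""),
]
print(rows_to_csv(sample))
```

To use this with the script, `parse` would need to return the (name, phone) pairs rather than print them, so that the rows from every page can be collected into one list first.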
