Unable to get all the names from the landing page of a website

Problem description

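I've written a script in Python to get the names of all the universities listed on a webpage. The site shows only 50 names on its landing page; the rest become visible only after clicking a button named "show more members". I'd like to get all the names from that page without using any browser simulator, since I can see that the remaining names are available in the page source inside a script tag.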

Website address: https://www.abhe.org/directory/

I've tried:

import requests
from bs4 import BeautifulSoup

link = 'https://www.abhe.org/directory/'

r = requests.get(link,headers={"user-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(r.text,"lxml")
for item in soup.select("h2 > a[title]"):
    print(item.text)

The script above only fetches the first 50 names.

How can I get all the names from that webpage without using any browser simulator?

Tags: python, python-3.x, web-scraping

Solution


I took a different route:

import re
import requests
from bs4 import BeautifulSoup

url = r'https://www.abhe.org/directory'
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')


js_data = soup.find_all('script')  # Get all script tags
js_data_2 = [i for i in js_data if len(i) > 0]  # Drop script tags that have no contents
js_dict = {k: v for k, v in enumerate(js_data_2)}  # Index the tags for easy reference
data = str(js_dict[10])  # Our target is key 10 (position-dependent; may shift if the page changes)

# Clean up results
data2 = data.replace('<script>\r\n\t\tw2dc_map_markers_attrs_array.push(new w2dc_map_markers_attrs(\'e5d47824e4fcfb7ab0345a0c7faaa5d2\',','').strip()

# Split on left bracket
test1 = data2.split('[')

# Remove 'eval(' and zero-length strings
test2 = [i for i in test1 if len(i) > 0 and i != 'eval(']

# Use regex to find strings with numbers between double quotation marks
p = re.compile(r'"\d+"')
test3 = [i for i in test2 if p.match(i)]

# Take item 6 of each record, which is the college name,
# and strip the double quotation marks.
college_list = sorted(rec.split(',')[6].replace('"', '') for rec in test3)

Output:

In [116]: college_list
Out [116]: 
['Georgia Central University',
 'Northwest Baptist Theological Seminary',
 'Steinbach Bible College',
 'Yellowstone Christian College',
...]
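
The approach above depends on the target script being at position 10 and on names containing no commas. As a rough sketch only, the same idea can be written so that it looks for the `w2dc_map_markers_attrs_array.push` marker anywhere in the page text and parses the array with the standard `json` module. This assumes the argument of `eval(` is a valid JSON array of arrays with the name at index 6, which is what the split-based code implies but which I have not verified against the live page. The function name `extract_names` and the `sample` string are made up for illustration.

```python
import json
import re

# Marker taken from the script tag shown above; the lookahead leaves the
# match ending right before the opening '[' of the array.
MARKER = re.compile(
    r'w2dc_map_markers_attrs_array\.push\(.*?eval\(\s*(?=\[)', re.S
)


def extract_names(page_text, name_index=6):
    """Return the sorted names found in the first marker array of page_text.

    Assumes (unverified) that the array is valid JSON and that each record
    keeps the name at position name_index.
    """
    match = MARKER.search(page_text)
    if match is None:
        return []
    # raw_decode parses one JSON value starting at the given offset and
    # ignores whatever JavaScript follows it.
    rows, _ = json.JSONDecoder().raw_decode(page_text, match.end())
    return sorted(row[name_index] for row in rows if len(row) > name_index)


# Hypothetical, heavily shortened stand-in for the real script contents.
sample = (
    "w2dc_map_markers_attrs_array.push(new w2dc_map_markers_attrs('abc', "
    'eval([["12","0","0","","","","Steinbach Bible College"],'
    '["7","0","0","","","","Georgia Central University"]]), 1));'
)

names = extract_names(sample)
print(names)
```

With the real page you would pass `resp.text` from the `requests.get` call instead of `sample`. If the embedded data turns out not to be valid JSON, `raw_decode` raises `json.JSONDecodeError`, and the string-splitting approach above is the fallback.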
