
Problem Description

I am trying to create a web crawler application that downloads a bunch of assorted PDFs from SAI, using this tutorial as my starting point: https://www.youtube.com/watch?v=sVNJOiTBi_8

I have tried get(url) on the site, but what I believe I actually need is the source of the frame inside the page it returns.

Authentication (not included)....
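
For reference, the omitted step presumably builds a logged-in requests.Session that the code below reuses. A minimal sketch of that idea, assuming Script/Login.asp accepts a plain form POST; the field names username and password are hypothetical and would have to be read from the login page's actual form markup:

import requests

session = requests.Session()

# Hypothetical login sketch: the real field names come from Login.asp's <form>.
credentials = {
    'username': 'my_user',  # assumed field name
    'password': 'my_pass',  # assumed field name
}
resp = session.post('https://www.saiglobal.com/online/Script/Login.asp',
                    data=credentials)
print(resp.status_code)  # the session now carries the classic ASP session cookie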

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

session = requests.Session()

def subscription_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.saiglobal.com/online/Script/listvwstds.asp?TR=' + str(page)
        source_code = session.get(url) # need to get the frame source!!
        # extra code to find the href for the frame, then get the frame source
        # https://www.saiglobal.com/online/Script/ListVwStds.asp

        plain_text = source_code.text # source_code.text printed is not the one we want

        # dump the raw response body to a file for inspection
        with open('source_code.text', 'wb') as playFile:
            for chunk in source_code.iter_content(100000):
                playFile.write(chunk)

        soup = BeautifulSoup(source_code.content, features='html.parser')
        for link in soup.findAll('a', {'class': 'stdLink'}): # can't find the stdLink
            href = "https://www.saiglobal.com/online/" + link.get('href')
            '''
            get_pdf(href)
            '''
            print(href)
        page += 1

'''
# function will probably give bad names to the files, but that can be fixed later
def get_pdf(std_url):
    source_code = session.get(std_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, features='html.parser')
    for link in soup.select("a[href$='.pdf']"):
        # name each PDF file after the last portion of its link, which is unique here
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(session.get(urljoin(std_url, link['href'])).content)
'''

r = session.get("https://www.saiglobal.com/online/")
print(r.status_code)

subscription_spider(1)

r = session.get("https://www.saiglobal.com/online/Script/Logout.asp") # not sure if this logs out or not
print(r.status_code)

The text file it creates contains this:

<frameset rows="*, 1">
<frame SRC="Script/Login.asp?">
<frame src="Script/Check.asp" noresize>
</frameset>

But when I inspect the element I want, it sits inside a frame source that I can't access directly. I think the problem has to do with how classic ASP pages are structured, but I'm not sure what to do with the HTML elements.
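
requests does not render frames, so the inner document has to be fetched explicitly: parse the frameset, read the frame's src attribute, and issue a second GET against that URL with the same session. A minimal sketch of that idea (the helper name get_frame_source is my own):

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_frame_source(session, url):
    # fetch the outer frameset page first
    outer = session.get(url)
    soup = BeautifulSoup(outer.text, features='html.parser')
    frame = soup.find('frame')  # pick the frame you want; here simply the first one
    if frame is None:
        return outer  # no frameset: the page itself is already the content
    # the frame src is usually relative (e.g. "Script/Login.asp?"), so resolve it
    frame_url = urljoin(url, frame.get('src'))
    return session.get(frame_url)

Note that in the frameset above the first frame points at Script/Login.asp, which suggests the session was never authenticated; an unauthenticated session will keep being served the login frame instead of the standards list.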

The output of the program is:

200
200
Press any key to continue . . .

Tags: python, web-scraping, web-crawler, frame

Solution
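
One way to tie the pieces together, sketched under the assumption that the session has been authenticated first and that get_frame_source (sketched above) resolves the frameset down to the page that actually carries the stdLink anchors:

from bs4 import BeautifulSoup

def subscription_spider(session, max_pages):
    for page in range(1, max_pages + 1):
        url = 'https://www.saiglobal.com/online/Script/listvwstds.asp?TR=' + str(page)
        inner = get_frame_source(session, url)  # helper sketched earlier
        soup = BeautifulSoup(inner.content, features='html.parser')
        for link in soup.find_all('a', class_='stdLink'):
            href = 'https://www.saiglobal.com/online/' + link.get('href')
            print(href)

The two 200s in the output only confirm that some page was returned, not which one; checking for the stdLink anchors (or dumping inner.text) is the quickest way to tell whether you reached the list page or were bounced back to the login frame.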

