Scraping data after filling in a form on a website with Python

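Problem description

I am trying to scrape data from http://www.educationboardresults.gov.bd/ with Python and BeautifulSoup.

The site first requires a form to be filled in. Once the form is submitted, the site returns the result. I have attached two screenshots.

Before submitting the form: https://prnt.sc/w4lo7i

After submitting: https://prnt.sc/w4lqd0

I tried the following code: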

import requests
from bs4 import BeautifulSoup as bs

resultdata = {
'sr': '3',
'et': '2',
'exam': 'ssc',
'year': 2012,
'board': 'chittagong',
'roll': 102275,
'reg': 626948,
'button2': 'Submit',
 }
headers ={
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
    
    
}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd'
    r = s.get(url, headers=headers)
    soup = bs(r.content,'html5lib')
#Scraping  and by passing Captcha

alltable =soup.findAll('td')
captcha = alltable[56].text.split('+')
for digit in captcha:
   value_one, value_two = int(captcha[0]), int(captcha[1])

resultdata['value_s'] = value_one+value_two
r=s.post(url, data=resultdata, headers= headers)

When I print r.content, it shows the HTML of the first page. I want to scrape the second page. Thanks in advance.

Tags: python, web-scraping, beautifulsoup, python-requests

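Solution


I have been working on this too. The version below differs from the code in the question in three ways: the form is POSTed to result.php rather than to the home page, the captcha is parsed inside the session block, and the year, roll and registration values are sent as strings.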

import requests
from bs4 import BeautifulSoup as bs

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': "2012",
    'board': 'chittagong',
    'roll': "102275",
    'reg': "626948",
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    # Session cookie copied from a browser; it may have expired by the time this runs.
    'Cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    # 'Request URL' is a field shown by browser dev tools, not an HTTP header, so it is omitted.
}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd/index.php'
    r = s.get(url, headers=headers)
    soup = bs(r.content,'lxml')
    # print(soup.prettify())

    # Read the arithmetic captcha ("a + b") from the page and work out its answer.
    # Index 56 depends on the page layout and may need adjusting.
    alltable = soup.find_all('td')
    captcha = alltable[56].text.split('+')
    print(captcha)
    value_one, value_two = int(captcha[0]), int(captcha[1])
    print(value_one, value_two)

    resultdata['value_s'] = value_one+value_two

    resultdata['button2'] = 'Submit'
    print(resultdata)
    # POST to result.php (the form target), not to the home page.
    r = s.post("http://www.educationboardresults.gov.bd/result.php", data=resultdata, headers=headers)
    soup = bs(r.content, 'lxml')
    print(soup.prettify())
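
The hard-coded index alltable[56] will break if the page gains or loses a table cell. As a rough sketch of a less brittle approach, the helper below scans cell texts for the first one that looks like "a + b" and returns the sum. The name find_captcha_sum and the exact captcha format are assumptions inferred from the split('+') call above, not something the site documents.

```python
import re

# Matches text such as " 4 + 7 ": two non-negative integers joined by a plus sign.
# The captcha format is an assumption based on the split('+') parsing above.
CAPTCHA_RE = re.compile(r"^\s*(\d+)\s*\+\s*(\d+)\s*$")


def find_captcha_sum(cell_texts):
    """Return the sum for the first text that looks like 'a + b', or None."""
    for text in cell_texts:
        match = CAPTCHA_RE.match(text)
        if match:
            return int(match.group(1)) + int(match.group(2))
    return None
```

Inside the session block it could be used as resultdata['value_s'] = find_captcha_sum(td.get_text() for td in soup.find_all('td')), with a check for None before posting the form.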
