Unable to scrape table elements from https://www.screener.in/

Problem Description

I want to scrape some financial data from the site using the following code:

import bs4 as bs
import urllib.request
import ssl

# Skip certificate verification to work around local SSL errors
ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://www.screener.in/screen/raw/?sort=name&order=&source=&query=Market+capitalization+%3E+350&limit=50&page=1'
urlpage = urllib.request.urlopen(url)
soup = bs.BeautifulSoup(urlpage, 'html.parser')
column = []
data = []
column.append(' ')
rows = soup.find('div', {'class': 'responsive-holder fill-card-width'}).findAll('tr')

But it doesn't find anything; it just returns an empty list.

Tags: python, scrapy

Solution


First of all, you are using the Beautiful Soup library, not Scrapy.
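
The empty result is most likely because this screen is only served to logged-in users, so the table never appears in the anonymous response. A quick way to check (a sketch; the data-table class name is taken from the working code below):

import urllib.request
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://www.screener.in/screen/raw/?sort=name&order=&source=&query=Market+capitalization+%3E+350&limit=50&page=1'
html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')

# If this prints False, the table is simply not in the anonymous response,
# so no BeautifulSoup selector will ever find it
print('data-table' in html)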

I have put together an example for you of how to make a login request with Python and extract the data with BeautifulSoup.

I have tested it with Python 3.8.0.

import bs4 as bs
import requests
import re

# Endpoints: the login page and the screen whose table we want

login_URL = 'https://www.screener.in/login/'
data_URL = 'https://www.screener.in/screen/raw/?sort=name&order=&source=&query=Market+capitalization+%3E+350&limit=50&page=1'

# Login credentials (a throwaway account created for this example)

form_data = {
    'username': 'letiwoh199@ichkoch.com',
    'password': 'qweqweqwe'
}

# Form/cookie field names and request headers

form_csrf_key = 'csrfmiddlewaretoken'
cookie_csrf_key = 'csrftoken'
cookie_session_key = 'sessionid'
content_type = 'application/x-www-form-urlencoded'
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324'

# Fetch the login page to get the CSRF token from the hidden form input and the CSRF cookie from the Set-Cookie header

get_login_request = requests.get(login_URL)
get_login_request_soup = bs.BeautifulSoup(get_login_request.text, 'html.parser')
form_csrf_value = get_login_request_soup.find('input', {'name': form_csrf_key})['value']
cookie_csrf_value = re.search(cookie_csrf_key + '=(.*?);', get_login_request.headers['Set-Cookie']).group(1)

# Log in and capture the session cookie from the login response

form_data[form_csrf_key] = form_csrf_value

post_login_request = requests.post(login_URL, form_data, headers={
    'Cookie': cookie_csrf_key + '=' + cookie_csrf_value,
    'Content-Type': content_type,
    'User-Agent': user_agent,
    'Referer': login_URL
}, allow_redirects=False)  # don't follow the redirect so this response's Set-Cookie stays readable

cookie_session_value = re.search(cookie_session_key + '=(.*?);', post_login_request.headers['Set-Cookie']).group(1)

# Request the data page with both cookies attached

get_data_request = requests.get(data_URL, headers={
    'Cookie': cookie_csrf_key + '=' + cookie_csrf_value + '; ' + cookie_session_key + '=' + cookie_session_value,
    'Content-Type': content_type,
    'User-Agent': user_agent,
})

get_data_request_soup = bs.BeautifulSoup(get_data_request.text, 'html.parser')
table_rows = get_data_request_soup.find('table', {'class': 'data-table'}).findAll('tr')

If you run print(table_rows[1]) after this code, it will output:

<tr data-row-company-name="Zydus Wellness">
<td class="text">1.</td>
<td class="text">
<a href="/company/ZYDUSWELL/consolidated/" target="_blank">
            Zydus Wellness
          </a>
</td>
<td>1862.50</td>
<td>62.49</td>
<td>11851.49</td>
<td>0.27</td>
<td>1.74</td>
<td>305.42</td>
<td>381.58</td>
<td>14.70</td>
<td>6.04</td>
</tr>
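
If you want the rows as a proper table, here is a minimal sketch that converts them into a DataFrame (assuming pandas is available, as in the question's imports, and that the header row uses <th> cells):

import pandas as pd

# Column names come from the <th> cells of the first row
header = [th.get_text(strip=True) for th in table_rows[0].findAll('th')]

# Each remaining <tr> holds <td> cells; keep only rows that match the header width
records = [
    [td.get_text(strip=True) for td in tr.findAll('td')]
    for tr in table_rows[1:]
]
records = [row for row in records if len(row) == len(header)]

df = pd.DataFrame(records, columns=header)
print(df.head())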

You can change the data_URL variable to fetch data from a different page.
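
For example, a sketch that loops over the page query parameter to collect several pages of the same screen (the page count of 3 is arbitrary; the cookie variables are reused from the code above):

all_rows = []
for page in range(1, 4):
    page_URL = 'https://www.screener.in/screen/raw/?sort=name&order=&source=&query=Market+capitalization+%3E+350&limit=50&page=' + str(page)
    page_request = requests.get(page_URL, headers={
        'Cookie': cookie_csrf_key + '=' + cookie_csrf_value + '; ' + cookie_session_key + '=' + cookie_session_value,
        'User-Agent': user_agent,
    })
    page_soup = bs.BeautifulSoup(page_request.text, 'html.parser')
    table = page_soup.find('table', {'class': 'data-table'})
    if table is None:
        break  # no more pages, or the session has expired
    all_rows.extend(table.findAll('tr'))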

You can also change the username and password in the credentials section. I created a temporary account for this example.
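
As a side note, requests.Session() tracks the CSRF and session cookies automatically, which avoids the manual Set-Cookie parsing above. A minimal sketch of the same login flow:

import bs4 as bs
import requests

login_URL = 'https://www.screener.in/login/'
data_URL = 'https://www.screener.in/screen/raw/?sort=name&order=&source=&query=Market+capitalization+%3E+350&limit=50&page=1'

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324'

# GET the login page; the session stores the csrftoken cookie by itself
login_page = session.get(login_URL)
login_soup = bs.BeautifulSoup(login_page.text, 'html.parser')
csrf_token = login_soup.find('input', {'name': 'csrfmiddlewaretoken'})['value']

# POST the credentials; the session keeps the sessionid cookie it receives
session.post(login_URL, data={
    'username': 'letiwoh199@ichkoch.com',
    'password': 'qweqweqwe',
    'csrfmiddlewaretoken': csrf_token,
}, headers={'Referer': login_URL})

# Later requests are authenticated through the stored cookies
data_page = session.get(data_URL)
data_soup = bs.BeautifulSoup(data_page.text, 'html.parser')
table_rows = data_soup.find('table', {'class': 'data-table'}).findAll('tr')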

