Extracting specific string matches from a stock website page

Problem description

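I am using the code below to try to scrape stock market cap values. At first I tried the conventional approach of collecting the market cap values into a list with bs4. When I ran print(x.find('span',{'class': 'Trsdu(0.3s)'}).text) that way, I got the error AttributeError: 'NoneType' object has no attribute 'text'.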

for x in marketCapArray:
    print(x.find('span',{'class': 'Trsdu(0.3s)'}).text)

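I don't know how to fix the above error in my code, so I took an alternative approach and used regular expressions to simply extract the required values, as attempted below.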

Main code

import bs4
import re
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen

def pickTopGainers():
  url =  'https://in.finance.yahoo.com/gainers?offset=0&count=100'
  page = urlopen(url)
  soup = bs4.BeautifulSoup(page,"html.parser")
  marketCapArray = soup.find_all('td', {'class': 'Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)',
 'aria-label': 'Market cap'})
  print(str(marketCapArray))
  xi = re.findall("........</span>", str(marketCapArray)) # regex-use-1
  pi = re.sub("(</span>|....>N/A|>|\")","", str(xi))
  print(pi)

pickTopGainers()

Result

This is what print(str(marketCapArray)) outputs (only part of it is pasted here).

[<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="93"><span class="Trsdu(0.3s)" data-reactid="94">159.404M</span></td>, 
<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="119"><span class="Trsdu(0.3s)" data-reactid="120">533.97M</span></td>, 
<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="145"><span data-reactid="146">N/A</span></td>, 
<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="171"><span class="Trsdu(0.3s)" data-reactid="172">2.952B</span></td>, 
<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="197"><span class="Trsdu(0.3s)" data-reactid="198">9.223B</span></td>, 
<td aria-label="Market cap" class="Va(m) Ta(end) Pstart(20px) Pend(10px) W(120px) Fz(s)" colspan="" data-reactid="223"><span data-reactid="224">N/A</span></td>]

This is the output of print(pi), which is also the final output.

['159.404M', '533.97M', '', '2.952B', '9.223B', '']


Question

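How can I avoid the regex replacement (re.sub) in the Main code above and still get the final output pi shown? Alternatively, please suggest the correct way to do this. I find my regex unpleasant.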

Tags: python, html, beautifulsoup

Solution


You can iterate row by row over the <table> that stores all the information. For example:

import requests
from bs4 import BeautifulSoup


url = 'https://in.finance.yahoo.com/gainers?offset=0&count=100'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

fmt_string = '{:<15} {:<60} {:<10} {:<10} {:<10} {:<10} {:<10} {:<10} {:<10}'
print(fmt_string.format('Symbol', 'Name', 'Price(int)', 'Change', '% change', 'Volume', 'AvgVol(3M)', 'Market Cap', 'PE ratio'))
for row in soup.select('table:has(a[href*="/quote/"]) > tbody > tr'):
    cells = [td.get_text(strip=True) for td in row.select('td')]
    print(fmt_string.format(*cells[:-1]))

Prints:

Symbol          Name                                                         Price(int) Change     % change   Volume     AvgVol(3M) Market Cap PE ratio  
CCCL.NS         Consolidated Construction Consortium Limited                 0.2000     +0.0500    +33.33%    57,902     290,154    159.404M   N/A       
KSERASERA.NS    KSS Limited                                                  0.2500     +0.0500    +25.00%    1.607M     2.601M     533.97M    N/A       
BONLON.BO       BONLON INDUSTRIES LIMITED                                    21.60      +3.60      +20.00%    16,000     N/A        N/A        N/A       
MENONBE.NS      Menon Bearings Limited                                       52.80      +8.80      +20.00%    2.334M     65,713     2.952B     25.05     
RPOWER.NS       Reliance Power Limited                                       3.3000     +0.5500    +20.00%    127.814M   18.439M    9.223B     N/A       
11DPD.BO        Nippon India Mutual Fund                                     0.0600     +0.0100    +20.00%    190        N/A        N/A        N/A       
ABFRLPP-E1.NS   Aditya Birla Rs.5 ppd up                                     105.65     +17.60     +19.99%    1.238M     N/A        N/A        N/A       
500110.BO       Chennai Petroleum Corporation Limited                        64.55      -0.15      -0.23%     42,765     61,584     9.612B     N/A       
ABFRLPP.BO      Aditya Birla Fashion and Retai                               106.05     +17.65     +19.97%    387,703    N/A        N/A        N/A       
RADIOCITY.NS    Music Broadcast Limited                                      21.35      +3.55      +19.94%    12.657M    1.013M     7.38B      124.13    
RADIOCITY.BO    Music Broadcast Limited                                      21.35      +3.55      +19.94%    898,070    90,236     7.38B      124.13    
MENONBE.BO      Menon Bearings Limited                                       52.65      +8.75      +19.93%    137,065    8,648      2.951B     24.98     
MTNL.BO         Mahanagar Telephone Nigam Limited                            10.72      +1.78      +19.91%    1.142M     156,275    6.754B     N/A       

...and so on.
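
If only the Market cap column is needed as a list, as in the question's final output, a minimal sketch along the same lines could read each cell's text directly instead of post-processing with re.sub. SAMPLE_HTML below is a shortened copy of the markup shown in the question, and market_caps is a hypothetical helper name. Mapping 'N/A' to an empty string is an assumption made only to reproduce the question's pi output. Note that the N/A cells contain a span without the Trsdu(0.3s) class, so x.find('span', {'class': 'Trsdu(0.3s)'}) returns None for them, which is most likely what caused the AttributeError.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (class lists trimmed).
SAMPLE_HTML = """
<table><tbody>
<tr><td aria-label="Market cap"><span class="Trsdu(0.3s)">159.404M</span></td></tr>
<tr><td aria-label="Market cap"><span class="Trsdu(0.3s)">533.97M</span></td></tr>
<tr><td aria-label="Market cap"><span>N/A</span></td></tr>
<tr><td aria-label="Market cap"><span class="Trsdu(0.3s)">2.952B</span></td></tr>
<tr><td aria-label="Market cap"><span class="Trsdu(0.3s)">9.223B</span></td></tr>
<tr><td aria-label="Market cap"><span>N/A</span></td></tr>
</tbody></table>
"""


def market_caps(html):
    """Return the text of every 'Market cap' cell, with 'N/A' mapped to ''."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    # Select the cells by aria-label only, so the span's class does not matter.
    for td in soup.select('td[aria-label="Market cap"]'):
        text = td.get_text(strip=True)
        # Assumption: 'N/A' becomes '' to match the question's output.
        values.append("" if text == "N/A" else text)
    return values


caps = market_caps(SAMPLE_HTML)
print(caps)
```

Against the live page, the same function could be called as market_caps(requests.get(url).content), assuming the page still serves markup like the sample above.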
