Webscraping inconsistently built tables using BeautifulSoup [gurufocus site]

Problem description

I'm trying to get three indicators from the gurufocus site, and I've run into an issue I'm not sure how to handle properly: the tables I'm scraping are inconsistent in how many rows they have.

I'm getting the Piotroski F-Score, Altman Z-Score and Beneish M-Score from the summary page for each ticker on my list - the example page for the AAPL ticker is here

But when I iterate through my ticker list, I encounter some stocks whose summary page has a table in which my values appear one row earlier (there is no "interest coverage" row) - see here.

Initially, my code looks like this:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    ls = ['Ticker', 'Piotroski F-Score', 'Altman Z-Score', 'Beneish M-Score']
    dict_ls = {k: ls[k] for k in range(len(ls))}

    frames = []
    for symbol in symbols:
        req = requests.get("https://www.gurufocus.com/stock/" + symbol)
        if req.status_code != 200:
            continue
        soup = BeautifulSoup(req.content, 'html.parser')
        tables = soup.find_all('table')

        # Flatten every cell of the second table into one list
        rows = tables[1].find_all('tr')
        out = []
        for row in rows[:-1]:
            out += [td.text for td in row.find_all('td')]
        # Fragile part: assumes the three scores always start at
        # flat cell index 21 and repeat every third cell
        out = [symbol] + out[21::3]
        frames.append(pd.DataFrame(out).transpose())

    df = pd.concat(frames, ignore_index=True)
    df = df.rename(columns=dict_ls)
    df.to_csv('guru-output.csv')

What's the best way to deal with such tables to consistently get those three values?
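For reference, one way to avoid the positional slice entirely is to match each row by its label cell instead of its index, so a missing "interest coverage" row cannot shift the values. A minimal sketch, assuming the label sits in a row's first `<td>` and the value in the second (the exact label strings on gurufocus may differ):

```python
from bs4 import BeautifulSoup

WANTED = ['Piotroski F-Score', 'Altman Z-Score', 'Beneish M-Score']

def extract_scores(html, symbol):
    """Scan every table row and keep only the rows whose label
    cell matches one of the indicators we want."""
    soup = BeautifulSoup(html, 'html.parser')
    scores = {'Ticker': symbol}
    for row in soup.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if len(cells) >= 2 and cells[0] in WANTED:
            scores[cells[0]] = cells[1]
    return scores
```

Because the lookup is by name, the same function works whether or not the page includes the extra row.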

Tags: python, python-3.x, pandas, web-scraping, beautifulsoup

Solution


A few days ago I wrote a very similar program for a university project.

I dealt with this by limiting the number of stocks I look at: I made a separate program for each group of stocks. Each one used the same overall structure, with the parameters adjusted to match the layout of each page.

As far as I can tell, that is currently the only practical approach, since on data-heavy sites (such as stock-market sites) different pages have different structures. Site builders rarely change the structure - they only do so for major design updates, as I noticed during my project's trials.
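The "same structure, different parameters" idea can be sketched as a lookup table of slice offsets keyed by ticker. This is only an illustration: the offset values below are invented, and in practice each one would be found by inspecting that ticker's page.

```python
# Hypothetical per-ticker offsets for pages whose tables have
# different row counts; values here are made up for illustration.
OFFSETS = {'AAPL': 21, 'NVDA': 18}

def slice_scores(cells, symbol, default=21):
    """Take every third flattened cell, starting at the offset
    known for this ticker (falling back to a default)."""
    start = OFFSETS.get(symbol, default)
    return [symbol] + cells[start::3]
```

This keeps the original positional-slicing pipeline intact while letting each page's quirks live in one small dict rather than in separate programs.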

Hope this helps!

