python - Webscraping inconsistently built tables using BeautifulSoup [gurufocus site]
Problem description
I'm trying to get three indicators from the gurufocus site and have run into an issue I'm not sure how to address properly: the tables I'm scraping are inconsistent in how many rows they have.
I'm getting the Piotroski F-Score, Altman Z-Score and Beneish M-Score from the summary page for each of the tickers on my list; the page for the AAPL ticker is one example.
But when I iterate through my list of tickers, I encounter some stocks whose summary page has a table in which my values sit one row earlier, because there is no "interest coverage" row.
Initially, my code looks like this:
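import pandas as pd
import requests
from bs4 import BeautifulSoup

# symbols is the list of ticker strings, defined elsewhere
ls = ['Ticker', 'Piotroski F-Score', 'Altman Z-Score', 'Beneish M-Score']
dict_ls = {k: ls[k] for k in range(len(ls))}
df = pd.DataFrame()
for j in range(len(symbols)):
    req = requests.get("https://www.gurufocus.com/stock/" + symbols[j])
    if req.status_code != 200:
        continue
    soup = BeautifulSoup(req.content, 'html.parser')
    table = soup.find_all(lambda tag: tag.name == 'table')
    rows = table[1].find_all(lambda tag: tag.name == 'tr')
    # Flatten the text of every <td> in the table, skipping the last row
    out = []
    for i in range(len(rows) - 1):
        td = rows[i].find_all('td')
        out = out + [x.text for x in td]
    # Fixed offset: this is what breaks when a row is missing from the table
    out = [symbols[j]] + out[21::3]
    out_df = pd.DataFrame(out).transpose()
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, out_df], ignore_index=True)
df = df.rename(columns=dict_ls)
df.to_csv('guru-output.csv')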
What's the best way to deal with such tables to consistently get those three values?
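Solution
A few days ago I wrote a very similar program for my university project.
I solved the problem by limiting the number of stocks I looked at, and by writing a separate program for each stock. Every program used the same structure, with the parameters adjusted to fit its page.
I think this is currently the only way, because on sites with large amounts of data (such as stock market sites) every page has a different structure. The site builders rarely change that structure; they do so only with a major design update, as I noticed while testing my project.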
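The "same structure, different parameters" idea can be sketched as one function plus a per-ticker parameter table, rather than one program per stock. This is only a minimal sketch under assumptions: `pick_scores`, `DEFAULT_PARAMS`, `PAGE_PARAMS` and the ticker `XYZ` are hypothetical names, `cells` stands for the flattened list of `<td>` texts that the question's code calls `out`, and the offset 18 assumes the missing row holds three cells.

```python
# Hypothetical parameters: index of the first score in the flattened list
# of <td> texts, and the stride between consecutive scores.
DEFAULT_PARAMS = {"start": 21, "step": 3}  # values taken from out[21::3] in the question

# Per-ticker overrides. "XYZ" is a made-up ticker standing for a page whose
# table lacks one row (assumed to hold three cells), so values start 3 cells earlier.
PAGE_PARAMS = {
    "XYZ": {"start": 18, "step": 3},
}


def pick_scores(ticker, cells, count=3):
    """Return [ticker, score1, ..., scoreN] using that ticker's parameters."""
    params = PAGE_PARAMS.get(ticker, DEFAULT_PARAMS)
    start, step = params["start"], params["step"]
    values = cells[start:start + step * count:step]
    if len(values) != count:
        # Fail loudly instead of silently writing shifted or missing values
        raise ValueError(f"{ticker}: expected {count} values, got {len(values)}")
    return [ticker] + values
```

In the question's loop, `out = [symbols[j]] + out[21::3]` would become `out = pick_scores(symbols[j], out)`. Raising an error when the table is too short makes a page with an unexpected layout show up immediately, so a new entry can be added to `PAGE_PARAMS` for it.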
Hope this helps!