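python - Scrolling through Yahoo Finance news
Problem description
I'm working on a small project in which I scrape Yahoo Finance news for a specific company and run some data analysis on it to see how news sentiment affects stock performance. I'm trying to scroll and scrape indefinitely until the page stops loading new content, but I can't get past the first scroll.
I'm using Selenium to help. I've looked everywhere for help, but the difficulty seems to be that the news results load incrementally each time you scroll down, which complicates things.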
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
# Web scraper for an infinite-scrolling page
url = "https://finance.yahoo.com/quote/company/press-releases?p=company"
# Selenium 4 expects the driver path to be wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
time.sleep(2) # Allow 2 seconds for the web page to open
scroll_pause_time = 2
screen_height = driver.execute_script("return window.screen.height;") # get the screen height of the web
i = 1
SCROLL_PAUSE_TIME = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom, wait, then check whether the page grew
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
##### Extract Article Titles #####
titles = []
soup = BeautifulSoup(driver.page_source, "html.parser")
for t in soup.find_all(class_="Cf"):
    a_tag = t.find("a", class_="Fw(b)")
    if a_tag:
        text = a_tag.text
        titles.append(text)
Solution
This sample code is from a project that I worked on a short time ago. Hopefully it helps you get going in the right direction.
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
from pandas import DataFrame
resp = urllib.request.urlopen("https://www.cnbc.com/finance/")
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))
substring = 'https://www.cnbc.com/'
# The list is seeded with the string 'review', which becomes row 0 of the result
df = ['review']
for link in soup.find_all('a', href=True):
    # keep only absolute links that start with the CNBC domain
    if link['href'].startswith(substring):
        df.append(link['href'])
#list(df)
# convert list to data frame
df = DataFrame(df)
#type(df)
#list(df)
# add column name
df.columns = ['review']
df.columns
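import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# VADER needs its lexicon; this downloads it once if it is not already installed
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
# polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' keys
df['sentiment'] = df['review'].apply(lambda x: sid.polarity_scores(x))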
def convert(x):
    # Map the compound score to a label
    if x < 0:
        return "negative"
    elif x > .2:
        return "positive"
    else:
        return "neutral"
df['result'] = df['sentiment'].apply(lambda x:convert(x['compound']))
df['result']
df_final = pd.merge(df['review'], df['result'], left_index=True, right_index=True)
df_final
Result:
    review                                              result
0   review                                              neutral
1   https://www.cnbc.com/business/                      neutral
2   https://www.cnbc.com/2021/02/22/chinas-foreign...   neutral
3   https://www.cnbc.com/2021/02/22/chinas-foreign...   neutral
4   https://www.cnbc.com/evelyn-cheng/                  neutral
..  ...                                                 ...
89  https://www.cnbc.com/banks/                         neutral
90  https://www.cnbc.com/2021/02/17/wells-fargo-sh...   neutral
91  https://www.cnbc.com/technology/                    neutral
92  https://www.cnbc.com/2021/02/17/lakestar-found...   neutral
93  https://www.cnbc.com/finance/?page=2                neutral
[94 rows x 2 columns]
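Returning to the scrolling problem in the question: one possible reason the loop stops after the first scroll is that the 0.5 second pause is shorter than the time the page needs to append new items, so the height check sees no change and breaks. That is an assumption, not something verified against the live page. The sketch below tolerates a few "no growth" rounds before giving up. The function name `scroll_until_stable` and its parameters `scroll`, `measure`, `patience`, and `max_rounds` are hypothetical and chosen for illustration. The browser actions are passed in as callables so the logic can be tried without a browser.

```python
import time
from typing import Callable


def scroll_until_stable(
    scroll: Callable[[], None],
    measure: Callable[[], int],
    pause: float = 2.0,
    patience: int = 3,
    max_rounds: int = 200,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll until measure() fails to grow for `patience` rounds in a row.

    Returns the largest value measured. `max_rounds` caps the loop so a page
    that never stops growing cannot keep it running forever.
    """
    last = measure()
    stalled = 0
    for _ in range(max_rounds):
        scroll()
        sleep(pause)  # give lazily loaded content time to arrive
        current = measure()
        if current > last:
            # The page grew: remember the new size and reset the stall counter
            last = current
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return last
```

With the Selenium driver from the question, `scroll` could be `lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")` and `measure` could be `lambda: driver.execute_script("return document.body.scrollHeight")`. Calling it with `patience=1` reproduces the original loop's behavior of stopping at the first unchanged height.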