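python - Scraping data from images generated by Nasdaq
Question
I want to get the analyst recommendations from https://www.nasdaq.com/symbol/amzn/recommendations.
The problem is that the data is displayed as a JPEG image, saved as:
https://www.nasdaq.com/charts/AMZN_cnb.jpeg
How are these images generated, and is there a way to access the content as text?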
Solution
Using the BeautifulSoup library, you can get the data you need from the website.
Install Requests and BeautifulSoup via pip:
pip install bs4
pip install requests
pip install html5lib
I hope this resolves your query: the code below scrapes the Heading, Netchange, Percentage and Recommendations from the website.
from requests import get
from bs4 import BeautifulSoup as bs

url = "https://www.nasdaq.com/symbol/amzn/recommendations"
raw = get(url)
# Parse the downloaded page with the html5lib parser.
soup = bs(raw.content, 'html5lib')

# Each field sits in an element identified by its id or class.
heading = soup.find('div', {"id":"qwidget_pageheader"}).text
dollar = soup.find('div', {"class": "qwidget-dollar"}).text
netchange = soup.find('div', {"id":"qwidget_netchange"}).text
percentage = soup.find('div', {"id":"qwidget_percent"}).text
recommendations = soup.find('ul', {"class":"floatL fontS14px"}).text
print(heading, dollar, netchange, percentage, recommendations)
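One caveat with the snippet above: soup.find returns None when nothing matches, so calling .text on the result raises AttributeError if the page markup changes. A minimal sketch of a more defensive variant follows. The helper names safe_text and scrape_summary are hypothetical, and the selectors are copied from the answer as-is, not verified against the current page.

```python
def safe_text(node, default=""):
    """Return the stripped text of a BeautifulSoup node, or `default` if the node is None."""
    if node is None:
        return default
    return node.get_text(strip=True)


def scrape_summary(url):
    """Fetch `url` and return the answer's fields as a dict (hypothetical helper)."""
    # Imported lazily so safe_text stays usable without the third-party packages.
    from requests import get
    from bs4 import BeautifulSoup

    response = get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.content, "html5lib")

    # Selectors taken from the answer; missing elements yield "" instead of an exception.
    return {
        "heading": safe_text(soup.find("div", {"id": "qwidget_pageheader"})),
        "dollar": safe_text(soup.find("div", {"class": "qwidget-dollar"})),
        "netchange": safe_text(soup.find("div", {"id": "qwidget_netchange"})),
        "percentage": safe_text(soup.find("div", {"id": "qwidget_percent"})),
        "recommendations": safe_text(soup.find("ul", {"class": "floatL fontS14px"})),
    }
```

Calling scrape_summary("https://www.nasdaq.com/symbol/amzn/recommendations") would then return a dict in which any field that could not be found is an empty string.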
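To read the image itself, scrape it and then use pytesseract, which extracts text from images.
pip install pytesseract
- Install Tesseract on your system. For example, on a Mac you would use Homebrew:
brew install tesseract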
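Since pytesseract is only a wrapper around the Tesseract executable, it can help to confirm that the binary is reachable before running any OCR. A small sketch using only the standard library; the function name tesseract_available is hypothetical.

```python
import shutil


def tesseract_available():
    """Return True if a `tesseract` executable is found on the PATH."""
    return shutil.which("tesseract") is not None


if not tesseract_available():
    print("tesseract not found on PATH; install it before using pytesseract")
```

If the binary lives somewhere that is not on the PATH, pytesseract's documentation describes pointing pytesseract.pytesseract.tesseract_cmd at the full path instead.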
Sample Code
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract


def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # Use Pillow's Image class to open the image and pytesseract to detect the string in it.
    text = pytesseract.image_to_string(Image.open(filename))
    return text


print(ocr_core('images/ocr_example_1.png'))
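To tie this back to the question, the chart can be downloaded and passed to pytesseract without saving it to disk. This is a rough sketch, not a tested solution: the names chart_url and ocr_chart are hypothetical, and the URL pattern is an assumption generalized from the single AMZN example in the question. How well OCR reads the chart depends on the image, so the output may need cleaning up.

```python
from io import BytesIO


def chart_url(symbol):
    """Build the chart URL for a ticker, assuming the AMZN pattern holds for other symbols."""
    return "https://www.nasdaq.com/charts/{}_cnb.jpeg".format(symbol.upper())


def ocr_chart(symbol):
    """Download the recommendation chart for `symbol` and return the OCR'd text."""
    # Imported lazily so chart_url can be used without the third-party packages.
    import requests
    import pytesseract
    from PIL import Image

    response = requests.get(chart_url(symbol), timeout=10)
    response.raise_for_status()  # stop here if the image could not be fetched

    # Image.open accepts a file-like object, so the bytes never touch the disk.
    image = Image.open(BytesIO(response.content))
    return pytesseract.image_to_string(image)
```

For example, print(ocr_chart("amzn")) would fetch https://www.nasdaq.com/charts/AMZN_cnb.jpeg and print whatever text Tesseract recognizes in it.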