Scraping a website (marketchameleon) returns encrypted data

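Question

I am learning how to scrape websites with Python, for now just using requests and BeautifulSoup...

I am trying to access the following page: https://marketchameleon.com/Overview/BAX/Earnings/Earnings-Dates

Yes, you need a subscription to see all of the data, but this is only for learning purposes, so the small amount of data visible in the browser is enough.

Here is how I fetch the data: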

import requests
import urllib.request
from bs4 import BeautifulSoup
headers_Get = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}
url = 'https://marketchameleon.com/Overview/BAX/Earnings/Earnings-Dates'
# Pass the headers by keyword; the second positional argument of requests.get is `params`.
response = requests.get(url, headers=headers_Get)
soup = BeautifulSoup(response.text, "html.parser")

However, the returned HTML appears to be encrypted (this is only an excerpt, since the encrypted part is very long):

<div class="symov_earnings">
<div class="flex_container_between flex_center_vertical">
<div class="dl-tbl-outer"><div class="dis-prem"><button class="_noprem prem-btn" onclick="site_OpenPremium();">Download Now</button><div class="dis-prem-pop"><p>Premium Feature</p><p><a href="/Account/Login">Login</a><span>|</span><a href="/Subscription/Compare">Subscribe</a></p></div></div></div>
</div>
<div cipherxx="OwA+ADwAOQA+ADwABABEAFcAVgBdAFYAEwBfAFwADQAUAEcASABeAGwAUABNAEQAaQBRAFAAQQBdAF8AVgBXAEUAFgARAFAAXwBXAEsAQwALABYAXABDAGwAWgBRAFcAXgBAAFMAXABBAFIAXQBCABQACgA8ADkAEwAWABgAEAAKAEAAWQBWAFIAUgAGAD0APAAUABEAEwATABYAGAAQABYACABFAEEAEwBVAFQAUQBFAEcADAARAF4AVwBRAF4AaQBcAFQAUgBXAF8AVgBXABQACgA8ADkAEwAWABgAEAAWABQAEQATABMAFgAYABAACgBAAFkAEwBQAFkAVABDAEYAVQBfAA4AEQAOABoADgBjAEQAUgBcAF4AXwBWAFcAFgBxAFAAQQBdAF8AVgBXAEUACAAeAEcAWwAIADUAOgAWABQAEQATABMAFgAYABAACgAbAEUAQQANA

Is there any way to find out what is going on (how is the site protecting itself from scrapers?) and get the actual HTML data?

Thanks

Tags: python, web-scraping, beautifulsoup

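Answer

The data is indeed encrypted. If you look at the JS files that are part of the site, you can find the specific file containing the function used to decrypt the data. All of this is done in JavaScript, so you have 2 options:

Using the first option (re-implementing the encryption function in Python), you can do this: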

import requests
from bs4 import BeautifulSoup
import base64
import json

url = "https://marketchameleon.com/Overview/BAX/Earnings/Earnings-Dates"

session = requests.Session()

r = session.get(url)
soup = BeautifulSoup(r.text, "html.parser")

key = session.cookies.get_dict()["v1"]
encryptedDivs = [ i["cipherxx"] for i in soup.find_all("div") if i.get("cipherxx")]

unencrypted = []
for div in encryptedDivs:
    encryptedData = base64.b64decode(div)
    cipher = "".join([
        chr(encryptedData[i]) 
        for i in range(0,len(encryptedData),2)
    ])
    data = ""
    for i in range(0, len(cipher)):
        c_num = ord(cipher[i])
        k_num = ord(key[i % len(key)])
        c2 = c_num ^ k_num
        data += chr(c2)

    unencrypted.append(data)

# unencrypted[0] is the header div with some info about the stock price etc.
# unencrypted[1] is the first table
# let's parse the second table, unencrypted[2]

soup = BeautifulSoup(unencrypted[2], "html.parser")

# Only the direct <tr> children of <tbody> are needed for the rows below.
tbody = soup.find("tbody").find_all("tr", recursive=False)

table2 = [
    {
        "Date": t[0].text.strip(),
        "Time": t[1].text.strip(),
        "Period": t[2].text.strip(),
        "Conference Call": t[3].text.strip(),
        "Price Effect" : t[4].find("span").text if t[4].find("span") else t[4].text.strip(),
        "Implied Straddle": t[5].text.strip(),
        "Closing Price": t[6].text.strip(),
        "Opening Gap": t[7].text.strip(),
        "Drift Since": t[8].text.strip(),
        "Range Since": t[9].text.strip(),
        "Price Change 1 Week Before":t[10].text.strip(),
        "Price Change 1 Week After": t[11].text.strip()
    }
    for t in (t.find_all('td', recursive=False) for t in tbody)
    if len(t) >= 12  # indices 0..11 are read above, so 12 cells are required
]

print(json.dumps(table2, indent=4, sort_keys=True))

Note that the encryption key is stored in a cookie named v1 (which is why you need requests.Session()).
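The decoding loop above can also be packaged as a small reusable function. This is only a sketch that mirrors the same steps: the name `decrypt_cipherxx` is made up for illustration, and it assumes, like the loop above, that only every second byte of the decoded payload carries data.

```python
import base64
from itertools import cycle


def decrypt_cipherxx(payload: str, key: str) -> str:
    """Decode one cipherxx attribute value (hypothetical helper)."""
    raw = base64.b64decode(payload)
    # Keep every second byte, exactly as the loop above does.
    low_bytes = raw[0::2]
    # XOR each byte with the key, repeating the key as often as needed.
    return "".join(chr(b ^ ord(k)) for b, k in zip(low_bytes, cycle(key)))
```

With the names from the script above, it would be called as `decrypt_cipherxx(div, key)` for each entry of `encryptedDivs`.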

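The encryption part

This is XOR encryption. It XORs each value of the data with the key (in this case, the key is stored in a cookie). To decrypt, you simply XOR the ciphertext with the same key to get the original data back.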
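The reason the same operation both encrypts and decrypts is that XOR undoes itself: (d ^ k) ^ k == d. A minimal sketch of that property, using an arbitrary key and a hypothetical helper name `xor_with_key` (neither comes from the site):

```python
def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the key, repeating the key as needed."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


plain = b"some text"
key = b"k3y"  # arbitrary key, for illustration only

cipher = xor_with_key(plain, key)
# Applying the same operation a second time restores the original bytes.
assert xor_with_key(cipher, key) == plain
```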

The most effective way to explain it is with an example:

  • The data is the string "HELLO"
  • The key is the string "97523022"
"H"       "E"        "L"        "L"        "O"
 72        69         76         76         79
 01001000  01000101   01001100   01001100   01001111


"9"       "7"        "5"        "2"        "3"
 57        55         53         50         51
 00111001  00110111   00110101   00110010   00110011
 
     01001000  01000101   01001100   01001100   01001111
XOR  00111001  00110111   00110101   00110010   00110011
==>  01110001  01110010   01111001   01111110   01111100         
       113        114       121        126        124
HEX   \x71       \x72      \x79       \x7E       \x7C


pad each byte with a trailing \x00 (two bytes per character):
HEX    \x71\x00 \x72\x00 \x79\x00 \x7E\x00 \x7C\x00

encode \x71\x00\x72\x00\x79\x00\x7E\x00\x7C\x00 to base64

which gives : 'cQByAHkAfgB8AA=='
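The steps of this worked example can be reproduced in a few lines. This is a sketch under the assumption that the zero padding is equivalent to UTF-16LE encoding, which holds for characters below 256; `encrypt_example` is an illustrative name, not something taken from the site's JavaScript.

```python
import base64


def encrypt_example(text: str, key: str) -> str:
    """XOR text with the repeating key, widen to 2 bytes per char, base64-encode."""
    xored = "".join(
        chr(ord(c) ^ ord(key[i % len(key)])) for i, c in enumerate(text)
    )
    # UTF-16LE writes each of these characters as its byte followed by \x00.
    return base64.b64encode(xored.encode("utf-16-le")).decode("ascii")


print(encrypt_example("HELLO", "97523022"))  # cQByAHkAfgB8AA==
```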

Try decrypting it with this code (the same logic as the code at the top of this answer):

import base64

key = "97523022"
payload = "cQByAHkAfgB8AA=="

data = base64.b64decode(payload)

cipher = "".join([
    chr(data[i]) 
    for i in range(0,len(data),2)
])
data = ""
for i in range(0, len(cipher)):
    c_num = ord(cipher[i])
    k_num = ord(key[i % len(key)])
    c2 = c_num ^ k_num
    data += chr(c2)

print(data)

Output:

HELLO

If you are interested, you can also check out this link and this wiki.

