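bash - How to log in to my WSJ account from the Linux terminal (using curl, OAuth 2.0)
Problem description
I am a paying WSJ subscriber, and I want to log in to my WSJ account from the Linux terminal so that I can write code to scrape some articles for my NLP research. I will not publish any of the data.
My approach is based on an earlier answer to "Scrape articles from wsj by requests, CURL and BeautifulSoup". The code there worked at the time but no longer does; the main problem is that WSJ has apparently switched to a different OAuth 2.0 flow. First, I cannot obtain connection by running login_url. I think this is the bottleneck, because connection is a required field for the next step.
Another thing I noticed is the use of a state parameter. I don't know how to use this field. After running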
curl -s 'https://sso.accounts.dowjones.com/authorize?scope=openid+idp_id+roles+email+given_name+family_name+djid+djUsername+djStatus+trackid+tags+prts&client_id=XXXXXXX&response_type=code&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&state=https://www.wsj.com&username=XXXXXX&password=XXXXXX'
it does return "Found. Redirecting to /login?state=XXXX....", but I am not sure how to use the state parameter after this step.
Some references I used: https://developer.dowjones.com/site/global/develop/authentication/index.gsp#2-exchanging-the-authorization-code-for-authn-tokens-98 https://oauth.net/2/
username="user@gmail.com"
password="YourPassword"
login_url=$(curl -s -I "https://accounts.wsj.com/login")
connection=$(echo "$login_url" | grep -oP "Location:\s+.*connection=\K(\w+)")
client_id=$(echo "$login_url" | grep -oP "Location:\s+.*client_id=\K(\w+)")
#connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}')
#client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}')
rm -f cookies.txt
IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
--data-urlencode "username=$username" \
--data-urlencode "password=$password" \
--data-urlencode "connection=$connection" \
--data-urlencode "client_id=$client_id" \
--data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')
# replace the HTML entity &quot; with a literal double quote
wctx=$(echo "$wctx" | sed 's/&quot;/"/g')
code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
--data-urlencode "wa=$wa" \
--data-urlencode "wresult=$wresult" \
--data-urlencode "wctx=$wctx" | grep -oP "Location:\s+\K(\S*)")
curl -s -c cookies.txt "$code_url"
# here call your URL loading cookies.txt
curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
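Solution
The /usernamepassword/login request now needs a few more parameters: state and nonce. Also, the connection field no longer seems to be present in the Location header; it is hardcoded in a JS file instead.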
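As a minimal sketch of pulling connection out of a JS file, the snippet below applies the same `connection: "..."` pattern that the scripts in this answer use. The helper name `extract_connection` and the sample JS fragment are assumptions made for illustration; the real bundle may be laid out differently.

```python
import re


def extract_connection(js_source):
    """Return the first connection name found in the JS source, or None."""
    match = re.search(r'connection:\s*"(\w+)"', js_source)
    return match.group(1) if match else None


# Hypothetical fragment standing in for the real JS file.
sample_js = 'var opts={tenant:"sso",connection:"ExampleConnection",sso:true};'
print(extract_connection(sample_js))
```

Returning None rather than raising lets the caller detect that the site's layout has changed before sending a login request with an empty field.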
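The credential details (client ID, state, nonce) are embedded as Base64-encoded JSON inside a script tag on https://accounts.wsj.com/login.
You can update the bash script as follows. It uses curl, jq, sed and pup: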
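The decoding step can be tried offline. The sketch below assumes the page wraps the blob in a `Base64.decode('...')` call, which is the pattern the scripts in this answer search for; the helper name `extract_login_config` and the sample values are hypothetical.

```python
import base64
import json
import re


def extract_login_config(html):
    """Decode the Base64 JSON passed to Base64.decode('...'), or return None."""
    match = re.search(r"Base64\.decode\('([^']*)'", html)
    if match is None:
        return None
    return json.loads(base64.b64decode(match.group(1)))


# Build a hypothetical page so the example runs without network access.
sample_config = {
    "clientID": "XXXXXXX",
    "internalOptions": {"state": "s-123", "nonce": "n-456"},
}
encoded = base64.b64encode(json.dumps(sample_config).encode("utf-8")).decode("ascii")
sample_html = "<script>var c = JSON.parse(Base64.decode('{}'));</script>".format(encoded)

config = extract_login_config(sample_html)
print(config["clientID"])
print(config["internalOptions"]["state"])
print(config["internalOptions"]["nonce"])
```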
#!/bin/bash
username="your_email@gmail.com"
password="your_password"
base_url="https://accounts.wsj.com"
rm -f cookies.txt
login_page=$(curl -s -L -c cookies.txt "$base_url/login")
jspage=$(echo "$login_page" | pup 'script attr{src}' | grep "app-min")
connection=$(curl -s "$base_url$jspage" | sed -rn "s/.*connection:\s*\"(\w+)\".*/\1/p" | head -1)
credentials=$(echo "$login_page" | \
sed -rn "s/.*Base64\.decode\('(.*)'.*/\1/p" | \
base64 -d | \
jq -r '.internalOptions.state, .internalOptions.nonce, .clientID')
read state nonce clientID < <(echo $credentials)
echo "state: $state"
echo "nonce: $nonce"
echo "client_id: $clientID"
echo "connection: $connection"
login_result=$(curl -s -b cookies.txt -c cookies.txt 'https://sso.accounts.dowjones.com/usernamepassword/login' \
--data-urlencode "username=$username" \
--data-urlencode "password=$password" \
--data-urlencode "connection=$connection" \
--data-urlencode "client_id=$clientID" \
--data-urlencode "state=$state" \
--data-urlencode "nonce=$nonce" \
--data-urlencode "scope=openid idp_id roles email given_name family_name djid djUsername djStatus trackid tags prts" \
--data 'tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | \
pup 'input json{}' | jq -r '.[] | .value')
read wa wresult wctx < <(echo $login_result)
wctx=$(echo "$wctx" | sed 's/&quot;/"/g') # replace the HTML entity &quot; with a literal double quote
echo "wa: $wa"
echo "wresult: $wresult"
echo "wctx: $wctx"
callback=$(curl -s -b cookies.txt -c cookies.txt -L 'https://sso.accounts.dowjones.com/login/callback' \
--data-urlencode "wa=$wa" \
--data-urlencode "wresult=$wresult" \
--data-urlencode "wctx=$wctx")
#try this one to get an article, your username should be embedded in the page as logged in user
#curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
However, this bash script is painful to maintain; I would suggest using a Python script like this one instead:
import base64
import json
import re

import requests
from bs4 import BeautifulSoup

username = "your_email@gmail.com"
password = "your_password"
base_url = "https://accounts.wsj.com"

# The session keeps cookies across all requests below.
session = requests.Session()
r = session.get("{}/login".format(base_url))
soup = BeautifulSoup(r.text, "html.parser")

# Locate the app-min JS file, which hardcodes the connection name.
jscript = [
    t.get("src")
    for t in soup.find_all("script")
    if t.get("src") is not None and "app-min" in t.get("src")
][0]

# The client configuration is Base64-encoded JSON inside an inline script.
credentials_search = re.search(r"Base64\.decode\('(.*)'", r.text, re.IGNORECASE)
base64_decoded = base64.b64decode(credentials_search.group(1))
credentials = json.loads(base64_decoded)

print("client_id : {}".format(credentials["clientID"]))
print("state : {}".format(credentials["internalOptions"]["state"]))
print("nonce : {}".format(credentials["internalOptions"]["nonce"]))
print("scope : {}".format(credentials["internalOptions"]["scope"]))

# Read the connection name from the JS file.
r = session.get("{}{}".format(base_url, jscript))
connection_search = re.search(r'connection:\s*"(\w+)"', r.text, re.IGNORECASE)
connection = connection_search.group(1)

# Submit the credentials together with state and nonce.
r = session.post(
    "https://sso.accounts.dowjones.com/usernamepassword/login",
    data={
        "username": username,
        "password": password,
        "connection": connection,
        "client_id": credentials["clientID"],
        "state": credentials["internalOptions"]["state"],
        "nonce": credentials["internalOptions"]["nonce"],
        "scope": credentials["internalOptions"]["scope"],
        "tenant": "sso",
        "response_type": "code",
        "protocol": "oauth2",
        "redirect_uri": "https://accounts.wsj.com/auth/sso/login",
    },
)

# The response is an HTML form; collect its named input fields (wa, wresult, wctx).
soup = BeautifulSoup(r.text, "html.parser")
login_result = {
    t.get("name"): t.get("value")
    for t in soup.find_all("input")
    if t.get("name") is not None
}

# Post the form fields to the callback to complete the login.
r = session.post(
    "https://sso.accounts.dowjones.com/login/callback",
    data=login_result,
)

# Check the connected user: the first name should be embedded in the article page.
r = session.get("https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y")
username_search = re.search(r'"firstName":\s*"(\w+)",', r.text, re.IGNORECASE)
if username_search:
    print("connected user : " + username_search.group(1))
else:
    print("no logged-in user found in the page")
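The form-parsing step in the script above can also be exercised offline. The following sketch collects named input fields with the standard library's `html.parser` instead of BeautifulSoup; the class name `InputCollector`, the helper `collect_form_fields`, and the sample form values are all hypothetical.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs from every named <input> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            self.fields[name] = attributes.get("value", "")


def collect_form_fields(html):
    parser = InputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical response body shaped like the hidden form described above.
sample_form = (
    '<form method="post" action="/login/callback">'
    '<input type="hidden" name="wa" value="example-wa">'
    '<input type="hidden" name="wresult" value="example-token">'
    '<input type="hidden" name="wctx" value="{&quot;strategy&quot;:&quot;example&quot;}">'
    "</form>"
)
fields = collect_form_fields(sample_form)
print(fields)
```

An HTML parser decodes entities such as `&quot;` in attribute values on its own, which is why the Python version needs no equivalent of the `sed` substitution used in the bash script.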