beautifulsoup4 python: processing parsed data

Problem description

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    auth_return = s.get('https://urproject.com/?page=com_auth_return')
    soup = bs(auth_return.text,'html.parser')

This is what I get back:

<script type="text/javascript">
document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';
</script>

From this, I want to extract EncData and EncKey. This is what I tried:

EncData = soup.find_all("EncData")
EncKey = soup.find_all("EncKey")

encdatanenckey = {'EncData':EncData,
             'EncKey':EncKey}

print(encdatanenckey)

The result I want is:

{'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}

How can I get this? Do I have to use a regular expression? I'm new to regex, so could you give me an example?

Tags: python, regex, beautifulsoup

Solution


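soup.find_all("EncData") looks for an <EncData> tag, which does not exist here: the values sit inside the JavaScript text of a <script> tag. So first use bs4 to extract the script content, then match the specific values with a regular expression.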

from bs4 import BeautifulSoup
import re

html = """
<script type="text/javascript" ...></script>
<script type="text/javascript">
document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';
</script>
"""
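soup = BeautifulSoup(html,'lxml')
# Keep only <script> tags that contain text; the empty one is skipped.
# (string= is the current name of the older text= argument.)
js_ = soup.find_all("script",string=True)
# Lookbehind for "<name>=", then match lazily up to the next & or quote.
regex = r"(?<={}\=).*?(?=&|\'|\")"
# Note: re.search returns None for a script without the parameter, so .group(0) would raise there.
EncData = [ re.search(regex.format("EncData"),url.text).group(0)  for url in js_]
EncKey = [ re.search(regex.format("EncKey"),url.text).group(0)  for url in js_]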

encdatanenckey = {'EncData':EncData,
             'EncKey':EncKey}

print(encdatanenckey)
# {'EncData': ['abcdefg1234'], 'EncKey': ['hijk9876']}
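
The answer above returns a list per key, while the question asks for plain strings. As an alternative sketch (not part of the original answer), the redirect URL can be pulled out of the raw response text with one small regex and the query string handed to the standard library's urllib.parse, which avoids writing a lookaround per parameter. The helper name extract_redirect_params is hypothetical, and the sketch assumes the page contains a single document.location = '...' assignment like the one shown in the question.

```python
import re
from urllib.parse import parse_qs, urlsplit

html = """
<script type="text/javascript">
document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';
</script>
"""

# Matches: document.location = '<url>' (single or double quotes).
REDIRECT_RE = re.compile(r"document\.location\s*=\s*['\"]([^'\"]+)['\"]")


def extract_redirect_params(html_text):
    """Hypothetical helper: return the query parameters of the first
    document.location redirect in html_text, or {} if there is none."""
    match = REDIRECT_RE.search(html_text)
    if match is None:
        return {}
    query = urlsplit(match.group(1)).query
    # parse_qs maps each name to a list of values; keep the first value only.
    return {name: values[0] for name, values in parse_qs(query).items()}


encdatanenckey = extract_redirect_params(html)
print(encdatanenckey)
# {'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
```

With the session from the question, this would be called as extract_redirect_params(auth_return.text). parse_qs also percent-decodes the values, which matters if EncData or EncKey ever contain escaped characters.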
