Find a URL in base64-encoded text inside JavaScript using Python

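Question

How can I use a regular expression in Python to find the base64-encoded content between document.write(Base64.decode(" and ")); in a given web page, and then find the URL inside the decoded content? The page looks like this: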

<div class="content_swf" style="position: relative;">
    <div class="player">
        <div class="mediaplayer" id="mediaplayer">
<script type="text/javascript">document.write(Base64.decode("PGlmcmFtZSB3aWR0aD0iMTAwJSIgaGVpZ2h0PSIxMDAlIiBhbGxvd2Z1bGxzY3JlZW4gd2Via2l0YWxsb3dmdWxsc2NyZWVuIG1vemFsbG93ZnVsbHNjcmVlbiBmcmFtZWJvcmRlcj0iMCIgc3JjPSJodHRwczovL3d3dy55b3V0dWJlLmNvbS93YXRjaD92PU9jajBzVkI5eWtZIiBzY3JvbGxpbmc9Im5vIj48L2lmcmFtZT4=="));</script>
        </div>
    </div>
</div>

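So the final URL I want to extract is https://www.youtube.com/watch?v=Ocj0sVB9ykY

I can decode the string with the code below, but I cannot extract the base64 string from the web page using a regular expression or BeautifulSoup.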

import requests, re, base64
v = "PGlmcmFtZSB3aWR0aD0iMTAwJSIgaGVpZ2h0PSIxMDAlIiBhbGxvd2Z1bGxzY3JlZW4gd2Via2l0YWxsb3dmdWxsc2NyZWVuIG1vemFsbG93ZnVsbHNjcmVlbiBmcmFtZWJvcmRlcj0iMCIgc3JjPSJodHRwczovL3d3dy55b3V0dWJlLmNvbS93YXRjaD92PU9jajBzVkI5eWtZIiBzY3JvbGxpbmc9Im5vIj48L2lmcmFtZT4=="
b64 = base64.b64decode(v).decode('utf-8')
print("Decoding: " + b64)
# The decoding above works, but the following does not:
html = base64.b64decode(url).decode('utf8')
url = re.findall(r'''<iframe\s*src=["']([^"']+)''', html)[0] 

It does not work.

Tags: python

Solution
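
Before the working code, it may help to see why the attempt in the question finds nothing. As posted, `url` is passed to `base64.b64decode` before it is ever assigned, and the pattern `<iframe\s*src=` only matches when `src` is the first attribute, whereas the decoded iframe has `width`, `height`, and other attributes in front of it. The sketch below illustrates the second point; the shortened `iframe` string is made up for illustration, and the looser pattern is just one possible alternative.

```python
import re

# Hypothetical, shortened iframe markup: src is NOT the first attribute.
iframe = (
    '<iframe width="100%" frameborder="0" '
    'src="https://www.youtube.com/watch?v=Ocj0sVB9ykY" scrolling="no"></iframe>'
)

# Pattern from the question: requires src immediately after "<iframe".
strict = re.findall(r'''<iframe\s*src=["']([^"']+)''', iframe)

# Looser pattern: allow any other attributes between "<iframe" and src.
loose = re.findall(r'''<iframe\b[^>]*?\bsrc=["']([^"']+)''', iframe)

print(strict)  # []
print(loose)   # ['https://www.youtube.com/watch?v=Ocj0sVB9ykY']
```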


import base64
import re

html = """
<div class="content_swf" style="position: relative;">
    <div class="player">
        <div class="mediaplayer" id="mediaplayer">
<script type="text/javascript">document.write(Base64.decode("PGlmcmFtZSB3aWR0aD0iMTAwJSIgaGVpZ2h0PSIxMDAlIiBhbGxvd2Z1bGxzY3JlZW4gd2Via2l0YWxsb3dmdWxsc2NyZWVuIG1vemFsbG93ZnVsbHNjcmVlbiBmcmFtZWJvcmRlcj0iMCIgc3JjPSJodHRwczovL3d3dy55b3V0dWJlLmNvbS93YXRjaD92PU9jajBzVkI5eWtZIiBzY3JvbGxpbmc9Im5vIj48L2lmcmFtZT4=="));</script>
        </div>
    </div>
</div>
"""

# Escape the dot so it matches a literal "." rather than any character.
b64_regex = re.compile(r'Base64\.decode\("([a-zA-Z0-9+/=]*)"\)')
b64_match = b64_regex.search(html)
assert b64_match is not None
b64_value = b64_match.group(1)
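# The sample payload ends in one "=" more than its length requires, and strict
# validation on newer Python versions may reject it, so normalize the padding first.
b64_value = b64_value.rstrip("=")
b64_value += "=" * (-len(b64_value) % 4)
iframe = base64.b64decode(b64_value, validate=True).decode("utf-8")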

src_regex = re.compile(r'src="([^"]*)"')
src_match = src_regex.search(iframe)
assert src_match is not None
url_value = src_match.group(1)

print(url_value)  # prints: https://www.youtube.com/watch?v=Ocj0sVB9ykY
