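python - How to extract useful information from output using regex or BeautifulSoup
Problem description
This is a code snippet I have been working on for a few days. It tries to get the list of subdomains for a domain from dnsdumpster.com. However, it prints a lot of data that I don't need.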
with requests.Session() as s:
    url = 'https://dnsdumpster.com'
    response = s.get(url, headers=headers, proxies=proxies)
    response.encoding = 'utf-8' # Optional: requests infers this internally
    soup1 = BeautifulSoup(response.text, 'html.parser')
    input = soup1.find_all('input')
    csrfmiddlewaretoken_raw = str(input[0])
    csrfmiddlewaretoken = csrfmiddlewaretoken_raw[55:119]
    data = {
        'csrfmiddlewaretoken' : csrfmiddlewaretoken,
        'targetip' : domain
    }
    send_data = s.post(url, data=data, proxies=proxies, headers=headers)
    print(send_data.status_code)
    soup2 = BeautifulSoup(send_data.text, 'html.parser')
    td = soup2.find_all('td', {"class": "col-md-4"})
    for i in range(len(td)):
        item = str(td[i])
        subdomain = item[0:100]
        print(subdomain)
This is the output:
<td class="col-md-4">ns.example.co.eu.<br/>
<a class="external nounderline" data-target="#myModal" d
<td class="col-md-4">0 ex-am-ple.mail.protection.outlook.com.<br/>
<a class="external nounderlin
<td class="col-md-4">blog.example.co.eu<br/>
<a class="external nounderline" data-target="#myModal"
<td class="col-md-4">dari.kardan.edu.af<br/>
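I want just the subdomain names, without the HTML tags and the unrelated data. As you can see, the entries are not uniform: some end with a trailing dot and some start with a number. Can anyone help me do this with a regular expression, or is there a way to get the information I want with BeautifulSoup?
Solution
First I take the text of each table cell, which drops the HTML tags. Then, in the last few lines, I clean each subdomain by removing the trailing dot if there is one.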
import requests
from bs4 import BeautifulSoup

# headers, proxies, and domain are assumed to be defined earlier, as in the question.
with requests.Session() as s:
    url = 'https://dnsdumpster.com'
    response = s.get(url, headers=headers, proxies=proxies)
    response.encoding = 'utf-8'  # Optional: requests infers this internally
    soup1 = BeautifulSoup(response.text, 'html.parser')
    inputs = soup1.find_all('input')  # renamed so the built-in input() is not shadowed
    csrfmiddlewaretoken_raw = str(inputs[0])
    # Fragile: relies on the token sitting at a fixed offset in the first <input> tag
    csrfmiddlewaretoken = csrfmiddlewaretoken_raw[55:119]
    data = {
        'csrfmiddlewaretoken': csrfmiddlewaretoken,
        'targetip': domain,
    }
    send_data = s.post(url, data=data, proxies=proxies, headers=headers)
    print(send_data.status_code)
    soup2 = BeautifulSoup(send_data.text, 'html.parser')
    td = soup2.find_all('td', {"class": "col-md-4"})

    # Collect the visible text of each cell, without the HTML tags
    mydomain = [cell.text.strip() for cell in td]

    # Remove a single trailing dot from fully qualified names
    filtered_subdomain = [sub[:-1] if sub.endswith('.') else sub for sub in mydomain]
    print(filtered_subdomain)
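Since the question also asks about regular expressions, here is a minimal sketch of a regex-based cleanup step. It assumes the cell text looks like the samples in the question: an optional leading number (as in the `0 ex-am-ple.mail.protection.outlook.com.` row, which looks like a mail server priority), then a hostname, then an optional trailing dot. The names `HOST_RE` and `clean_hostname` are hypothetical helpers for illustration, not part of the original answer.

```python
import re

# Assumed cell layout: optional "<number> " prefix, a dotted hostname,
# and an optional trailing dot. Anything after the hostname is ignored.
HOST_RE = re.compile(r"^\s*(?:\d+\s+)?([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)\.?")


def clean_hostname(cell_text):
    """Return the bare hostname found at the start of cell_text, or None."""
    match = HOST_RE.match(cell_text)
    return match.group(1) if match else None


# Cell texts modeled on the output shown in the question
samples = [
    "ns.example.co.eu.",
    "0 ex-am-ple.mail.protection.outlook.com.",
    "blog.example.co.eu",
]
print([clean_hostname(text) for text in samples])
# ['ns.example.co.eu', 'ex-am-ple.mail.protection.outlook.com', 'blog.example.co.eu']
```

In the scraping code above, this could be applied to each cell's text in place of the `endswith('.')` check, for example `clean_hostname(cell.text)`. Unlike the original cleanup, it also drops the leading number, so whether that is wanted depends on how the results are used.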