How to extract useful information from output using regular expressions or BeautifulSoup

Problem description

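This is a code snippet I have been working on for a few days. It tries to get a list of subdomains for a domain using dnsdumpster.com. However, it prints a lot of data that I don't even need.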

with requests.Session() as s:
    url = 'https://dnsdumpster.com'
    response = s.get(url, headers=headers, proxies=proxies)
    response.encoding = 'utf-8' # Optional: requests infers this internally
    soup1 = BeautifulSoup(response.text, 'html.parser')
    input = soup1.find_all('input')
    csrfmiddlewaretoken_raw = str(input[0])
    csrfmiddlewaretoken = csrfmiddlewaretoken_raw[55:119]
    data = {
        'csrfmiddlewaretoken' : csrfmiddlewaretoken,
        'targetip' : domain
    }
    send_data = s.post(url, data=data, proxies=proxies, headers=headers)
    print(send_data.status_code)
    soup2 = BeautifulSoup(send_data.text, 'html.parser')
    td = soup2.find_all('td', {"class": "col-md-4"})
    for i in range(len(td)):
        item = str(td[i])
        subdomain = item[0:100]
        print(subdomain)

This is the output:

<td class="col-md-4">ns.example.co.eu.<br/>
<a class="external nounderline" data-target="#myModal" d
<td class="col-md-4">0 ex-am-ple.mail.protection.outlook.com.<br/>
<a class="external nounderlin
<td class="col-md-4">blog.example.co.eu<br/>
<a class="external nounderline" data-target="#myModal"
<td class="col-md-4">dari.kardan.edu.af<br/>

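I want the subdomain names without the HTML tags and the irrelevant data. As you can see, the subdomains are not uniform. Can anyone help me with a regular expression, or is there a way to get the information I want using BeautifulSoup?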

Tags: python, beautifulsoup, python-requests

Solution


First I take the text of each matching cell, which drops the HTML tags, and then a final pass removes the trailing dot from each subdomain.

import requests
from bs4 import BeautifulSoup

# headers, proxies and domain are assumed to be defined earlier, as in the question.
with requests.Session() as s:
    url = 'https://dnsdumpster.com'
    response = s.get(url, headers=headers, proxies=proxies)
    response.encoding = 'utf-8' # Optional: requests infers this internally
    soup1 = BeautifulSoup(response.text, 'html.parser')
    # The first <input> on the page holds the CSRF token; read its value
    # attribute instead of slicing the tag's string form at fixed offsets.
    token_input = soup1.find('input')
    csrfmiddlewaretoken = token_input['value']
    data = {
        'csrfmiddlewaretoken' : csrfmiddlewaretoken,
        'targetip' : domain
    }
    send_data = s.post(url, data=data, proxies=proxies, headers=headers)
    print(send_data.status_code)
    soup2 = BeautifulSoup(send_data.text, 'html.parser')
    td = soup2.find_all('td', {"class": "col-md-4"})
    # .text returns the cell's text without tags; strip() trims whitespace
    mydomain = [cell.text.strip() for cell in td]

# Remove the trailing dot from fully qualified names
filtered_subdomain = [sub[:-1] if sub.endswith('.') else sub for sub in mydomain]

print(filtered_subdomain)
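
Note that .text returns all the text inside a cell, including any text from the links after the <br/>, and that MX rows keep their leading priority number (the "0 " in the output above). A possible refinement, sketched below, takes only the first text node of each cell and cleans it with a regular expression. The SAMPLE_HTML string, the extract_hosts helper and the HOST_RE pattern are illustrative names, not part of the original answer, and the sample markup is modeled on the output shown in the question rather than on a live dnsdumpster.com response.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the output in the question (an assumption,
# not a captured response from the site).
SAMPLE_HTML = """
<table>
<tr><td class="col-md-4">ns.example.co.eu.<br/>
<a class="external nounderline" data-target="#myModal" href="#">lookup</a></td></tr>
<tr><td class="col-md-4">0 ex-am-ple.mail.protection.outlook.com.<br/>
<a class="external nounderline" href="#">lookup</a></td></tr>
<tr><td class="col-md-4">blog.example.co.eu<br/>
<a class="external nounderline" data-target="#myModal" href="#">lookup</a></td></tr>
</table>
"""

# Optional numeric MX priority, then the hostname, then an optional trailing dot.
HOST_RE = re.compile(r"^(?:\d+\s+)?([A-Za-z0-9.-]+?)\.?$")


def extract_hosts(html):
    """Return the hostnames found in td.col-md-4 cells (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    hosts = []
    for cell in soup.find_all("td", class_="col-md-4"):
        # stripped_strings yields each non-empty text node in order, so the
        # first one is the text before <br/>, not the text of later links.
        first_text = next(cell.stripped_strings, "")
        match = HOST_RE.match(first_text)
        if match:
            hosts.append(match.group(1))
    return hosts


print(extract_hosts(SAMPLE_HTML))
# ['ns.example.co.eu', 'ex-am-ple.mail.protection.outlook.com', 'blog.example.co.eu']
```

With the real response, the same helper could be called as extract_hosts(send_data.text), assuming the page keeps the hostname as the first text node of each cell.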
