How to search a URL's page for a string and return the whole line if a match is found

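Problem description

This snippet partially works, but it also produces redundant output, and I need help getting it to work fully. I am searching a page for a set of strings; if a partial or full match is found, the whole line should be returned.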

from bs4 import BeautifulSoup as bs
import requests

addrlist = ['0xe56842ed550ff2794f010738554db45e60730371',
            '0xe1fd7b4c9debac3c490d8a553c455da4979482e4',
            '0x88c20beda907dbc60c56b71b102a133c1b29b053']

queries = ["Website", "Telegram", "https://www.", "Twitter", "https://t.me"]
baseurl = "https://bscscan.com/address/"


for i in addrlist:
    url = str(baseurl) + str(i)

    r = requests.get(url)
    soup = bs(r.text, 'lxml')

    pre = soup.select_one('pre.js-sourcecopyarea.editor')
    ss = (list(pre.stripped_strings)[0]).split('*')
    for s in ss:
        for query in queries:
            if query in s:
                print(s)

Current output:

Website: https://binemon.io             #output repeated 4x in actual run
Telegram: https://t.me/binemonchat      
Twitter: https://twitter.com/binemonnft 

// SPDX-License-Identifier: UNLICENSED  #output repeated 4x in actual run
// IERC20.sol

Website: www.shibuttinu.com             #output repeated 1x only
Telegram: https://t.me/Shibuttinu

Desired output:

Website: https://binemon.io
Telegram: https://t.me/binemonchat
Twitter: https://twitter.com/binemonnft

// Telegram : https://t.me/stackdogebsc
// Website : https://www.stack-doge.com

*Website: www.shibuttinu.com
*Telegram: https://t.me/Shibuttinu

Tags: python, python-3.x, beautifulsoup

Solution


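You can use a regular expression to extract the URLs. The original loop prints a chunk once for every query it contains, whereas a single compiled pattern reports each matching line only once: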

import re
import requests
from bs4 import BeautifulSoup as bs

addrlist = [
    "0xe56842ed550ff2794f010738554db45e60730371",
    "0xe1fd7b4c9debac3c490d8a553c455da4979482e4",
    "0x88c20beda907dbc60c56b71b102a133c1b29b053",
]

queries = ["Website", "Telegram", "https://www.", "Twitter", "https://t.me"]
baseurl = "https://bscscan.com/address/"

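# One alternation: each query followed by the rest of its line
# ("." does not match a newline, so every match stops at the line end).
r_pat = re.compile("|".join("{}.*".format(re.escape(q)) for q in queries))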


for i in addrlist:
    url = str(baseurl) + str(i)

    r = requests.get(url)
    soup = bs(r.text, "lxml")

    pre = soup.select_one("pre.js-sourcecopyarea.editor")

    print(url)
    print()
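    # findall() returns non-overlapping matches, so each line is reported
    # at most once, starting at the first query it contains.
    for m in r_pat.findall(pre.string):
        print(m.strip())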
    print("-" * 80)

Prints:

https://bscscan.com/address/0xe56842ed550ff2794f010738554db45e60730371

Website: https://binemon.io
Telegram: https://t.me/binemonchat
Twitter: https://twitter.com/binemonnft
--------------------------------------------------------------------------------
https://bscscan.com/address/0xe1fd7b4c9debac3c490d8a553c455da4979482e4

Telegram : https://t.me/stackdogebsc
Website : https://www.stack-doge.com
--------------------------------------------------------------------------------
https://bscscan.com/address/0x88c20beda907dbc60c56b71b102a133c1b29b053

Website: www.shibuttinu.com
Telegram: https://t.me/Shibuttinu
--------------------------------------------------------------------------------
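
Note that the pattern returns each line from the first matched query onward, so a leading `//` or `*` is dropped. If the whole line is wanted, as in the desired output, one option is to test each line with `any()`. The following is a minimal offline sketch, not part of the original answer: the `sample` text and the helper names `matches_from_query` and `whole_lines` are hypothetical, and no request is made to the site.

```python
import re

queries = ["Website", "Telegram", "https://www.", "Twitter", "https://t.me"]

# Same pattern as in the answer: a query followed by the rest of its line.
r_pat = re.compile("|".join("{}.*".format(re.escape(q)) for q in queries))


def matches_from_query(text):
    """Return each match, starting at the first query found on a line."""
    return [m.strip() for m in r_pat.findall(text)]


def whole_lines(text):
    """Return every full line that contains at least one query, once each."""
    return [
        line.strip()
        for line in text.splitlines()
        if any(q in line for q in queries)
    ]


# Hypothetical sample, shaped like the comment header of a contract source.
sample = """/*
 * Website: www.example.com
 * Telegram: https://t.me/example
 */
// SPDX-License-Identifier: UNLICENSED
pragma solidity ^0.8.0;
"""

print(matches_from_query(sample))
print(whole_lines(sample))
```

Because `any()` stops at the first query that matches, a line containing several queries (for example both `Telegram` and `https://t.me`) is still returned only once.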
