How to search for predefined strings and return the whole line if a match is found

Problem description

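This snippet partially works, in that it produces some results. I need help getting it to work completely. I am searching the page behind each URL for a set of strings, and if a partial match is found, the whole line should be returned.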

from bs4 import BeautifulSoup as bs
import requests

addrlist = ['0xe56842ed550ff2794f010738554db45e60730371',
           '0xe1fd7b4c9debac3c490d8a553c455da4979482e4',
           '0x88c20beda907dbc60c56b71b102a133c1b29b053']

queries = ["Website", "Telegram", "https://www.", "Twitter", "https://t.me"]
url = "https://bscscan.com/address/"


for i in addrlist:
      url = str(url) + str(i)

      r = requests.get(url)
      soup = bs(r.text,'lxml')

      pre = soup.select_one('pre.js-sourcecopyarea.editor')
      ss = (list(pre.stripped_strings)[0]).split('*')
      for s in ss:
             for query in queries:
                  if query in s:
                      print(s)
           

Current output:

Website: https://binemon.io
Telegram: https://t.me/binemonchat
Twitter: https://twitter.com/binemonnft

AttributeError: 'NoneType' object has no attribute 'stripped_strings'

Desired output:

Website: https://binemon.io
Telegram: https://t.me/binemonchat
Twitter: https://twitter.com/binemonnft

// Telegram : https://t.me/stackdogebsc
// Website : https://www.stack-doge.com

*Website: www.shibuttinu.com
*Telegram: https://t.me/Shibuttinu

Tags: python, python-3.x, beautifulsoup

Solution


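The problem is the url variable. On each iteration you concatenate the next address from addrlist onto the previous url, so the addresses pile up. From the second iteration on, the request goes to a malformed address; select_one returns None when the page has no matching pre element, which is what raises the AttributeError: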

# 1st iteration:
https://bscscan.com/address/0xe56842ed550ff2794f010738554db45e60730371

# 2nd iteration:
https://bscscan.com/address/0xe56842ed550ff2794f010738554db45e607303710xe1fd7b4c9debac3c490d8a553c455da4979482e4

# 3rd iteration:
https://bscscan.com/address/0xe56842ed550ff2794f010738554db45e607303710xe1fd7b4c9debac3c490d8a553c455da4979482e40x88c20beda907dbc60c56b71b102a133c1b29b053

Change your code like this:

# url = "https://bscscan.com/address/"
baseurl = "https://bscscan.com/address/"

# url = str(url) + str(i)
url = str(baseurl) + str(i)
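
The effect of reusing the variable can be reproduced without any network access. The following is only a minimal sketch: the helper names build_urls_buggy and build_urls_fixed are hypothetical and do not appear in the original code, and the addresses are the ones from the question.

```python
BASE_URL = "https://bscscan.com/address/"


def build_urls_buggy(addresses):
    """Mimic the original loop: url is overwritten, so addresses accumulate."""
    url = BASE_URL
    urls = []
    for addr in addresses:
        url = url + addr  # appends to the result of the previous iteration
        urls.append(url)
    return urls


def build_urls_fixed(addresses):
    """Build every URL from the unchanged base string."""
    return [BASE_URL + addr for addr in addresses]


addrlist = ['0xe56842ed550ff2794f010738554db45e60730371',
            '0xe1fd7b4c9debac3c490d8a553c455da4979482e4',
            '0x88c20beda907dbc60c56b71b102a133c1b29b053']

# Each fixed URL is the base followed by exactly one address
for fixed_url in build_urls_fixed(addrlist):
    print(fixed_url)
```

Comparing the lengths of the strings returned by the two helpers makes the difference visible: the buggy variant grows with every iteration, while the fixed one stays constant.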

Update

Use a regular expression to extract the information.

Full code:

from bs4 import BeautifulSoup as bs
import requests
import re

addrlist = ['0xe56842ed550ff2794f010738554db45e60730371',
            '0xe1fd7b4c9debac3c490d8a553c455da4979482e4',
            '0x88c20beda907dbc60c56b71b102a133c1b29b053']

baseurl = "https://bscscan.com/address/"
pattern = r'(Website|Telegram|Twitter)\s*:\s*([^\s]+)'

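for addr in addrlist:
      # Build the URL from the unchanged base on every iteration
      url = baseurl + addr

      r = requests.get(url)
      soup = bs(r.text,'lxml')

      # select_one returns None if the page has no such element
      pre = soup.select_one('pre.js-sourcecopyarea.editor')

      print(url)
      # str(None) is "None", so a missing element simply yields no matches
      for match in re.findall(pattern, str(pre)):
          print(f"{match[0]}: {match[1]}")
      print()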

Output:

https://bscscan.com/address/0xe56842ed550ff2794f010738554db45e60730371
Website: https://binemon.io
Telegram: https://t.me/binemonchat
Twitter: https://twitter.com/binemonnft

https://bscscan.com/address/0xe1fd7b4c9debac3c490d8a553c455da4979482e4
Telegram: https://t.me/stackdogebsc
Website: https://www.stack-doge.com

https://bscscan.com/address/0x88c20beda907dbc60c56b71b102a133c1b29b053
Website: www.shibuttinu.com
Telegram: https://t.me/Shibuttinu
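
The pattern can also be tried offline on a piece of text, which is a convenient way to check what it captures before sending any requests. This is a sketch under assumptions: the function name extract_links and the sample string are made up for illustration, with the sample lines modeled on the desired output in the question.

```python
import re

# Same pattern as in the full code above, compiled once for reuse
PATTERN = re.compile(r'(Website|Telegram|Twitter)\s*:\s*([^\s]+)')


def extract_links(text):
    """Return (label, value) tuples for every 'Label: value' occurrence."""
    if not text:
        # Covers None and empty strings, e.g. when nothing was scraped
        return []
    return PATTERN.findall(text)


# Hypothetical sample text, not fetched from any site
sample = """
 * Website: https://binemon.io
 // Telegram : https://t.me/stackdogebsc
 *Website: www.shibuttinu.com
"""

for label, value in extract_links(sample):
    print(f"{label}: {value}")
```

Because \s*:\s* allows whitespace on both sides of the colon, "Telegram : ..." and "Website: ..." are both matched. Note that [^\s]+ captures everything up to the next whitespace, so any punctuation attached directly to a link would be included in the captured value.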
