How to get the outer <li> text with inner <li> or other tag text using BeautifulSoup in Python

    Problem description

    I only want to output the text for the outer li tags.

      from bs4 import BeautifulSoup
    
      html = BeautifulSoup("""
    
          <ul>
    
                <li><a href="#">B2B Marketing</a>
                       <ul>
                            <li><a href="offerings/b2bmarketing/outboundai.php"> Campagin </a></li>
                            <li><b>Inbound AI </b>Enrich inbound leads</a></li>
                       </ul>
               </li>
    
               <li>Marketing Data Analysis
                       <ul>
                            <li><a href="offerings/marketingdataanalysis/event360ai.php"><b>Event 360 AI </b></a></li>
                       </ul>
              </li>
    
              <li class="drop-down"><a href="#">Enrichment API</a>
              </li>
    
    
          </ul>
    
          """)
    
      print([i.text.strip() for i in html.findAll('li')])
    

    The output is the entire text of the HTML content, with every nested li included as well.

    ['B2B Marketing\n\n Campagin \nInbound AI Enrich inbound leads', 'Campagin', 'Inbound AI Enrich inbound leads', 'Marketing Data Analysis\n          \nEvent 360 AI', 'Event 360 AI', 'Enrichment API\n\nAPI  Technographics, Firmographics, Intent data', 'API  Technographics, Firmographics, Intent data']
    

    The output should be:

      [
       'B2B Marketing : Campagin, Enrich inbound leads',
       'Marketing Data Analysis : Event 360 AI',
       'Enrichment API'
      ]
    

    Please help me solve this.

    Tags: python, web-scraping, beautifulsoup, python-requests

    Solution


    How about this?

    from simplified_scrapy.simplified_doc import SimplifiedDoc
    html = '''<ul>
                <li><a href="#">B2B Marketing</a>
                       <ul>
                            <li><a href="offerings/b2bmarketing/outboundai.php"> Campagin </a></li>
                            <li><b>Inbound AI </b>Enrich inbound leads</a></li>
                       </ul>
               </li>
               <li>Marketing Data Analysis
                       <ul>
                            <li><a href="offerings/marketingdataanalysis/event360ai.php"><b>Event 360 AI </b></a></li>
                       </ul>
              </li>
              <li class="drop-down"><a href="#">Enrichment API</a>
              </li>
          </ul>
    '''
    doc = SimplifiedDoc(html)
    lis = doc.ul.lis
    out = []
    for li in lis:
      if li.b and li.b.nextText():
        li.removeElement('b')
      name = li.firstText() if li.firstText() else li.a.text
      tmp = ''
      for l in li.lis:
        tmp += l.text+','
      if tmp:
        out.append(name+':'+tmp[0:-1])
      else:
        out.append(name)
    print(out)
    

    Result:

    ['B2B Marketing:Campagin,Enrich inbound leads', 'Marketing Data Analysis:Event 360 AI', 'Enrichment API']
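
Since the question asks about BeautifulSoup specifically, here is a rough sketch of the same idea using only bs4. The helper names (`own_text`, `inner_text`, `outer_li_summaries`) are made up for illustration. The rule "drop a `<b>` label when the item has other text, keep it when it is the only text" is an assumption inferred from the expected output, not something the question states. The stray `</a>` in the second inner item is removed from the sample markup here, because parsers recover from it differently.

```python
from bs4 import BeautifulSoup, Tag

HTML = """
<ul>
  <li><a href="#">B2B Marketing</a>
    <ul>
      <li><a href="offerings/b2bmarketing/outboundai.php"> Campagin </a></li>
      <li><b>Inbound AI </b>Enrich inbound leads</li>
    </ul>
  </li>
  <li>Marketing Data Analysis
    <ul>
      <li><a href="offerings/marketingdataanalysis/event360ai.php"><b>Event 360 AI </b></a></li>
    </ul>
  </li>
  <li class="drop-down"><a href="#">Enrichment API</a>
  </li>
</ul>
"""


def squash(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def own_text(li):
    """Text of an li, ignoring any <ul> nested directly inside it."""
    parts = []
    for child in li.children:
        if isinstance(child, Tag):
            if child.name == "ul":
                continue  # skip the nested list, it is handled separately
            parts.append(child.get_text())
        else:
            parts.append(str(child))
    return squash("".join(parts))


def inner_text(li):
    """Text of an inner li without <b> labels (assumed rule, see above)."""
    outside_b = [s for s in li.find_all(string=True)
                 if s.find_parent("b") is None]
    # Fall back to the full text when the <b> label is all there is.
    return squash("".join(outside_b)) or squash(li.get_text())


def outer_li_summaries(markup):
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    # recursive=False restricts the search to direct children,
    # so only the outer li elements are visited here.
    for li in soup.find("ul").find_all("li", recursive=False):
        subs = [inner_text(sub)
                for ul in li.find_all("ul", recursive=False)
                for sub in ul.find_all("li", recursive=False)]
        name = own_text(li)
        results.append(f"{name} : {', '.join(subs)}" if subs else name)
    return results


print(outer_li_summaries(HTML))
```

The key point is `recursive=False`: the original `findAll('li')` walks the whole tree, so every nested li shows up both inside its parent's text and as its own entry.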
    
