How to get the text of anchor tags only when they are inside a paragraph, using Beautiful Soup?

Problem description

I am trying to parse scraped data with Beautiful Soup. What I need is to get all the visible data, i.e. everything in the article, plus the h1. In most cases the article text has links embedded in it, something like "I am in the <a href="/good_boy">class</a>". Now I want the text of that 'a' tag, but only when it is inside a paragraph. Below is my code.

    from selenium import webdriver

    from selenium.webdriver.chrome.options import Options
    import time
    from bs4 import BeautifulSoup
    import json
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.support.ui import Select
    from selenium.webdriver.common.action_chains import ActionChains
    from queue import Queue
    from threading import Thread
    options = Options()
    #data = []
    our_urls = []
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    #options.add_argument('--headless')
    options.add_argument("--no-sandbox")
    options.add_argument('--disable-dev-shm-usage')


    def foo():
        global our_urls
        with open('input_backup.json') as json_file:
            data = json.load(json_file)
            our_urls = data['urls']
            return our_urls


    def scraper_worker(q):
     try:
        while not q.empty():
            url = q.get()
            #print(url)
            driver = url[2]
            r = driver.get(url[1])
            last_height = driver.execute_script("return document.body.scrollHeight")

            while True:
                # Scroll down to bottom
                # make_true=False
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

                # Wait to load page
                time.sleep(10)

                # Calculate new scroll height and compare with last scroll height
                new_height = driver.execute_script("return document.body.scrollHeight")
                # soup = BeautifulSoup(driver.page_source, "html.parser")
                # print("inside loop" + driver.current_url + "\n\t" + soup.get_text())
                if new_height == last_height:
                    # If heights are the same it will exit the function
                    break
                last_height = new_height
            driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

            soup = BeautifulSoup(driver.page_source, "html.parser")
            whitelist = ['p', 'h1', 'a']

            blackList = ['[document]', 'noscript', 'div', 'footer',
                         'html', 'meta', 'head', 'input', 'script']
            text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]
            print("\n\t" + driver.current_url + "\n\t")
            print(text_elements)
            #page = pyquery(r.text)
            #data = page("#data").text()
            # do something with data
            driver.quit()
            q.task_done()
     except:
         pass

    # Create a queue and fill it
    urls = foo()
    #print(urls)
    mlen = len(urls)
    q = Queue()
    #for x in urls:
      #q.put(x)
    for i in range(len(urls)):
          # need the index and the url in each queue item.
        driver = webdriver.Chrome("./chromedriver", options=options)
        q.put((i, urls[i],driver))
    #map(q.put, urls)

    # Create 3 scraper workers
    for i in range(3):
        t = Thread(target=scraper_worker, args=(q, ))
        t.setDaemon(True)
        t.start()
    #print("waiting for queue to complete", jobs.qsize(), "tasks")
    q.join()

    print("all tasks completed")

Here is the URL for reference: Sample URL

Here is the output:

['Mail', 'News', 'Finance', 'Sports', 'Entertainment', 'Search', 'Mobile', 'More', 'Sign in', ' react-text: 10 ', 'Finance Home', ' /react-text ', ' react-text: 20 ', 'Watchlists', ' /react-text ', ' react-text: 23 ', 'My Portfolio', ' /react-text ', ' react-text: 26 ', 'Screeners', ' /react-text ', ' react-text: 29 ', 'Premium', ' /react-text ', ' react-text: 32 ', 'Markets', ' /react-text ', ' react-text: 35 ', 'Industries', ' /react-text ', ' react-text: 38 ', 'Personal Finance', ' /react-text ', ' react-text: 41 ', 'Videos', ' /react-text ', ' react-text: 44 ', 'News', ' /react-text ', ' react-text: 47 ', 'Technology', ' /react-text ', 'S&P 500', 'Dow 30', 'Nasdaq', 'Russell 2000', 'Crude Oil', "Tethers Unlimited says 'Terminator Tape' is speeding up a satellite's descent as planned", 'GeekWire', 'Bothell, Wash.-based', 'Tethers Unlimited', 'says that the "Terminator Tape", an experimental tether-based system designed to drag satellites out of orbit, is working the way it is supposed to.', "Georgia Tech's Prox-1 satellite", 'which was sent into orbit last June on a SpaceX Falcon Heavy rocket', 'in a news release', 'collaborating with Millennium Space Systems, TriSept and Rocket Lab on a test mission known as DragRacer', 'Hoyt told Space News', 'LEO Knight servicing robot', 'Tethers Unlimited teams up with TriSept to test a system for reducing orbital debris', "Tethers Unlimited works on technology for 'LEO Knight' satellite servicing robots", 'Tethers Unlimited says two-way radio for small satellites passes its first orbital test', 'Tethers Unlimited pulls the curtain off its mesh networking system for small satellites', 'Kolte Patil - Ivy Nia', 'Ad', 'Maruti Suzuki', 'Ad', 'Fateheducation', 'Ad', 'hear.com', 'Ad'] all tasks completed

So can anyone help me with how to get only the text of the article's heading and paragraphs? I am not getting the desired output, which is:

Bothell, Wash.-based Tethers Unlimited says that the "Terminator Tape", an experimental tether-based system designed to drag satellites out of orbit, is working the way it is supposed to.

The notebook-sized Terminator Tape system has been placed on several nanosatellites for testing — including Georgia Tech’s Prox-1 satellite, which was sent into orbit last June on a SpaceX Falcon Heavy rocket. Last September, the system’s 230-foot-long tether was strung out to add to the slight atmospheric drag experienced in low Earth orbit. “We can see from observations by the U.S. Space Surveillance Network that the satellite immediately began deorbiting over 24 times faster,” Tethers Unlimited CEO Rob Hoyt said in a news release.
That’s a good thing: Terminator Tape is meant to address the need to move retired satellites more quickly out of orbit, rather than having them add to the growing space-junk problem. “Instead of remaining in orbit for hundreds or thousands of years, the Prox-1 satellite will fall out of orbit and burn up in the upper atmosphere in under 10 years. … This successful test proves that this lightweight and low-cost technology is an effective means for satellite programs to meet orbital debris mitigation requirements,” Hoyt said.
Tethers Unlimited is currently collaborating with Millennium Space Systems, TriSept and Rocket Lab on a test mission known as DragRacer, due for launch this year. The mission will compare the deorbit rates for two identical satellites, one with Terminator Tape and one without, to characterize the system’s performance more precisely. Hoyt told Space News that in the years ahead, the system could be attached to defunct satellites in orbit using Tethers Unlimited’s planned LEO Knight servicing robot.

Tags: python-3.x, selenium, web-scraping, beautifulsoup

Solution
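
One way to get only the article's heading and paragraph text, and to keep the text of an 'a' tag only when that tag sits inside a paragraph, is to test each text node's parent: keep it if the parent is a p or h1, and keep anchor text only when the a has a p ancestor. Below is a minimal sketch along those lines (not taken from an accepted answer); it assumes html holds driver.page_source after the scrolling loop from the question, and the function names are only illustrative.

    from bs4 import BeautifulSoup, Comment

    # Minimal sketch: keep text whose parent is <p> or <h1>, and keep the
    # text of an <a> only when that <a> is nested inside a <p>.
    def visible_article_text(html):
        soup = BeautifulSoup(html, "html.parser")

        def keep(node):
            if isinstance(node, Comment):
                # Skip HTML comments such as the 'react-text' markers.
                return False
            parent = node.parent.name
            if parent in ("p", "h1"):
                return True
            # Anchor text is kept only when the <a> has a <p> ancestor,
            # so navigation and footer links are dropped.
            return parent == "a" and node.find_parent("p") is not None

        return [t.strip() for t in soup.find_all(text=True) if keep(t) and t.strip()]

    # Alternative: a paragraph's get_text() already includes the text of any
    # <a> nested inside it, so selecting h1 and p elements and joining their
    # text gives the headline plus the full paragraphs in one pass.
    def article_blocks(html):
        soup = BeautifulSoup(html, "html.parser")
        return [el.get_text(" ", strip=True) for el in soup.select("h1, p")]

Calling visible_article_text(driver.page_source) in place of the whitelist filter from the question should drop most of the navigation entries, while article_blocks(driver.page_source) returns the heading and paragraphs as whole strings, which is closer to the desired output shown above.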

