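python-3.x - How do I get the text of anchor tags only inside paragraphs using Beautiful Soup?
Problem description
I am trying to parse scraped data with Beautiful Soup. What I need is all the visible data, i.e. everything in the article, along with the h1. Most of the time the article contains links embedded in the text, something like "I am a 'href/good_boy' in my class". I want an 'a' tag only when it is inside a paragraph. Below is my code.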
import json
import time
from queue import Queue
from threading import Thread

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
#data = []
our_urls = []
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
#options.add_argument('--headless')
options.add_argument("--no-sandbox")
options.add_argument('--disable-dev-shm-usage')

def foo():
    # Load the list of URLs to scrape from the JSON input file
    global our_urls
    with open('input_backup.json') as json_file:
        data = json.load(json_file)
        our_urls = data['urls']
    return our_urls

def scraper_worker(q):
    try:
        while not q.empty():
            url = q.get()
            #print(url)
            driver = url[2]
            r = driver.get(url[1])
            last_height = driver.execute_script("return document.body.scrollHeight")
            while True:
                # Scroll down to bottom
                # make_true=False
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                # Wait to load page
                time.sleep(10)
                # Calculate new scroll height and compare with last scroll height
                new_height = driver.execute_script("return document.body.scrollHeight")
                # soup = BeautifulSoup(driver.page_source, "html.parser")
                # print("inside loop" + driver.current_url + "\n\t" + soup.get_text())
                if new_height == last_height:
                    # If heights are the same it will exit the loop
                    break
                last_height = new_height
            driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
            soup = BeautifulSoup(driver.page_source, "html.parser")
            whitelist = [
                'p', 'h1', 'a'
            ]
            blackList = ['[document]',
                         'noscript', 'div',
                         'footer',
                         'html',
                         'meta',
                         'head',
                         'input',
                         'script', ]
            text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]
            print("\n\t" + driver.current_url + "\n\t")
            print(text_elements)
            #page = pyquery(r.text)
            #data = page("#data").text()
            # do something with data
            driver.quit()
            q.task_done()
    except:
        pass

# Create a queue and fill it
urls = foo()
#print(urls)
mlen = len(urls)
q = Queue()
#for x in urls:
    #q.put(x)
for i in range(len(urls)):
    # need the index and the url in each queue item.
    driver = webdriver.Chrome("./chromedriver", options=options)
    q.put((i, urls[i], driver))
#map(q.put, urls)
# Create 3 scraper workers
for i in range(3):
    t = Thread(target=scraper_worker, args=(q, ))
    t.setDaemon(True)
    t.start()
#print("waiting for queue to complete", jobs.qsize(), "tasks")
q.join()
print("all tasks completed")
Here is the URL for reference: sample url
This is the output:
['Mail', 'News', 'Finance', 'Sports', 'Entertainment', 'Search', 'Mobile', 'More', 'Sign in', ' react-text: 10 ', 'Finance Home', ' /react-text ', ' react-text: 20 ', 'Watchlists', ' /react-text ', ' react-text: 23 ', 'My Portfolio', ' /react-text ', ' react-text: 26 ', 'Screeners', ' /react-text ', ' react-text: 29 ', 'Premium', ' /react-text ', ' react-text: 32 ', 'Markets', ' /react-text ', ' react-text: 35 ', 'Industries', ' /react-text ', ' react-text: 38 ', 'Personal Finance', ' /react-text ', ' react-text: 41 ', 'Videos', ' /react-text ', ' react-text: 44 ', 'News', ' /react-text ', ' react-text: 47 ', 'Tech', ' /react-text ', 'S&P 500', 'Dow 30', 'Nasdaq', 'Russell 2000', 'Crude Oil', 'Tethers Unlimited says ‘Terminator Tape’ is speeding up satellite’s descent as expected', 'GeekWire', 'Bothell, Wash.-based', 'Tethers Unlimited', 'says “Terminator Tape,” an experimental tether-based system designed to drag satellites out of orbit, is working the way it’s supposed to.', 'Georgia Tech’s Prox-1 satellite', 'which was sent into orbit last June on a SpaceX Falcon Heavy rocket', 'in a news release', 'collaborating with Millennium Space Systems, TriSept and Rocket Lab on a test mission known as DragRacer', 'Hoyt told Space News', 'LEO Knight servicing robot', 'Tethers Unlimited teams up with TriSept to test system to reduce orbital debris', 'Tethers Unlimited works on technologies for ‘LEO Knight’ satellite servicing robots', 'Tethers Unlimited says two-way radio for small satellites has passed its first orbital test', 'Tethers Unlimited raises the curtain on mesh network system for small satellites', 'Kolte Patil - Ivy Nia', 'Ad', 'Maruti Suzuki', 'Ad', 'Fateheducation', 'Ad', 'hear.com', 'Ad']
all tasks completed
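So can anyone help me get only the article text, that is, the heading and the paragraphs? I am not getting the desired output, which is:
Bothell, Wash.-based Tethers Unlimited says “Terminator Tape,” an experimental tether-based system designed to drag satellites out of orbit, is working the way it’s supposed to.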
The notebook-sized Terminator Tape system has been placed on several nanosatellites for testing — including Georgia Tech’s Prox-1 satellite, which was sent into orbit last June on a SpaceX Falcon Heavy rocket. Last September, the system’s 230-foot-long tether was strung out to add to the slight atmospheric drag experienced in low Earth orbit. “We can see from observations by the U.S. Space Surveillance Network that the satellite immediately began deorbiting over 24 times faster,” Tethers Unlimited CEO Rob Hoyt said in a news release.
That’s a good thing: Terminator Tape is meant to address the need to move retired satellites more quickly out of orbit, rather than having them add to the growing space-junk problem. “Instead of remaining in orbit for hundreds or thousands of years, the Prox-1 satellite will fall out of orbit and burn up in the upper atmosphere in under 10 years. … This successful test proves that this lightweight and low-cost technology is an effective means for satellite programs to meet orbital debris mitigation requirements,” Hoyt said.
Tethers Unlimited is currently collaborating with Millennium Space Systems, TriSept and Rocket Lab on a test mission known as DragRacer, due for launch this year. The mission will compare the deorbit rates for two identical satellites, one with Terminator Tape and one without, to characterize the system’s performance more precisely. Hoyt told Space News that in the years ahead, the system could be attached to defunct satellites in orbit using Tethers Unlimited’s planned LEO Knight servicing robot.
Solution
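One possible approach, offered as a minimal sketch and not a verified fix for the page in the question: the filter in the question keeps every text node whose direct parent is `p`, `h1`, or `a`, so navigation and ad links pass through, and the `react-text` entries are HTML comment nodes returned by `find_all(text=True)`. Selecting the `h1` and `p` elements themselves and calling `get_text()` on each keeps link text inside a paragraph and leaves out links elsewhere. The function names `article_text` and `paragraph_links` and the `SAMPLE_HTML` markup are hypothetical, made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the situation in the question:
# navigation links, a comment node, a paragraph with an embedded link, an ad.
SAMPLE_HTML = """
<html><body>
<nav><a href="/mail">Mail</a><a href="/news">News</a></nav>
<h1>Headline</h1>
<p><!-- react-text: 10 -->I am a <a href="/good_boy">good boy</a> in my class.</p>
<div><a href="/ad">Ad</a></div>
<p>   </p>
</body></html>
"""


def article_text(html):
    """Return the text of every <h1> and <p> element, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for tag in soup.find_all(["h1", "p"]):
        # get_text() concatenates all nested strings, so the text of an <a>
        # inside the paragraph is kept; comment nodes are not included.
        # split/join collapses runs of whitespace into single spaces.
        text = " ".join(tag.get_text().split())
        if text:  # skip empty paragraphs
            blocks.append(text)
    return blocks


def paragraph_links(html):
    """Return (text, href) for each <a> that has a <p> ancestor."""
    soup = BeautifulSoup(html, "html.parser")
    # "p a" is a descendant selector: anchors outside paragraphs do not match.
    return [(a.get_text(strip=True), a.get("href")) for a in soup.select("p a")]


if __name__ == "__main__":
    print(article_text(SAMPLE_HTML))
    print(paragraph_links(SAMPLE_HTML))
```

In the scraper from the question this would be called as `article_text(driver.page_source)` in place of the `text_elements` list comprehension. On a real page, paragraphs outside the article body may still match, so narrowing the search to the element that wraps the article first (which element that is depends on the site) is probably needed.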