Can I get the data in a b tag under an a tag with Python and Selenium?

Problem description

Can I get the data in a b tag under an a tag with Python and Selenium?

If so, how? Could you show me a solution?

This is the structure of the HTML:

...
<div class="cont_inner">
  <div class="wrap_tit_ mg_tit">
    <a href="https://cp.news.search.daum.net/p/97048679" class="f_link_b" onclick="smartLog(this, 'dc=NNS&d=26DQnlvsWTMHk5CtBf&pg=6&r=2&p=4&rc=10&e1=163cv75CcAF31EvlGD&e3=0&ext=dsid=26DQnlvsWTMHk5CtBf', event, {'cpid': {'value': '163cv75CcAF31EvlGD'}});" target="_blank">
        하남지역자활센터,
        <b>보건복지부</b>
        간이평가 우수기관
    </a>
  </div>
</div>

I want to get the data like this:


"하남지역자활센터, 보건복지부 간이평가우수기관"

This is what my code currently produces:

[['"하남지역자활센터, , 간이평가 우수기관"']]

This is my source code for crawling data from the site:

import requests
import lxml.html
from selenium import webdriver


class crwaler_daum:
    def __init__(self):
        self.title = []
        self.body = []
        self.url = input("please enter url for crawling data")
        self.page = input('please enter number of page to get data')

    def get_title(self):
        return self.title

    def set_title(self, title):
        self.title.append(title)

    def get_body(self):
        return self.body

    def set_body(self, body):
        self.body.append(body)

    def crwaling_title(self):
        title_list = []
        chrome_driver = webdriver.Chrome('D:/바탕 화면/인턴/python/crwaler/news_crawling/chromedriver.exe')
        url = self.url
        response = requests.get(url, verify=False)
        # note: root is parsed once from the first response, so it does not
        # change as the driver pages through the search results below
        root = lxml.html.fromstring(response.content)
        chrome_driver.get(url)

        for i in range(int(self.page) + 1):
            for j in root.xpath('//*[@id="clusterResultUL"]/li'):
                # a/text() returns only the direct text nodes of <a>,
                # so the text inside the nested <b> tag is dropped
                title_list.append(j.xpath('div[2]/div/div[1]/a/text()'))

            chrome_driver.get('https://search.daum.net/search?w=news&DA=PGD&enc=utf8&cluster=y&cluster_page=3&q=%EB%B3%B4%EA%B1%B4%EB%B3%B5%EC%A7%80%EB%B6%80&p={}'.format(i))

        print(title_list)
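For reference, staying with lxml, one way to pick up the nested <b> text is to take the string value of the whole <a> element instead of its direct text nodes. A sketch against the same XPath used above, assuming the surrounding loop is otherwise unchanged:

# inside crwaling_title(), replacing the title_list.append(...) line:
for j in root.xpath('//*[@id="clusterResultUL"]/li'):
    # string(...) concatenates the text of the <a> element and all
    # of its descendants, including the nested <b>
    title_list.append(j.xpath('string(div[2]/div/div[1]/a)').strip())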

Tags: python, selenium, web-crawler

Solution


I haven't used lxml for crawling, but you can use BeautifulSoup instead.

from bs4 import BeautifulSoup
from selenium import webdriver

# point this at your local chromedriver executable
chrome_driver = webdriver.Chrome('your chromedriver path')
chrome_driver.get('your url')

# parse the rendered page source with BeautifulSoup
html = chrome_driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# collect every <b> tag on the page
b_tag = soup.find_all('b')
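find_all('b') returns only the bold fragments themselves. If the goal is the full title string from the question, a follow-up sketch (the f_link_b class name is taken from the HTML above; get_text() walks every descendant text node of the anchor, including the nested <b>):

# each matching <a> holds one full title
for a in soup.find_all('a', class_='f_link_b'):
    print(a.get_text(' ', strip=True))
    # -> 하남지역자활센터, 보건복지부 간이평가 우수기관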
