Printing results without HTML elements

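Problem description

I am parsing a news website with BeautifulSoup 4, but I cannot strip the HTML elements to get plain text.

Another problem is that the publication date of each article is not in a date format. I would like to convert it to a proper date so that I can filter out news I do not need.

I would also like to know which format would be useful for storing the data. I will use it to train a model in ML.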

import requests
from bs4 import BeautifulSoup as bs

URL = 'http://marja.az/search?q='

# if a keyword contains a space, join its words with a + sign
KEYWORDS = ['Valizada',
            ]

for key in KEYWORDS:
    search_url = URL + key
    print(search_url)
    r = requests.get(search_url)
    soup = bs(r.content, "lxml")
    for data in soup.find_all("div", {"class": "searchNews"}):
        for a in data.find_all("a"):
            href = a.get("href")
            # print(href)
            link = "http://marja.az/" + href
            print(link)
            r1 = requests.get(link)
            soup1 = bs(r1.content, "lxml")
            header = soup1.findAll("h1", attrs={"class": "title"})
            print(header)
            paragraph = soup1.findAll("div", attrs={"class": "text"})
            for p in paragraph:
                print(p.findAll('p', text=True, recursive=False))
            date = soup1.findAll("div", attrs={"class": "left"})
            for d in date:
                print(soup1.find('div', {'style': 'color: #af0000; margin:10px 0px 10px 0px; font-size:12px; '
                                                  'font-weight:bold; text-align:left;'}))

Expected result:

Date, Header, Content
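
For a three-column layout like this, a CSV file is one common storage choice, since pandas and most ML tooling can read it directly. The following is only a minimal sketch using the standard library: the `FIELDS` list, the `write_articles` helper, and the sample row are hypothetical and do not come from the scraped site.

```python
import csv
import io

# Column names are an assumption, mirroring "Date, Header, Content".
FIELDS = ["date", "header", "content"]


def write_articles(articles, stream):
    """Write a list of article dicts to an open text stream as CSV."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(articles)


# Made-up sample row; the content holds a newline and a comma on purpose
# to show that the csv module quotes such fields for us.
articles = [
    {
        "date": "2019-03-15",
        "header": "Sample headline",
        "content": "First paragraph.\nSecond, with a comma.",
    },
]

buffer = io.StringIO(newline="")
write_articles(articles, buffer)

# Read the data back to check that it round-trips unchanged.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue(), newline="")))
print(rows[0]["header"])
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="", encoding="utf-8")`. JSON Lines would be a reasonable alternative if the articles later gain nested fields.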

Tags: python, python-3.x, beautifulsoup, python-requests

Solution

Try this:

soup1 = bs(r1.content, "lxml")
# header
header = soup1.find("h1", attrs={"class": "title"}).text
print(header)

# content
content = []
paragraph = soup1.find("div", attrs={"class": "text"}).findAll('p', text=True, recursive=False)
for p in paragraph:
    content.append(p.text)
# join and print once, after all paragraphs have been collected
content_text = "".join(content)
print(content_text)

# date
date = soup1.find('div', {'style': 'color: #af0000; margin:10px 0px 10px 0px; font-size:12px; font-weight:bold; text-align:left;'}).text

date = date.split(",")[0].split(" ")
date = date[0] + "." + date[1] + "." + date[2]
print(date)
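
The snippet above leaves the date as a string, so filtering still needs a real date object. Below is a minimal sketch of that last step. It assumes the joined string has the numeric form `DD.MM.YYYY`; the question does not show the page's actual date text, so the format string may need adjusting (for example, a month-name lookup if the site spells out months). The helper names `parse_news_date` and `is_recent` are hypothetical.

```python
from datetime import date, datetime
from typing import Optional


def parse_news_date(text: str) -> Optional[date]:
    """Parse a 'DD.MM.YYYY' string; return None if it does not match.

    The format is an assumption about what the split-and-join step yields.
    """
    try:
        return datetime.strptime(text.strip(), "%d.%m.%Y").date()
    except ValueError:
        return None


def is_recent(text: str, cutoff: date) -> bool:
    """Return True if the text parses to a date on or after the cutoff."""
    parsed = parse_news_date(text)
    return parsed is not None and parsed >= cutoff


cutoff = date(2019, 1, 1)  # arbitrary example cutoff
print(parse_news_date("15.03.2019"))     # 2019-03-15
print(is_recent("15.03.2019", cutoff))   # True
print(is_recent("not a date", cutoff))   # False
```

Returning `None` for unparseable text keeps one malformed article from stopping the whole crawl. Storing the result with `date.isoformat()` gives `YYYY-MM-DD` strings, which sort chronologically.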
