Words missing when extracting text with BeautifulSoup

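Problem description

I am using Beautiful Soup to extract data from ul and li tags. I can get the data, but some words are missing and there is no break between the lines.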

<li>Developing <span class="bte bte-78432-940">&nbsp;</span>pricing strategy that maximizes profits <span class="bte bte-78432-947">&nbsp;</span>market share <span class="bte bte-78432-962">&nbsp;</span>considers customer satisfaction</li>
<li>Supporting <span class="bte bte-78432-1041">&nbsp;</span>and <span class="bte bte-78432-1045">&nbsp;</span>launching</li>

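Text shown in the HTML view:
- Developing a pricing strategy that maximizes profits and market share but considers customer satisfaction
- Supporting sale and service launching

The text I get: Developing pricing strategy that maximizes profits market share considers customer satisfaction Supporting and launching

Words are missing, for example "a", "and", "sale" and "service". In addition, everything comes out on a single line, run together.

How can I get the exact text shown in the HTML view? If the bullets cannot be kept, there should at least be some separator between the bullet items.

Code snippet: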

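from bs4 import BeautifulSoup

# html holds the page source, fetched earlier
soup = BeautifulSoup(html, 'html.parser')
ul_jobdetail = soup.find_all('ul', {'class': 'job-detail-req'})
i = 1
for ul_jdetail in ul_jobdetail:
    if i == 1:
        # first list: duties
        duties = ul_jdetail.getText()
        print(ul_jdetail.text)
    else:
        # any later list: requirements
        requirements = ul_jdetail.getText()
    i = i + 1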

Tags: beautifulsoup

Solution


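The page appears to hide the missing words in CSS, so first load that stylesheet, parse it to get the required information (the missing words), and put those words back into the soup: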

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.bongthom.com/job_detail/various_positions_78432.html'

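soup = BeautifulSoup(requests.get(url).text, 'lxml')
# the stylesheet that carries the hidden words is linked with data-src="escape"
css_url = soup.select_one('link[data-src="escape"]')['href']

# the regex assumes each rule names a bte-* class followed by a quoted word
for css_class, word in re.findall(r'\.(bte-\d+-\d+).*?"(.*?)"', requests.get(css_url).text):
    for span in soup.select('span.{}'.format(css_class)):
        span.string = word + ' '   # replace the &nbsp; placeholder with the word
        span.unwrap()              # drop the <span> wrapper, keep its text

# print each list item on its own line
for li in soup.select('.job-detail-req li'):
    print(li.text)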

Prints:

Developing a pricing strategy that maximizes profits and market share but considers customer satisfaction
Supporting sale and service launching
Creating promotion, advertising and event planning
Developing and managing advertising campaigns
Organizing company conference, Trade shows, and major events
Building brand awareness
Evaluating and maintaining marketing strategy
Directing, planning and coordinating marketing plan
Researching market demand
Handling social media, public relation efforts, and marketing content
Build strategic relationships and partner with key industry players, and agencies
Be in charge of marketing budget and allocate
Up-to-date with the latest trends and best practices in online marketing and measurement
Identify weaknesses in existing marketing campaigns and develop pragmatic solution within budgetary constraints
Communicate with senior management about marketing initiatives and brainstorm fresh strategies
Bachelor degree in Marketing, Business Administration, Communication or relate field (MBA Preferred)
At least five years’ experience in Marketing and Promotion

...etc.
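
To try the same idea without fetching the live page, here is a minimal offline sketch. The HTML is cut down from the question, and the CSS string is a hypothetical stand-in: the rule shape `.class:before{content:"word"}` is an assumption about what the real stylesheet contains, and `fill_hidden_words` is an illustrative helper name, not part of any library. It needs the `bs4` package installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the real page and stylesheet.
# The CSS rule shape below is an assumption, chosen to match the regex.
HTML = """
<ul class="job-detail-req">
<li>Developing <span class="bte bte-78432-940">&nbsp;</span>pricing strategy</li>
<li>Supporting <span class="bte bte-78432-1041">&nbsp;</span>and <span class="bte bte-78432-1045">&nbsp;</span>launching</li>
</ul>
"""
CSS = """
.bte-78432-940:before{content:"a"}
.bte-78432-1041:before{content:"sale"}
.bte-78432-1045:before{content:"service"}
"""


def fill_hidden_words(html, css):
    """Return the text of each <li>, with CSS-supplied words filled in."""
    soup = BeautifulSoup(html, "html.parser")
    # Pair every bte-* class in the CSS with the quoted word that follows it.
    for css_class, word in re.findall(r'\.(bte-\d+-\d+).*?"(.*?)"', css):
        for span in soup.select("span." + css_class):
            span.string = word + " "  # overwrite the &nbsp; placeholder
            span.unwrap()             # remove the tag, keep the word
    return [li.get_text() for li in soup.select(".job-detail-req li")]


items = fill_hidden_words(HTML, CSS)
print("\n".join(items))
```

Collecting one string per `li` and joining with `"\n"` also addresses the separator part of the question: calling `getText()` on the whole `ul` concatenates every item, while iterating over the `li` elements keeps them apart.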
