Remove white space/blank space/newlines from scraped data

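Problem description

I scraped data from a URL using Beautiful Soup. After cleaning, the cleaned data still contains a lot of white space, blank spaces, and newlines. I tried .strip() to remove them, but they are still there.

Code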

from bs4 import BeautifulSoup
import requests
import re
URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
text = re.sub('([^\x00-\x7F]+)|(\n)|(\t)',' ', clean_data)
with open('read.txt', 'w') as file:
    file.writelines(text)

Output

   America the Beautiful: A Virtual Patriotic Salute   Flagstaff Symphony Orchestra                                                                                           Contact             Hit enter to search or ESC to close                                     About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets                  All Events   This event has passed. America the Beautiful: A Virtual Patriotic Salute  July 4, 2020         Violin Virtuoso Beethoven Virtual 5k             In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of  America the Beautiful  performed by 60 of their professional musicians, coming together virtually, to celebrate our nation s independence. CLICK HERE FOR DETAILS   + Google Calendar+ iCal Export     Details    Date:    July 4, 2020   Event Category: Concerts and Events             Violin Virtuoso Beethoven Virtual 5k                   Concert InfoConcerts Concerts and Events FAQs     FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members     Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards  (Used by permission of the Association of Fundraising Professionals)     ResourcesCommunity & Education For Musicians For Board Members             2021 Flagstaff Symphony Orchestra. 
           Copyright 2019 Flagstaff Symphony Association                             About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets   Contact  

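In the code above, I replaced the non-ASCII (Unicode) characters with ' ' (a space). If I don't replace them with a space, several words end up joined together. What I want is a single string without unnecessary spaces and newlines.

Additional question

I tried everything I could think of, such as strip() and re.sub(), to remove the whitespace at the start of some lines of the text, but nothing works for the following data: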

Subscription Tickets
 All Events
This event has passed.
America the Beautiful: A Virtual Patriotic Salute
July 4, 2020
 Violin Virtuoso
Beethoven Virtual 5k 

How can we remove these spaces?

Tags: python, re

Solution


You can try collapsing every run of whitespace into a single space:

print(re.sub(r'\s+', ' ', text).strip())
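
A likely reason .strip() alone appears to do nothing is that it only trims the two ends of the whole string, so spaces at the start of lines in the middle of the text are untouched. Below is a minimal sketch of two ways to normalize the text using only the standard library. The helper names collapse_whitespace and strip_lines and the sample string are made up for illustration; they are not part of the original code.

```python
import re

def collapse_whitespace(text: str) -> str:
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines, non-breaking spaces) and drops empty pieces,
    # so joining with one space yields a single clean line.
    return " ".join(text.split())

def strip_lines(text: str) -> str:
    # Variant that keeps one item per line: squeeze inner whitespace runs,
    # trim both ends of every line, and drop lines that end up empty.
    cleaned = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in cleaned if line)

# Hypothetical sample resembling the data in the question.
sample = "Subscription Tickets\n All Events\n\n\xa0Violin Virtuoso\t\nBeethoven Virtual 5k \n"

print(collapse_whitespace(sample))
print(strip_lines(sample))
```

Use collapse_whitespace when a single-line string is wanted, and strip_lines when the line breaks between items should be kept.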
