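python - Remove spaces/newlines from scraped data
Problem description
I scraped data from a URL using Beautiful Soup. After cleaning, the text still contains many extra spaces and newline characters. I tried .strip() to remove them, but they are still there.
Code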
from bs4 import BeautifulSoup
import requests
import re

URL = "https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
# .text already drops the tags, so the tag regex below is only a safety net
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile(r'<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
# Replace each non-ASCII run, newline, and tab with a single space
text = re.sub(r'([^\x00-\x7F]+)|(\n)|(\t)', ' ', clean_data)
with open('read.txt', 'w') as file:
    file.write(text)
Output
America the Beautiful: A Virtual Patriotic Salute Flagstaff Symphony Orchestra Contact Hit enter to search or ESC to close About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets All Events This event has passed. America the Beautiful: A Virtual Patriotic Salute July 4, 2020 Violin Virtuoso Beethoven Virtual 5k In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of America the Beautiful performed by 60 of their professional musicians, coming together virtually, to celebrate our nation s independence. CLICK HERE FOR DETAILS + Google Calendar+ iCal Export Details Date: July 4, 2020 Event Category: Concerts and Events Violin Virtuoso Beethoven Virtual 5k Concert InfoConcerts Concerts and Events FAQs FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards (Used by permission of the Association of Fundraising Professionals) ResourcesCommunity & Education For Musicians For Board Members 2021 Flagstaff Symphony Orchestra.
Copyright 2019 Flagstaff Symphony Association About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets Contact
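In the code above, I replaced the non-ASCII (Unicode) characters with ' ' (a space). If I do not replace them with a space, several words end up joined together. What I want is a single string with no unnecessary spaces or newlines.
Added question
I tried every approach I could think of, such as strip() and re.sub(), to remove the whitespace at the start of some lines in the text, but nothing works for the following data: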
Subscription Tickets
All Events
This event has passed.
America the Beautiful: A Virtual Patriotic Salute
July 4, 2020
Violin Virtuoso
Beethoven Virtual 5k
How can we remove these spaces?
Solution
You can try collapsing every run of whitespace into a single space:
print(re.sub(r'\s+', ' ', text).strip())
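The reason strip() alone appears to do nothing is that it only trims whitespace at the two ends of the whole string, not at the start of each line inside it. The following is a minimal sketch of two ways to handle that, using only the standard library. The helper names collapse_whitespace and strip_each_line and the sample string are hypothetical, chosen for illustration, and are not part of the original question or answer.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim both ends."""
    # \s matches spaces, tabs, newlines, and Unicode whitespace such as \xa0
    return re.sub(r"\s+", " ", text).strip()


def strip_each_line(text: str) -> str:
    """Trim each line separately and drop lines that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


# Hypothetical sample imitating the indented lines from the question
sample = "  Subscription Tickets\n\n\t All Events\n   This event has passed. \n"

print(collapse_whitespace(sample))  # everything on one line
print(strip_each_line(sample))      # one item per line, no indentation
```

Use collapse_whitespace when a single-line string is wanted, as in the question. Use strip_each_line when the line breaks between items should be kept.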