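python - Detecting URLs in Python and removing them from a text file
Problem description
I took an existing Python script and edited it to my liking. It writes the first 20 tweets from a given page to a text file.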
from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "tweets.txt"
f = open(file, "w")
url = "https://twitter.com/BBCWorld"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# Gets the tweet
tweets = soup.find_all("li", attrs = {"class":"js-stream-item"})

# Writes tweet fetched in file
for tweet in tweets:
    try:
        if tweet.find('p',{'tweet-text'}):
            tweet_text = tweet.find('p',{'tweet-text'}).text.encode('utf8').strip()
            # tweet_user = tweet.find('span',{"class":'username'}).text.strip()
            # replies = tweet.find('span',{"class":"ProfileTweet-actionCount"}).text.strip()
            # retweets = tweet.find('span', {"class" : "ProfileTweet-action--retweet"}).text.strip()
            # String interpolation technique
            f.write(f'{tweet_text}\n')
    except: AttributeError
f.close()
However, when the tweets are written to the file, they look like this (using the BBCWorld feed as an example):
b'Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\xc2\xa0'
b'Guernsey asbestos cancer compensation scheme to launchhttps://bbc.in/2qQD9OE\xc2\xa0'
b'Construction firm fined \xc2\xa310k for Jersey water pollutionhttps://bbc.in/2KgIk19\xc2\xa0'
b'US election 2020: Deval Patrick announces presidential bidhttps://bbc.in/32QbdHH\xc2\xa0'
b'Knottfield: Joseph Marshall indecent assault trial delayedhttps://bbc.in/2XcXYjg\xc2\xa0'
b"Hugo Carvajal: Venezuelan ex-spy chief's disappearance 'a scandal'https://bbc.in/34VIwdY\xc2\xa0"
b'What fate awaits those former members of Islamic State being expelled from Turkey?https://www.bbc.co.uk/news/50396607\xc2\xa0'
b"Notre Dame: Army general tells architect to 'shut his mouth'https://bbc.in/2qVP7pX\xc2\xa0"
b'Six years after a Boeing 737-500 crashed in Kazan, Russian investigators conclude that the pilot wasn\xe2\x80\x99t qualified to fly the plane & had used falsified documents to get his job with (now defunct) Tatarstan Airlines. 50 people were killed.'
b'South Africa rugby stars strip off for cancer challengehttps://bbc.in/2rJF2Nv\xc2\xa0'
b"Diabetes: UN to tackle 'overly expensive' insulin priceshttps://bbc.in/2Op0nUf\xc2\xa0"
b'US Senator blocks move to say Armenian mass killing was genocidehttps://bbc.in/2QfjjHr\xc2\xa0'
b"Turkey to extradite American IS suspect 'stranded on border'https://bbc.in/33OJ1X1\xc2\xa0"
b'Father and daughter ballet video breaks stereotypes, says teacherhttps://bbc.in/2KlFEPY\xc2\xa0'
b'Australia seeks to curb foreign interference in universitieshttps://bbc.in/2NJcb4i\xc2\xa0'
b'Washington teacher arrested for threatening to shoot studentshttps://bbc.in/2KlUk1o\xc2\xa0'
b'Denmark holds neo-Nazi over Jewish cemetery attackhttps://bbc.in/2pjrcR9\xc2\xa0'
b'Manus Island refugee author Behrouz Boochani arrives in New Zealandhttps://bbc.in/2NMI4cs\xc2\xa0'
b'Italy to declare state of emergency over damage from Venice floodshttp://bbc.in/2OdDoeu\xc2\xa0'
b'Condor Ferries bought by Swedish investment fundhttps://bbc.in/2NLw0rT\xc2\xa0'
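How do I remove the "b"? And if a given tweet contains a link, like all of these, how do I remove that URL?
Also, why do strings of digits and letters sometimes appear, and how can I fix or remove them?
Solution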
To remove the b's, you'd want to do something like:
str_tweet = tweet_text.decode('utf-8')
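As background for the decode step: the b'...' wrapper appears because an f-string formats a bytes object using its repr, and sequences such as \xc2\xa0 or \xc2\xa3 are the escaped UTF-8 bytes of ordinary characters (a non-breaking space and a pound sign), not random data. A minimal sketch, using one line copied from the output in the question; the variable names are illustrative only:

```python
# One line from the question's output; this is what .encode('utf8') produces.
raw = b'Construction firm fined \xc2\xa310k for Jersey water pollutionhttps://bbc.in/2KgIk19\xc2\xa0'

# Formatting bytes inside an f-string yields its repr, including the b'' wrapper.
as_bytes_repr = f'{raw}'

# Decoding turns the bytes back into a normal str, so \xc2\xa3 becomes a pound sign.
as_text = raw.decode('utf-8')

print(as_bytes_repr[:2])
print(as_text)
```

It may be simpler to drop the .encode('utf8') call from the original script so that tweet_text stays a str, and to open the output file with open(file, "w", encoding="utf-8"); that is a suggestion rather than part of the original answer.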
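To get rid of the hyperlinks at the end you could do something like this, which is quick and dirty (it keeps only the text before the first 'https://', so a plain 'http://' link, as in the Venice floods tweet, is left in place):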
only_tweet = str_tweet.split('https://')[0]
And then of course change your write statement to point to the new variable. This will result in output like:
'Van crash in south-east Iran kills 28 Afghan nationals'
instead of
b'Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\xc2\xa0'
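If the split approach proves too coarse, a regular expression is one possible alternative. The sketch below is not from the original answer: the pattern and the clean_tweet helper are assumptions made for illustration. It removes both http:// and https:// links and trims the trailing non-breaking space (\xa0) that remains after decoding.

```python
import re

# Hypothetical pattern: "http://" or "https://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r'https?://\S+')

def clean_tweet(text):
    """Return text with URLs removed and surrounding whitespace trimmed."""
    without_urls = URL_PATTERN.sub('', text)
    # Replace non-breaking spaces with regular spaces, then trim both ends.
    return without_urls.replace('\xa0', ' ').strip()

# Sample line copied from the question's output (note the http:// link).
sample = b'Italy to declare state of emergency over damage from Venice floodshttp://bbc.in/2OdDoeu\xc2\xa0'
print(clean_tweet(sample.decode('utf-8')))
```

Unlike split('https://')[0], this keeps any text that follows a link. It is still a simple heuristic and will not recognise links written without a scheme.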