Python: detect a URL and remove it from a text file

Problem description

I took an existing Python script and edited it to my liking so that it writes the first 20 tweets from a specific page to a text file.

from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "tweets.txt"
f = open(file, "w")
url = "https://twitter.com/BBCWorld"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# Gets the tweet
tweets = soup.find_all("li", attrs = {"class":"js-stream-item"})

# Writes tweet fetched in file
for tweet in tweets:
    try:
        if tweet.find('p', {'class': 'tweet-text'}):
            tweet_text = tweet.find('p', {'class': 'tweet-text'}).text.encode('utf8').strip()
            # tweet_user = tweet.find('span',{"class":'username'}).text.strip()
            # replies = tweet.find('span',{"class":"ProfileTweet-actionCount"}).text.strip()
            # retweets = tweet.find('span', {"class" : "ProfileTweet-action--retweet"}).text.strip()
            # String interpolation technique
            f.write(f'{tweet_text}\n')
    except AttributeError:
        # Skip list items that lack the expected markup
        pass
f.close()

However, when the tweets are written to the file, they look like this (I'm using BBCWorld's feed as an example):

b'Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\xc2\xa0' 

b'Guernsey asbestos cancer compensation scheme to launchhttps://bbc.in/2qQD9OE\xc2\xa0'

b'Construction firm fined \xc2\xa310k for Jersey water pollutionhttps://bbc.in/2KgIk19\xc2\xa0'

b'US election 2020: Deval Patrick announces presidential bidhttps://bbc.in/32QbdHH\xc2\xa0'

b'Knottfield: Joseph Marshall indecent assault trial delayedhttps://bbc.in/2XcXYjg\xc2\xa0'

b"Hugo Carvajal: Venezuelan ex-spy chief's disappearance 'a scandal'https://bbc.in/34VIwdY\xc2\xa0"

b'What fate awaits those former members of Islamic State being expelled from Turkey?https://www.bbc.co.uk/news/50396607\xc2\xa0'

b"Notre Dame: Army general tells architect to 'shut his mouth'https://bbc.in/2qVP7pX\xc2\xa0"

b'Six years after a Boeing 737-500 crashed in Kazan, Russian investigators conclude that the pilot wasn\xe2\x80\x99t qualified to fly the plane & had used falsified documents to get his job with (now defunct) Tatarstan Airlines. 50 people were killed.'

b'South Africa rugby stars strip off for cancer challengehttps://bbc.in/2rJF2Nv\xc2\xa0'

b"Diabetes: UN to tackle 'overly expensive' insulin priceshttps://bbc.in/2Op0nUf\xc2\xa0"

b'US Senator blocks move to say Armenian mass killing was genocidehttps://bbc.in/2QfjjHr\xc2\xa0'

b"Turkey to extradite American IS suspect 'stranded on border'https://bbc.in/33OJ1X1\xc2\xa0"

b'Father and daughter ballet video breaks stereotypes, says teacherhttps://bbc.in/2KlFEPY\xc2\xa0'

b'Australia seeks to curb foreign interference in universitieshttps://bbc.in/2NJcb4i\xc2\xa0'

b'Washington teacher arrested for threatening to shoot studentshttps://bbc.in/2KlUk1o\xc2\xa0'

b'Denmark holds neo-Nazi over Jewish cemetery attackhttps://bbc.in/2pjrcR9\xc2\xa0'

b'Manus Island refugee author Behrouz Boochani arrives in New Zealandhttps://bbc.in/2NMI4cs\xc2\xa0'

b'Italy to declare state of emergency over damage from Venice floodshttp://bbc.in/2OdDoeu\xc2\xa0'

b'Condor Ferries bought by Swedish investment fundhttps://bbc.in/2NLw0rT\xc2\xa0'

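How do I remove the "b"? And if a particular tweet contains a link, as all of these do, how do I remove the URL?

Also, why does a string of digits and letters sometimes appear, and how do I fix or remove those?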

Tags: python, python-3.x, twitter

Solution


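The b'...' prefix appears because .encode('utf8') turns the text into a bytes object, and the f-string then writes that object's repr. The \xc2\xa0 sequences are the escaped UTF-8 bytes of a non-breaking space at the end of each tweet (likewise, \xc2\xa3 is the pound sign). To remove the b's, you'd want to do something like: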

str_tweet = tweet_text.decode('utf-8')
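
As a small illustration of where the prefix comes from, the sketch below hard-codes one of the sample tweets from the question instead of scraping it. The names raw_text and as_written are illustrative and not part of the original script. It assumes the scraped text ends in a non-breaking space, which is what the trailing \xc2\xa0 in the output suggests.

```python
# One sample tweet as a str, ending in a non-breaking space (U+00A0).
raw_text = "Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\u00a0"

# What the question's script does: encode to bytes, then format into an f-string.
# bytes.strip() only removes ASCII whitespace, so the encoded U+00A0 stays.
tweet_text = raw_text.encode("utf8").strip()
as_written = f"{tweet_text}"  # the repr of the bytes object, b prefix included

# Decoding turns the bytes back into ordinary text.
str_tweet = tweet_text.decode("utf-8")

print(as_written)
print(str_tweet)
```

Dropping the .encode('utf8') call from the script altogether would have the same effect, since the text is then never converted to bytes.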

To get rid of the hyperlinks at the end you could do something like this, which is quick and dirty:

only_tweet = str_tweet.split('https://')[0]
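
Note that splitting on 'https://' leaves links that start with http:// in place (the Venice floods tweet in the question has one) and also keeps the trailing non-breaking space on tweets without a link. A slightly more general sketch, assuming the links contain no whitespace, uses a regular expression. The helper name strip_urls is hypothetical.

```python
import re

# Matches http:// or https:// followed by a run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs, then trim surrounding whitespace, non-breaking spaces included."""
    return URL_PATTERN.sub("", text).strip()


samples = [
    "Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\u00a0",
    "Italy to declare state of emergency over damage from Venice floodshttp://bbc.in/2OdDoeu\u00a0",
]

cleaned = [strip_urls(s) for s in samples]
for line in cleaned:
    print(line)
```

Unlike the split approach, this also keeps any text that follows a link in the middle of a tweet.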

And then of course change your write statement to point to the new variable. This will result in output like:

'Van crash in south-east Iran kills 28 Afghan nationals'

instead of

b'Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\xc2\xa0'
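
An alternative that avoids the decode step is to never encode in the first place: keep each tweet as a str and let the file object handle the encoding. The sketch below is only an illustration under a few assumptions: a hard-coded list stands in for the scraped tweets, and the file is written to a temporary directory instead of tweets.txt in the working directory.

```python
import os
import tempfile

# Stand-ins for tweet text that has already been extracted and cleaned.
tweets = [
    "Construction firm fined \u00a310k for Jersey water pollution",
    "Condor Ferries bought by Swedish investment fund",
]

path = os.path.join(tempfile.mkdtemp(), "tweets.txt")

# In text mode with an explicit encoding, str values can be written directly:
# no .encode() call is needed, so no b'...' prefix ends up in the file.
with open(path, "w", encoding="utf-8") as f:
    for tweet_text in tweets:
        f.write(f"{tweet_text}\n")

# Read the file back to check what was written.
with open(path, encoding="utf-8") as f:
    written = f.read()

print(written)
```

Passing encoding="utf-8" explicitly keeps characters such as the pound sign intact regardless of the platform's default encoding.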

