How to split a string at a certain point without knowing the position, in Python

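Problem description

I'm currently pulling the weather forecast from the TFL API. Once the JSON for today's forecast is extracted, stray symbols appear in the middle of the paragraph; I think this may be formatting coming from the API.

This is what gets extracted: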

Bank holiday Monday will stay dry with some long sunny spells. Temperatures will remain warm for the time of year.<br/><br/>PM2.5 particle pollution increased rapidly overnight. Increases began across Essex and spread across south London.  Initial chemical analysis suggests that this is composed mainly of wood burning particles but also with some additional particle pollution from agriculture and traffic. This would be consistent with an air flow from the continent where large bonfires are part of the Easter tradition. This will combine with our local emissions today and 'high' PM2.5 is possible.<br/><br/>The sunny periods, high temperatures and east winds will bring additional ozone precursors allowing for photo-chemical generation of ozone to take place. Therefore 'moderate' ozone is likely.<br/><br/>Air pollution should remain 'Low' through the forecast period for the following pollutants:<br/><br/>Nitrogen Dioxide<br/>Sulphur Dioxide.

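This paragraph is more detailed than necessary; the first two sentences are all I need. I thought .split would be a good idea, running it through a for loop until it reaches the string "<br/><br/>PM2.5".
However, I can't be sure that this will be the same string every day, or that the simplified forecast will still be just the first two sentences.

Does anyone have any ideas on how I could solve this?

For reference, this is the code I currently have; it isn't part of anything else yet.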

import urllib.parse
import requests

main_api = "https://api.tfl.gov.uk/AirQuality?"

idno = "1"
url = main_api + urllib.parse.urlencode({"$id": idno})

# Request the URL built above (the original passed main_api, leaving url unused)
json_data = requests.get(url).json()

disclaimer = json_data['disclaimerText']
print("Disclaimer: " + disclaimer)

print()

today_weather = json_data['currentForecast'][0]['forecastText']
print("Today's forecast: " + today_weather.replace("<br/><br/>"," "))

Tags: python, python-3.x, string, api

Solution

I believe that if you strip out the HTML tags and then tokenize the paragraph with NLTK's sentence tokenizer, you should be good to go.

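# sent_tokenize needs the NLTK Punkt data, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk.tokenize import sent_tokenize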

import urllib.parse
import requests
import re

main_api = "https://api.tfl.gov.uk/AirQuality?"

idno = "1"
url = main_api + urllib.parse.urlencode({"$id": idno})

# Request the URL built above (the original passed main_api, leaving url unused)
json_data = requests.get(url).json()

disclaimer = json_data['disclaimerText']
print("Disclaimer: " + disclaimer)

print()

# Replace HTML tags with a space, so the text on either side of a <br/> does not run together
today_weather_str = re.sub(r'<.*?>', ' ', json_data['currentForecast'][0]['forecastText'])

# Get the first two sentences out of the list
today_weather = ' '.join(sent_tokenize(today_weather_str)[:2])

print("Today's forecast: {}".format(today_weather))
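
If you would rather not install NLTK, something along these lines might be enough. It is only a rough sketch using the standard library, and it assumes the forecast text contains nothing but `<br/>`-style tags and that sentences end with `.`, `!` or `?` followed by whitespace. The names `BR_TAG`, `first_paragraph` and `first_sentences` are made up for this example, and the sample string is a shortened copy of the text from the question.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
# Assumption: these are the only tags that appear in the forecast text.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def first_paragraph(forecast_html: str) -> str:
    """Return the text before the first <br> tag, with outer whitespace removed."""
    return BR_TAG.split(forecast_html, maxsplit=1)[0].strip()


def first_sentences(text: str, count: int = 2) -> str:
    """Return the first `count` sentences using a naive end-of-sentence rule."""
    # Turn tags into spaces so neighbouring sentences stay separated.
    plain = BR_TAG.sub(" ", text).strip()
    # Split after '.', '!' or '?' only when whitespace follows,
    # so a token such as "PM2.5" is not cut in half.
    sentences = re.split(r"(?<=[.!?])\s+", plain)
    return " ".join(sentences[:count])


sample = (
    "Bank holiday Monday will stay dry with some long sunny spells. "
    "Temperatures will remain warm for the time of year.<br/><br/>"
    "PM2.5 particle pollution increased rapidly overnight."
)

print(first_paragraph(sample))
print(first_sentences(sample))
```

Taking everything before the first `<br/>` does not depend on the summary being exactly two sentences long, which may matter if the forecast format changes from day to day. The sentence-count version is closer to the NLTK approach, but it will trip over abbreviations such as "e.g." that a trained tokenizer handles better.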
