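python - How to split a string at a certain point without knowing its position in Python
Question
I'm currently pulling the weather forecast from the TFL API. Once the json for "today's forecast" has been extracted, random symbols show up in the middle of the paragraph. I think they may be formatting that comes from the API.
This is what gets extracted: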
Bank holiday Monday will stay dry with some long sunny spells. Temperatures will remain warm for the time of year.<br/><br/>PM2.5 particle pollution increased rapidly overnight. Increases began across Essex and spread across south London. Initial chemical analysis suggests that this is composed mainly of wood burning particles but also with some additional particle pollution from agriculture and traffic. This would be consistent with an air flow from the continent where large bonfires are part of the Easter tradition. This will combine with our local emissions today and 'high' PM2.5 is possible.<br/><br/>The sunny periods, high temperatures and east winds will bring additional ozone precursors allowing for photo-chemical generation of ozone to take place. Therefore 'moderate' ozone is likely.<br/><br/>Air pollution should remain 'Low' through the forecast period for the following pollutants:<br/><br/>Nitrogen Dioxide<br/>Sulphur Dioxide.
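This paragraph is more detailed than necessary; the first two sentences are all I need. I thought .split would be a good idea, running it through a for loop until it reaches the string "<br/><br/>PM2.5".
However, I can't be sure that this will be the same string every day, or that the simplified forecast will always be just the first two sentences.
Does anyone have any ideas on how I could approach this?
For reference, this is the code I currently have; it isn't part of anything else yet.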
import urllib.parse
import requests
main_api = "https://api.tfl.gov.uk/AirQuality?"
idno = "1"
url = main_api + urllib.parse.urlencode({"$id": idno})
# Request the URL that includes the encoded query parameters
json_data = requests.get(url).json()
disclaimer = json_data['disclaimerText']
print("Disclaimer: " + disclaimer)
print()
today_weather = json_data['currentForecast'][0]['forecastText']
print("Today's forecast: " + today_weather.replace("<br/><br/>"," "))
Answer
I believe that if you strip out the HTML tags and then tokenize the paragraph with NLTK's sentence tokenizer, you should be good to go.
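# sent_tokenize needs the Punkt data; run nltk.download('punkt') once beforehand
# (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk.tokenize import sent_tokenize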
import urllib.parse
import requests
import re
main_api = "https://api.tfl.gov.uk/AirQuality?"
idno = "1"
url = main_api + urllib.parse.urlencode({"$id": idno})
# Request the URL that includes the encoded query parameters
json_data = requests.get(url).json()
disclaimer = json_data['disclaimerText']
print("Disclaimer: " + disclaimer)
print()
# Clean out HTML tags
today_weather_str = re.sub(r'<.*?>', '', json_data['currentForecast'][0]['forecastText'])
# Get the first two sentences out of the list
today_weather = ' '.join(sent_tokenize(today_weather_str)[:2])
print("Today's forecast: {}".format(today_weather))
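If you would rather not install NLTK, here is a rough standard-library sketch of the same idea. It assumes the short summary is always the text before the first <br/> tag, which matches the sample above but is not something the API guarantees. The function name first_paragraph and its max_sentences parameter are made up for this illustration.

```python
import html
import re


def first_paragraph(forecast_html, max_sentences=2):
    """Return up to max_sentences sentences from the text before the first <br> tag."""
    # Everything before the first <br>, <br/> or <br /> is treated as the summary.
    paragraph = re.split(r'<br\s*/?>', forecast_html, maxsplit=1, flags=re.IGNORECASE)[0]
    # Drop any remaining tags and decode entities such as &amp;.
    paragraph = html.unescape(re.sub(r'<[^>]+>', '', paragraph)).strip()
    # Naive sentence split: break after '.', '!' or '?' when whitespace follows.
    sentences = re.split(r'(?<=[.!?])\s+', paragraph)
    return ' '.join(sentences[:max_sentences])


sample = ("Bank holiday Monday will stay dry with some long sunny spells. "
          "Temperatures will remain warm for the time of year.<br/><br/>"
          "PM2.5 particle pollution increased rapidly overnight.")
print(first_paragraph(sample))
```

The regex sentence split is much cruder than NLTK's tokenizer: it will break after abbreviations such as "e.g. ", so sent_tokenize is the safer choice if the forecast wording varies a lot.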