How can I extract which sentences of a string have changed or are absent from another string?

Problem description

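I am fetching continuous updates from a website. Each time I run my script I get an old_string, which is the text currently stored in my database, and a new_string, which contains the current body text fetched from the site.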

Is there a clever way to check which sentences in new_string are not in old_string? I want to find the latest updates/changes and store them in newest_updates.

In the example below, I use --> x <-- to mark new or modified text:

old_string = 
"Inbound restrictions:
The country’s airports closed to international flights on 18 March and will remain closed until 1 
April. The land and sea borders at this time remain open.
Travellers coming from Brazil, China, Dominican Republic, French Guiana, Italy, Iran, Jamaica, Japan, 
Malaysia, Panama, Singapore, South Korea, St Vincent, Thailand and the US should anticipate increased 
screenings upon arrival. There is also a possibility that these individuals would be denied entry 
into the country, according to government officials.
There are currently no known restrictions on individuals seeking to depart the country."

new_string = 
"Inbound restrictions:
The country’s airports closed to international flights on 18 March and will remain closed until -->5 
April<--. The land and sea borders at this time remain open.
Travellers coming from Brazil, China, Dominican Republic, French Guiana, Italy, Iran,-->Sweden<--, Jamaica, Japan, 
Malaysia, Panama, Singapore, South Korea, St Vincent, Thailand and the US should anticipate increased 
screenings upon arrival. There is also a possibility that these individuals would be denied entry 
into the country, according to government officials.
There are currently no known restrictions on individuals seeking to depart the country.-->

Outbound restrictions:
There are currently no known restrictions on individuals seeking to depart the country.<--"

The expected output would be:

 newest_updates = "The country’s airports closed to international flights on 18 March and will remain 
 closed until 5 April. 

 Travellers coming from Brazil, China, Dominican Republic, French Guiana, Italy, Iran,Sweden, 
 Jamaica, Japan, Malaysia, Panama, Singapore, South Korea, St Vincent, Thailand and the US should 
 anticipate increased screenings upon arrival

 Outbound restrictions:
 There are currently no known restrictions on individuals seeking to depart the country."

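What is the best way to do this? One suggestion was to use difflib. But with difflib I also pick up every sentence that is common to both strings, even when nothing has changed.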

Tags: python, nlp

Solution


I would try it with an "in" test:

First, split the new string at the end of each sentence:

new_strings = new_string.split(".")

Then look for the sentences that do not occur in the old string:

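newest_updates = ""
for sentence in new_strings:
    # keep only sentences that do not occur verbatim in the old text
    if sentence not in old_string:
        # put back the period that split(".") removed
        newest_updates += sentence + "."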

Now you should have a variable containing all the updates.
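Because the example texts are wrapped across lines, a plain split(".") followed by a substring test can be thrown off by line breaks and leading spaces. The following is a minimal sketch of the same "in" idea with whitespace normalized first. The function name find_new_sentences, the regular expression used as a sentence splitter, and the shortened sample strings are assumptions made for illustration, not part of the original answer.

```python
import re


def find_new_sentences(old_text: str, new_text: str) -> list:
    """Return the sentences of new_text that do not occur in old_text."""

    def normalize(text: str) -> str:
        # Collapse line breaks and repeated spaces so that differences in
        # wrapping between the two texts are not reported as changes.
        return " ".join(text.split())

    old_norm = normalize(old_text)
    # Split after '.', '!' or '?' followed by whitespace. This is a rough
    # heuristic: abbreviations such as "St. Vincent" would be split too.
    sentences = re.split(r"(?<=[.!?])\s+", normalize(new_text))
    # Keep non-empty sentences that are not found verbatim in the old text.
    return [s for s in sentences if s and s not in old_norm]


# Shortened, illustrative sample data (hypothetical, modelled on the question).
old_string = "Airports remain closed until 1 April. The borders remain open."
new_string = (
    "Airports remain closed until 5 April. The borders remain open.\n"
    "Outbound: none."
)

newest_updates = find_new_sentences(old_string, new_string)
print(newest_updates)
```

Returning a list rather than one concatenated string keeps the sentence boundaries, and " ".join(newest_updates) gives a single string if that is what gets stored. Note that a heading without a period, such as "Outbound restrictions:", ends up attached to the sentence that follows it, so that sentence is reported as new even if its wording already appears in the old text.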

