How do I extract all sentences of the review/text entries from the following text?

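Problem description

Here I want to extract the review/text entries, but my code only extracts a small part of each one. This is the output:

<re.Match object; span=(226, 258), match='review/text: I like Creme Brulee'>
<re.Match object; span=(750, 860), match='review/text: not what I was expecting >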

import re

text='''
'product/productId: B004K2IHUO\n',
 'review/userId: A2O9G2521O626G\n',
 'review/profileName: Rachel Westendorf\n',
 'review/helpfulness: 0/0\n',
 'review/score: 5.0\n',
 'review/time: 1308700800\n',
 'review/summary: The best\n',
 'review/text: I like Creme Brulee. I loved that these were so easy. Just sprinkle on the sugar that came with and broil. They look amazing and taste great. My guess thought I really went out of the way for them when really it took all of 5 minutes. I will be ordering more!\n',
 '\n',
 'product/productId: B004K2IHUO\n',
 'review/userId: A1ZKFQLHFZAEH9\n',
 'review/profileName: S. J. Monson "world citizen"\n',
 'review/helpfulness: 2/8\n',
 'review/score: 3.0\n',
 'review/time: 1236384000\n',
 'review/summary: disappointing\n',
 "review/text: not what I was expecting in terms of the company's reputation for excellent home delivery products\n",
 '\n',
'''

pattern=re.compile(r'review/text:\s[^.]+')
matches=pattern.finditer(text)

for match in matches:
  print(match)

Tags: python, python-3.x, regex

Solution


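If you don't mind not using re, the identifier is always 'review/text', and your entries are always separated by commas, you can get those lines with a simple list comprehension. Note that this relies on no review containing a comma itself, since such a review would be split into several pieces.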

matches = [s.strip() for s in text.split(',') if s.strip(' "\n\'').startswith('review/text')]
for match in matches:
  print(match)

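Here, s.strip(' "\n\'') removes spaces, double quotes, single quotes, and newline characters from the start and end of each piece so that startswith compares against the bare text. This returns the following two lines: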

'review/text: I like Creme Brulee. I loved that these were so easy. Just sprinkle on the sugar that came with and broil. They look amazing and taste great. My guess thought I really went out of the way for them when really it took all of 5 minutes. I will be ordering more!
'
"review/text: not what I was expecting in terms of the company's reputation for excellent home delivery products
"
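If you would rather keep the regex approach from the question, the following is a minimal sketch rather than a definitive fix. The original pattern stops early because `[^.]+` means "one or more characters that are not a period", so each match ends just before the first `.` in the review. The sample `text` below is a shortened stand-in for the question's data, and the variable names `truncated` and `reviews` are chosen here for illustration.

```python
import re

# Shortened sample in the same shape as the question's data (illustrative only).
text = '''
 'review/summary: The best\n',
 'review/text: I like Creme Brulee. I loved that these were so easy.\n',
 '\n',
 'review/summary: disappointing\n',
 "review/text: not what I was expecting\n",
'''

# Original pattern: [^.]+ cannot cross a period, so the first review is cut
# off at the end of its first sentence.
truncated = re.findall(r'review/text:\s[^.]+', text)

# Alternative: '.' does not match a newline by default, so '.*' runs to the
# end of the line, periods included. The group captures only the review body.
reviews = re.findall(r'review/text:[ \t]*(.*)', text)

print(truncated[0])
for review in reviews:
    print(review)
```

This works here because each `\n` escape inside the triple-quoted string becomes a real newline, which ends the `.*` match before the closing quote. A review that spans several lines would need a different pattern.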
