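python - Extract all strings between every matching pair of two delimiters using a regex or a Python function
Problem description
I am trying to use a regex or a Python function to extract all of the bold text, i.e. the text between ' and <=.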
"[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, 0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = 83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= 0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, 67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, 'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, 0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = 15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = 0.6%\nvalue = [0.0, 1.0]\nclass = True News')]"
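The closest I have come so far is (?=')(.*)(?= <=), but it does not give me what I want.
Can anyone tell me how to extract the bold text between the single quote and <=?
The solution does not have to use a regex.
Thanks!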
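Solution
This regex works. It uses a named group, so it is easy to refer to exactly the data you want. The pattern looks for a run of word characters (letters, digits, or underscores) followed by " <=". We then use finditer to get all of the matches.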
import re
data = "[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, 0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = 83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= 0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, 67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, 'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, 0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = 15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = 0.6%\nvalue = [0.0, 1.0]\nclass = True News')]"
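# a run of word characters immediately followed by " <="
# (\w already includes digits, so [\w\d] is redundant)
fmt = re.compile(r'(?P<info>\w+) <=')
for m in fmt.finditer(data):
    print(m.group('info'))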
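As for why the pattern in the question falls short: (?=') is a lookahead, so the quote is not consumed and ends up inside the match, and the greedy .* can start at an earlier quote on the same line. Since a regex was not required, here is a minimal sketch of two alternatives: a lookbehind with a lazy match, and a plain str.split version. The shortened sample string and the helper name words_before_le are assumptions made for illustration, not part of the original answer.

```python
import re

# Shortened sample in the same shape as the data above (an assumption for brevity).
sample = ("[Text(447.1, 471.6, 'the <= 0.5\nentropy = 0.97'), "
          "Text(59.6, 67.3, '\n  (...)  \n'), "
          "Text(238.4, 336.8, 'donald <= 0.5\nentropy = 0.921')]")

# Lookbehind for the quote, lazy match, lookahead for " <=".
# [^'\n]*? keeps the match from running across quotes or line breaks.
lazy = re.findall(r"(?<=')[^'\n]*?(?= <=)", sample)

def words_before_le(text):
    """No regex: split on single quotes and keep the text before ' <='."""
    return [part.split(' <=', 1)[0] for part in text.split("'") if ' <=' in part]

plain = words_before_le(sample)

print(lazy)   # ['the', 'donald']
print(plain)  # ['the', 'donald']
```

The split version relies on single quotes appearing only as delimiters around each entry, which holds for the data shown here but may not hold for other input.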
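If you want to go the whole nine yards, the code below parses the entire string into namedtuples that mostly mirror the format of the text. I don't know what the first 2 values represent, so I just call them x and y. I went this far because what you asked for doesn't seem very useful on its own, and I suspect this question is only a precursor to extracting more of the data. This approach pinpoints all of the data. Any entry whose data is \n (...) \n is printed as "empty" and is not stored in the entries list.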
import re
from collections import namedtuple
data = "[Text(447.1153846153846, 471.625, 'the <= 0.5\nentropy = 0.97\nsamples = 100.0%\nvalue = [0.399, 0.601]\nclass = True News'), Text(238.46153846153845, 336.875, 'donald <= 0.5\nentropy = 0.921\nsamples = 83.7%\nvalue = [0.336, 0.664]\nclass = True News'), Text(119.23076923076923, 202.125, 'hillary <= 0.5\nentropy = 0.981\nsamples = 55.6%\nvalue = [0.42, 0.58]\nclass = True News'), Text(59.61538461538461, 67.375, '\n (...) \n'), Text(178.84615384615384, 67.375, '\n (...) \n'), Text(357.6923076923077, 202.125, 'hillary <= 0.5\nentropy = 0.663\nsamples = 28.2%\nvalue = [0.172, 0.828]\nclass = True News'), Text(298.0769230769231, 67.375, '\n (...) \n'), Text(417.30769230769226, 67.375, '\n (...) \n'), Text(655.7692307692307, 336.875, 'trumps <= 0.5\nentropy = 0.859\nsamples = 16.3%\nvalue = [0.718, 0.282]\nclass = Fake News'), Text(596.1538461538462, 202.125, 'hillary <= 0.5\nentropy = 0.821\nsamples = 15.7%\nvalue = [0.744, 0.256]\nclass = Fake News'), Text(536.5384615384615, 67.375, '\n (...) \n'), Text(655.7692307692307, 67.375, '\n (...) \n'), Text(715.3846153846154, 202.125, 'entropy = 0.0\nsamples = 0.6%\nvalue = [0.0, 1.0]\nclass = True News')]"
# regex to describe one overall Text(...) entry
entfmt = re.compile(r"Text\((?P<x>[\d.]+), (?P<y>[\d.]+), '(?P<data>[^']+)'\)", re.I|re.S)

# format all of the float groups:
# flt is a repeatable chunk, so this part of the expression is built in a loop;
# all this really does is make the final datfmt regex shorter.
# Raw strings are used so that \d, \[ and \n reach the regex engine unchanged.
flt = r'{}(?P<{}>[\d.]+)'
args = (r'_fval', r'\nentropy = _ent', r'\nsamples = _samp', r'%\nvalue = \[_lval', r', _rval')
fltreg = ''.join(flt.format(a, b) for a, b in (arg.split('_') for arg in args))

# regex to describe the data portion of an entry
datfmt = re.compile(r'(?P<focus>\w+) <= {}\]\nclass = (?P<class>.+)'.format(fltreg), re.I|re.S)

# container for individual entries
entries = []

# entry descriptor
Entry = namedtuple('Entry', 'x y focus fvalue entropy samples value cls')

# find all entries; c is the entry index
for c, m in enumerate(entfmt.finditer(data)):
    # x and y are present in every entry
    x, y = float(m.group('x')), float(m.group('y'))

    # parse the data portion of this entry
    m2 = datfmt.match(m.group('data'))

    # make sure this was not an empty entry
    if m2:
        # append entry
        entries.append(Entry(x, y,
                             m2.group('focus'),
                             float(m2.group('fval')),
                             float(m2.group('ent')),
                             float(m2.group('samp')),
                             [float(m2.group('lval')), float(m2.group('rval'))],
                             m2.group('class')))
    else:
        # entry has no "<word> <= <value>" data: a '\n (...) \n' placeholder,
        # or the final entry, which has no '<=' test at all
        print('Data[{}] with [x:{}, y:{}] is empty'.format(c, x, y))

# print all entries
print(*entries, sep='\n')
#Entry(x=447.1153846153846 , y=471.625, focus='the' , fvalue=0.5, entropy=0.97 , samples=100.0, value=[0.399, 0.601], cls='True News')
#Entry(x=238.46153846153845, y=336.875, focus='donald' , fvalue=0.5, entropy=0.921, samples=83.7 , value=[0.336, 0.664], cls='True News')
#Entry(x=119.23076923076923, y=202.125, focus='hillary', fvalue=0.5, entropy=0.981, samples=55.6 , value=[0.42 , 0.58 ], cls='True News')
#Entry(x=357.6923076923077 , y=202.125, focus='hillary', fvalue=0.5, entropy=0.663, samples=28.2 , value=[0.172, 0.828], cls='True News')
#Entry(x=655.7692307692307 , y=336.875, focus='trumps' , fvalue=0.5, entropy=0.859, samples=16.3 , value=[0.718, 0.282], cls='Fake News')
#Entry(x=596.1538461538462 , y=202.125, focus='hillary', fvalue=0.5, entropy=0.821, samples=15.7 , value=[0.744, 0.256], cls='Fake News')
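
Once the data is in namedtuples, pulling out the originally requested words (or anything else) becomes an attribute lookup. The following is a small sketch of that idea; it rebuilds three of the entries from the output above by hand so that it runs on its own, and the variable names focus_words, by_class, fake_words and purest are illustrative assumptions.

```python
from collections import Counter, namedtuple

Entry = namedtuple('Entry', 'x y focus fvalue entropy samples value cls')

# Three entries copied from the output above so the sketch is self-contained.
entries = [
    Entry(447.1153846153846, 471.625, 'the', 0.5, 0.97, 100.0, [0.399, 0.601], 'True News'),
    Entry(238.46153846153845, 336.875, 'donald', 0.5, 0.921, 83.7, [0.336, 0.664], 'True News'),
    Entry(655.7692307692307, 336.875, 'trumps', 0.5, 0.859, 16.3, [0.718, 0.282], 'Fake News'),
]

# The words the question asked for are simply the focus field of each entry.
focus_words = [e.focus for e in entries]

# Count entries per class label.
by_class = Counter(e.cls for e in entries)

# Filter on any field, e.g. only the entries labelled 'Fake News'.
fake_words = [e.focus for e in entries if e.cls == 'Fake News']

# Entry with the lowest entropy among those listed.
purest = min(entries, key=lambda e: e.entropy)

print(focus_words)
print(dict(by_class))
print(fake_words, purest.focus)
```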