python - Splitting unicode strings in a generic list
Problem description
So my data looks like this:
data = {"technology1": [
    [
        20, 0.02,
        u'10.00,106.10,107.00,107.00,0.45',
        u'24.00,-47.15,-49.50,-51.00,0.12',
        u'11.00,0.35,0.00,0.00,0.92',
        u'0.00', 0.04, 0.16, u'0.223196881092', u'f', 0.02,
    ],
    [
        100, 0.02,
        u'10.00,106.10,107.00,107.00,0.45',
        u'24.00,-47.15,-49.50,-51.00,0.12',
        u'11.00,0.35,0.00,0.00,0.92', u'0.00', 0.04,
        0.16, u'0.223196881092', u'f', 0.01
    ] ... ],
    "technology2": ...}
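As you can see, it is a dictionary in which each key maps to a list of lists, and all of the lists have the same format. Each "inner" list contains a mix of integers and floats. There are also some unicode strings: some hold a single value, and some hold a set of 5 numbers inside one unicode string.
What I want:
Make an array for each technology. In each array, the rows would be the "outer" lists above, and the columns would be the individual elements of the "inner" lists. Ideally, the unicode values should be converted to strings (since I know better how to work with those), and each set of 5 numbers in a unicode string needs to be expanded into one element per number.
i.e. the array for technology1: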
20, 0.02, 10.00, 106.10, ... "f", 0.02
100, 0.02, ... "f", 0.01
Attempts so far:
for tech in data:
    features = data[tech]  # i.e. grab technologyn
    for row in features:
        for i in row[2:5]:  # 2 til 5 defines the instance which are sets of 5
            #print i,"\n"
            i = str(i)
            i = i.split(',')
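This doesn't work: when I inspect features after the code has run, it is exactly the same!
This isn't an attempt at a complete solution, since it obviously won't convert all of the unicode values to strings, but it is a stepping stone. I also tried using a list comprehension like this: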
for row in features:
    [i.split(',') for i in row if (type(i)==unicode and "," in i)]
Solution
You need to create a new list object for each row, then replace the original list values:
def row_to_values(row):
    values = []
    for col in row:
        if isinstance(col, unicode) and col != u'f':
            # split and convert all entries to float
            values += (float(v) for v in col.split(','))
        else:
            values.append(col)
    return values

for value in data.values():
    value[:] = [row_to_values(row) for row in value]
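The value[:] = ... assignment tells Python to replace all of the indices contained in the list object with a new set of objects. Because each value is one of the outer lists in the data dictionary, this replaces all of the sublists with the expanded rows.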
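To illustrate the difference, here is a minimal sketch that contrasts rebinding a loop variable (the pattern used in the attempt in the question) with in-place slice assignment. It is written in Python 3 syntax, and the dictionary `demo` is a made-up miniature of the data; both are assumptions for illustration only.

```python
# Hypothetical miniature of the data; the names are illustrative only.
demo = {"tech": [["1,2", 3], ["4,5", 6]]}

# Rebinding the loop variable only changes what `item` refers to;
# the row that produced it is left untouched.
for rows in demo.values():
    for row in rows:
        for item in row:
            item = str(item).split(",")

unchanged = [list(row) for row in demo["tech"]]  # snapshot: same as before

# Slice assignment replaces the contents of the existing list object,
# so every reference to that list (including the dict entry) sees the change.
outer = demo["tech"]
outer[:] = [[str(item) for item in row] for row in outer]

same_object = demo["tech"] is outer
```

After this runs, `unchanged` still holds the original rows, while `demo["tech"]` holds the converted rows and is still the same list object as `outer`.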
Demo with part of your sample data:
>>> data = {"technology1": [
...     [
...         20, 0.02,
...         u'10.00,106.10,107.00,107.00,0.45',
...         u'24.00,-47.15,-49.50,-51.00,0.12',
...         u'11.00,0.35,0.00,0.00,0.92',
...         u'0.00', 0.04, 0.16, u'0.223196881092', u'f', 0.02,
...     ],
...     [
...         100, 0.02,
...         u'10.00,106.10,107.00,107.00,0.45',
...         u'24.00,-47.15,-49.50,-51.00,0.12',
...         u'11.00,0.35,0.00,0.00,0.92', u'0.00', 0.04,
...         0.16, u'0.223196881092', u'f', 0.01
...     ]],
... }
>>> from pprint import pprint
>>> pprint(data["technology1"][0])
[20,
0.02,
u'10.00,106.10,107.00,107.00,0.45',
u'24.00,-47.15,-49.50,-51.00,0.12',
u'11.00,0.35,0.00,0.00,0.92',
u'0.00',
0.04,
0.16,
u'0.223196881092',
u'f',
0.02]
>>> pprint(row_to_values(data["technology1"][0]))
[20,
0.02,
10.0,
106.1,
107.0,
107.0,
0.45,
24.0,
-47.15,
-49.5,
-51.0,
0.12,
11.0,
0.35,
0.0,
0.0,
0.92,
0.0,
0.04,
0.16,
0.223196881092,
u'f',
0.02]
So the function call expands a row to include all of the float values from the strings, and returns a new list object.
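The function above relies on the Python 2 `unicode` type and special-cases the string u'f'. As a rough sketch of how the same idea might look in Python 3, the variant below tests for `str` and falls back to keeping any string that does not parse as numbers. The name `row_to_values_py3` and the try/except fallback are assumptions made for this illustration, not part of the original answer.

```python
def row_to_values_py3(row):
    """Expand comma-separated numeric strings in a row into floats.

    Hypothetical Python 3 variant: any string that cannot be parsed
    as numbers (such as 'f') is kept unchanged.
    """
    values = []
    for col in row:
        if isinstance(col, str):
            try:
                # Parse every piece first, so a failure adds nothing partial.
                parsed = [float(v) for v in col.split(",")]
            except ValueError:
                values.append(col)
            else:
                values.extend(parsed)
        else:
            values.append(col)
    return values


row = [20, 0.02, "10.00,106.10,107.00,107.00,0.45", "0.00", "f", 0.02]
expanded = row_to_values_py3(row)
```

One caveat of this fallback: `float()` also accepts strings such as "nan" and "inf", so a text column containing those would be converted as well.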
Use the function to replace all of the rows in all of the dictionary values:
>>> for value in data.values():
...     value[:] = [row_to_values(row) for row in value]
...
We can see that the first row we looked at earlier has been updated:
>>> pprint(data["technology1"][0])
[20,
0.02,
10.0,
106.1,
107.0,
107.0,
0.45,
24.0,
-47.15,
-49.5,
-51.0,
0.12,
11.0,
0.35,
0.0,
0.0,
0.92,
0.0,
0.04,
0.16,
0.223196881092,
u'f',
0.02]
As has the rest of the dictionary:
>>> pprint(data)
{'technology1': [[20,
0.02,
10.0,
106.1,
107.0,
107.0,
0.45,
24.0,
-47.15,
-49.5,
-51.0,
0.12,
11.0,
0.35,
0.0,
0.0,
0.92,
0.0,
0.04,
0.16,
0.223196881092,
u'f',
0.02],
[100,
0.02,
10.0,
106.1,
107.0,
107.0,
0.45,
24.0,
-47.15,
-49.5,
-51.0,
0.12,
11.0,
0.35,
0.0,
0.0,
0.92,
0.0,
0.04,
0.16,
0.223196881092,
u'f',
0.01]]}
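Since the question asks for one array per technology, with rows and columns, it may help to confirm that every expanded row has the same length before treating the result as a table. The sketch below is one possible way to do that in Python 3. The helper `check_rectangular` and the abbreviated sample rows are hypothetical and are not part of the original answer.

```python
def check_rectangular(rows):
    """Return the common row length, or raise ValueError if rows differ."""
    lengths = {len(row) for row in rows}
    if len(lengths) > 1:
        raise ValueError("rows have differing lengths: %s" % sorted(lengths))
    return lengths.pop() if lengths else 0


# Two expanded rows, abbreviated here for illustration.
rows = [
    [20, 0.02, 10.0, 106.1, "f", 0.02],
    [100, 0.02, 10.0, 106.1, "f", 0.01],
]

width = check_rectangular(rows)
columns = list(zip(*rows))  # transpose: one tuple per column
```

Here `columns[0]` collects the first element of every row, so column-wise access works without any extra library once the rows are known to be the same length.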