Splitting unicode strings in a generic list

Problem description

My data looks like this:

data = {"technology1": [
    [
        20, 0.02,
        u'10.00,106.10,107.00,107.00,0.45',
        u'24.00,-47.15,-49.50,-51.00,0.12',
        u'11.00,0.35,0.00,0.00,0.92',
        u'0.00', 0.04, 0.16, u'0.223196881092', u'f', 0.02,
    ],
    [
        100, 0.02,
        u'10.00,106.10,107.00,107.00,0.45',
        u'24.00,-47.15,-49.50,-51.00,0.12',
        u'11.00,0.35,0.00,0.00,0.92', u'0.00', 0.04,
        0.16, u'0.223196881092', u'f', 0.01
    ] ... ],

    "technology2": ...}

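As you can see, it is a dictionary in which each key maps to a list of lists, and all of the lists have the same format. Each "inner" list contains a mix of integers and floats. There are also some unicode strings: some hold a single value, and some hold a set of 5 comma-separated numbers.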

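What I want:

Make an array for each technology. In each array, the rows would be the "outer" lists above, and the columns would be the different elements of the "inner" lists. Ideally the unicode values would be converted to strings (since I know better how to work with those), and each set of 5 numbers inside a unicode string would be expanded into one element per number.

I.e. the array for technology1: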

20, 0.02, 10.00, 106.10, ... "f", 0.02
100, 0.02, ...            "f", 0.01

Attempts so far:

for tech in data:

    features = data[tech] # i.e. grab technologyn
    for row in features:
        for i in row[2:5]: # 2 til 5 defines the instance which are sets of 5
            #print i,"\n"
            i = str(i)
            i = i.split(',')

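This doesn't work: when I look at features after the code has run, it is exactly the same!

This is not an attempt at a complete solution, since it obviously doesn't convert all of the unicode values to strings, but it was meant as a stepping stone. I also tried a list comprehension like this: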

for row in features:
    [i.split(',') for i in row if (type(i)==unicode and "," in i)]

Tags: python, string, unicode, split

Solution

You need to create a new list object for each row, and then replace the original list values:

def row_to_values(row):
    values = []
    for col in row:
        if isinstance(col, unicode) and col != u'f':
            # split and convert all entries to float
            values += (float(v) for v in col.split(','))
        else:
            values.append(col)
    return values

for value in data.values():
    value[:] = [row_to_values(row) for row in value]

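The value[:] = ... assignment tells Python to replace all indices contained in the list object with a new set of objects. Since each value is one of the outer lists in the data dictionary, this replaces all of the sublists with the expanded rows.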
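
The difference between rebinding a name and assigning to a slice can be seen in a small sketch. The names rows and alias are made up for illustration and are not part of the original data:

```python
# Hypothetical names, for illustration only.
rows = [u'1.0,2.0', u'3.0,4.0']
alias = rows  # a second reference to the same list object

# Rebinding the loop variable leaves the list untouched, which is
# why the attempts in the question appeared to change nothing.
for item in rows:
    item = item.split(',')
unchanged = list(rows)  # snapshot: still the original strings

# Slice assignment replaces the contents of the existing list object,
# so every reference to that list sees the new values.
rows[:] = [item.split(',') for item in rows]
```

After the slice assignment, alias and rows still refer to the same list, and both now hold the split values.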

A demo with part of your sample data:

>>> data = {"technology1": [
...        [
...        20, 0.02,
...       u'10.00,106.10,107.00,107.00,0.45',
...       u'24.00,-47.15,-49.50,-51.00,0.12',
...       u'11.00,0.35,0.00,0.00,0.92',
...       u'0.00',0.04,0.16, u'0.223196881092', u'f',0.02,
...      ],
...       [
...        100, 0.02,
...   u'10.00,106.10,107.00,107.00,0.45',
...   u'24.00,-47.15,-49.50,-51.00,0.12',
...   u'11.00,0.35,0.00,0.00,0.92', u'0.00', 0.04,
...   0.16, u'0.223196881092',  u'f', 0.01
...    ]],
... }
>>> from pprint import pprint
>>> pprint(data["technology1"][0])
[20,
 0.02,
 u'10.00,106.10,107.00,107.00,0.45',
 u'24.00,-47.15,-49.50,-51.00,0.12',
 u'11.00,0.35,0.00,0.00,0.92',
 u'0.00',
 0.04,
 0.16,
 u'0.223196881092',
 u'f',
 0.02]
>>> pprint(row_to_values(data["technology1"][0]))
[20,
 0.02,
 10.0,
 106.1,
 107.0,
 107.0,
 0.45,
 24.0,
 -47.15,
 -49.5,
 -51.0,
 0.12,
 11.0,
 0.35,
 0.0,
 0.0,
 0.92,
 0.0,
 0.04,
 0.16,
 0.223196881092,
 u'f',
 0.02]

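So a single row can be expanded to contain all of the float values from the strings, with the function call returning a new list object.

Using that function to replace all rows in all of the dictionary values: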
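
On Python 3 the unicode type no longer exists and all text is str. A minimal sketch of the same function for Python 3 follows. It assumes that any string which fails to parse as floats should be kept unchanged; this try/except replaces the explicit u'f' comparison and is an assumption, not part of the code above:

```python
def row_to_values(row):
    """Return a new list with numeric strings expanded into floats."""
    values = []
    for col in row:
        if isinstance(col, str):
            try:
                # Build the full list first, so a failed conversion
                # cannot leave a partially expanded column in values.
                values.extend([float(v) for v in col.split(',')])
            except ValueError:
                # Assumption: non-numeric text such as 'f' is kept as is.
                values.append(col)
        else:
            values.append(col)
    return values


# Shortened, made-up row in the same shape as the sample data.
row = [20, 0.02, '10.00,106.10', '0.00', 'f', 0.02]
expanded = row_to_values(row)
```

The original row is left untouched; the function only builds and returns a new list.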

>>> for value in data.values():
...     value[:] = [row_to_values(row) for row in value]
...

We can see that the first row we looked at earlier has been updated:

>>> pprint(data["technology1"][0])
[20,
 0.02,
 10.0,
 106.1,
 107.0,
 107.0,
 0.45,
 24.0,
 -47.15,
 -49.5,
 -51.0,
 0.12,
 11.0,
 0.35,
 0.0,
 0.0,
 0.92,
 0.0,
 0.04,
 0.16,
 0.223196881092,
 u'f',
 0.02]

As has the rest of the dictionary:

>>> pprint(data)
{'technology1': [[20,
                  0.02,
                  10.0,
                  106.1,
                  107.0,
                  107.0,
                  0.45,
                  24.0,
                  -47.15,
                  -49.5,
                  -51.0,
                  0.12,
                  11.0,
                  0.35,
                  0.0,
                  0.0,
                  0.92,
                  0.0,
                  0.04,
                  0.16,
                  0.223196881092,
                  u'f',
                  0.02],
                 [100,
                  0.02,
                  10.0,
                  106.1,
                  107.0,
                  107.0,
                  0.45,
                  24.0,
                  -47.15,
                  -49.5,
                  -51.0,
                  0.12,
                  11.0,
                  0.35,
                  0.0,
                  0.0,
                  0.92,
                  0.0,
                  0.04,
                  0.16,
                  0.223196881092,
                  u'f',
                  0.01]]}
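
If column access is also wanted, as the "array" description in the question suggests, the expanded rows can be transposed with zip. This is only a sketch, assuming every row of a technology has the same length after expansion; the shortened table below is made-up sample data:

```python
# Made-up, shortened rows in the same shape as the expanded data.
table = [
    [20, 0.02, 10.0, 106.1, u'f', 0.02],
    [100, 0.02, 10.0, 106.1, u'f', 0.01],
]

# zip(*table) groups the n-th element of every row, giving the columns.
columns = [list(col) for col in zip(*table)]

first_column = columns[0]
last_column = columns[-1]
```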
