Python regex: creating a pandas DataFrame from a JavaScript string that uses new Array()

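Problem description

I am trying to iterate over a JavaScript response to build a pandas DataFrame. I am just getting started with regular expressions and JS, so there may be obvious improvements that would make the code more reliable. The JavaScript response returns the following string: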


    mmSuggestDeliver(0, new Array("Name", "Category", "Keywords", "Bias", "Extension", "IDs"), new Array(new Array("HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"),new Array("HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046"),new Array("HSBC Trinkaus & Burkhardt AG", "Stocks", "|DE0008115106|||TUBG", "75", "", "hsbc_trinkausburkhardt||1|3774"),new Array("HSBC Bank Malta Plc Registered Shs", "Stocks", "|MT0000030107|||", "75", "", "hsbc_bank_malta||1|16831644"),new Array("HSBC-D7 SA de CV SIID (A)", "Stocks", "|MX51HS0Q00E8|||", "75", "", "hsbc-d7||1|5125971"),new Array("HSBC US Buy-Out GmbH & Co. KGaA", "Stocks", "|DE000A0MM6H7|||", "75", "", "hsbc_us_buy-out_gmbhco||1|23145"),new Array("HSBC Holdings PLC ADR Cert Deposito Arg Repr 0.5 ADRs", "Stocks", "|ARDEUT112257|||", "75", "", "hsbc_2||1|1399269"),new Array("HSBC Holdings PLC6.2 % Pfd Shs Sponsored American Deposit Repr 1/40th 6.2 % PfdShs Ser -A-", "Stocks", "HSBC.PA|US4042806046|HSBC.PA||", "75", "", "hsbc-pa|HSBC.PA|1|19327"),new Array("HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-Without Fixed Maturity Series -2-", "Stocks", "HSEB|US4042808026|HSEB||", "75", "", "hseb|HSEB|1|5083319"),new Array("HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap Secs 2008-Exch into Non-Cum Dollar Pref Shs", "Stocks", "HSEA|US4042807036|HSEA||", "75", "", "hsea|HSEA|1|3782270")), 10, 0);

I would like the data organized as the following table, but without the double quotes:


    "Name", "Category", "Keywords", "Bias", "Extension", "IDs"
    "HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"
    "HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046")
    ......
    "HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-Without Fixed Maturity Series -2-", "Stocks", "HSEB|US4042808026|HSEB||", "75", "", "hseb|HSEB|1|5083319"
    "HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap Secs 2008-Exch into Non-Cum Dollar Pref Shs", "Stocks", "HSEA|US4042807036|HSEA||", "75", "", "hsea|HSEA|1|3782270"

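Ideally, I would like to put the final result into a pandas DataFrame.

The code below works, but with a lot of caveats. Any optimizations, improvements and corrections would be greatly appreciated:

Code to get the column names: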

    import re as _re
    import pandas as _pd

    js_text = """
mmSuggestDeliver(0, new Array("Name", "Category", "Keywords", "Bias", "Extension", "IDs"), 
new Array(new Array("HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"),
new Array("HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046"),
new Array("HSBC Trinkaus & Burkhardt AG", "Stocks", "|DE0008115106|||TUBG", "75", "", "hsbc_trinkausburkhardt||1|3774"),
new Array("HSBC Bank Malta Plc Registered Shs", "Stocks", "|MT0000030107|||", "75", "", "hsbc_bank_malta||1|16831644"),
new Array("HSBC-D7 SA de CV SIID (A)", "Stocks", "|MX51HS0Q00E8|||", "75", "", "hsbc-d7||1|5125971"),
new Array("HSBC US Buy-Out GmbH & Co. KGaA", "Stocks", "|DE000A0MM6H7|||", "75", "", "hsbc_us_buy-out_gmbhco||1|23145"),
new Array("HSBC Holdings PLC ADR Cert Deposito Arg Repr 0.5 ADRs", "Stocks", "|ARDEUT112257|||", "75", "", "hsbc_2||1|1399269"),
new Array("HSBC Holdings PLC6.2 % Pfd Shs Sponsored American Deposit Repr 1/40th 6.2 % PfdShs Ser -A-", "Stocks", "HSBC.PA|US4042806046|HSBC.PA||", "75", "", "hsbc-pa|HSBC.PA|1|19327"),new Array("HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-Without Fixed Maturity Series -2-", "Stocks", "HSEB|US4042808026|HSEB||", "75", "", "hseb|HSEB|1|5083319"),new Array("HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap Secs 2008-Exch into Non-Cum Dollar Pref Shs", "Stocks", "HSEA|US4042807036|HSEA||", "75", "", "hsea|HSEA|1|3782270")), 10, 0);
    """

    regex_text = r"new Array\((.*)\)"
    column_header = _re.search(regex_text, js_text, flags=_re.MULTILINE).group(1)
    regex_text = ', (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)'
    column_header = _re.split(regex_text, column_header, flags=_re.MULTILINE)
    print('regex1:', column_header)

Prints: regex1: ['"Name"', '"Category"', '"Keywords"', '"Bias"', '"Extension"', '"IDs"'], a mix of single and double quotes...

Code to get the data:

    regex_text = r"new Array\(([\s\S]*?)\),"
    table_rows = _re.findall(regex_text, js_text, flags=_re.MULTILINE)
    table_rows.pop(0)
    table_rows[0] = str(table_rows[0]).replace('new Array(', '')

    my_data = []
    regex_text = ', (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)'

    for my_row in table_rows:
        my_row = _re.split(regex_text, my_row, flags=_re.MULTILINE)
        print('row is:', my_row)
        my_data.append(my_row)
    result_df = _pd.DataFrame(data=my_data, columns=column_header)
    print(result_df)
    print(result_df.dtypes)


This prints a large DataFrame:

                                                  "Name"  ...                               "IDs"
    0                  "HSBC Holdings plc (Spons. ADRs)"  ...                  "hsbc|HSBC|1|4917"
    1                                "HSBC Holdings plc"  ...         "hsbc-gb0005405286||1|1046"
    2                     "HSBC Trinkaus & Burkhardt AG"  ...    "hsbc_trinkausburkhardt||1|3774"
    3               "HSBC Bank Malta Plc Registered Shs"  ...       "hsbc_bank_malta||1|16831644"
    4                        "HSBC-D7 SA de CV SIID (A)"  ...                "hsbc-d7||1|5125971"
    5                  "HSBC US Buy-Out GmbH & Co. KGaA"  ...   "hsbc_us_buy-out_gmbhco||1|23145"
    6  "HSBC Holdings PLC ADR Cert Deposito Arg Repr ...  ...                 "hsbc_2||1|1399269"
    7  "HSBC Holdings PLC6.2 % Pfd Shs Sponsored Amer...  ...           "hsbc-pa|HSBC.PA|1|19327"
    8  "HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-...  ...               "hseb|HSEB|1|5083319"
    9  "HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap S...  ...              "hsea|HSEA|1|3782270")


    [10 rows x 6 columns]
    "Name"          object
     "Category"     object
     "Keywords"     object
     "Bias"         object
     "Extension"    object
     "IDs"          object
    dtype: object

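The data code has a number of caveats, including:

  1. If the "new Array" part of the JS has more or fewer spaces, the code will not find the row. I originally anchored the regex at the start and end, but another problem appeared when there was no space between the "," and "new Array": regex: "new Array(\"(.*))"
  2. All values are still wrapped in double quotes
  3. It does not look very Pythonic...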
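
Caveats 1 and 2 can be illustrated in isolation. The sketch below is an assumption about how they might be addressed, not code from the question: `\s+` and `\s*` make the "new Array(" prefix tolerant of spacing, and `str.strip('"')` removes the surrounding quotes from values that have already been split. The names `NEW_ARRAY`, `samples` and `clean_header` are hypothetical.

```python
import re

# Values as printed by the column-name step: each one still carries its quotes.
column_header = ['"Name"', '"Category"', '"Keywords"', '"Bias"', '"Extension"', '"IDs"']

# Caveat 2: drop the surrounding double quotes from every value.
# Assumes the values themselves never start or end with a literal quote.
clean_header = [value.strip('"') for value in column_header]

# Caveat 1: a whitespace-tolerant replacement for the literal "new Array(".
NEW_ARRAY = re.compile(r"new\s+Array\s*\(")
samples = ['new Array("a")', 'new  Array ("a")', 'new\nArray( "a")']
all_match = all(NEW_ARRAY.match(sample) for sample in samples)

print(clean_header)
print(all_match)
```

The same `strip` call could be applied to each row before it is appended to `my_data`.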

Thank you for your help!

Tags: javascript, arrays, regex, python-3.x, pandas

Solution


Try searching the input string in a loop (until there are no more matches) for the following regular expression:

\bnew\s+Array\s*[(]\s*(".*?)(?=[)]\s*,\s*new\s+Array|\s*[)]\s*[)]\s*[^(]+$)

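Then, on each iteration, take the first capture group.

The first iteration should give you the header, and the following iterations should give you the data rows.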
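
In Python, this loop could be sketched with `re.finditer`, which yields successive non-overlapping matches until none remain. Only `ROW_RE` comes from the answer; the shortened `js_text`, the helper names `parse_rows` and `to_dataframe`, and the field-splitting pattern `FIELD_RE` are assumptions made for illustration.

```python
import re

# Pattern from the answer: group 1 holds the quoted contents of one array.
ROW_RE = re.compile(
    r'\bnew\s+Array\s*[(]\s*(".*?)'
    r'(?=[)]\s*,\s*new\s+Array|\s*[)]\s*[)]\s*[^(]+$)'
)
# Assumed helper pattern: one double-quoted field, captured without its quotes.
FIELD_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')

# Shortened sample of the response from the question (first three rows only).
js_text = (
    'mmSuggestDeliver(0, new Array("Name", "Category", "Keywords", "Bias", "Extension", "IDs"), '
    'new Array(new Array("HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"),'
    'new Array("HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046"),'
    'new Array("HSBC Trinkaus & Burkhardt AG", "Stocks", "|DE0008115106|||TUBG", "75", "", "hsbc_trinkausburkhardt||1|3774")), 10, 0);'
)


def parse_rows(text):
    """Return (header, rows): the first match is the header, the rest are data."""
    records = [FIELD_RE.findall(m.group(1)) for m in ROW_RE.finditer(text)]
    return records[0], records[1:]


def to_dataframe(text):
    """Build a DataFrame from the parsed records (requires pandas)."""
    import pandas as pd  # imported lazily so parse_rows also runs without pandas
    header, rows = parse_rows(text)
    return pd.DataFrame(rows, columns=header)


header, rows = parse_rows(js_text)
print(header)
for row in rows:
    print(row)
```

Calling `to_dataframe(js_text)` should then give a frame with six unquoted columns. Every column is still of dtype `object`, so a numeric column such as `Bias` would need an explicit conversion, for example with `pd.to_numeric`.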

A demo is available here (the text highlighted in green is the data you will keep in the first capture group).

If your regex engine supports \K, here is another option that may work for you:

\bnew\s+Array\s*[(]\s*\K"(?:[^"\\]+|\\.)*"\s*(?:,\s*"(?:[^"\\]++|\\.)*")+

In this case you do not need a capture group. All of the data will be in the match itself.
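
Python's built-in `re` module has no `\K`, and possessive quantifiers such as `++` are only accepted from Python 3.11 on, so this pattern cannot be used there unchanged (the third-party `regex` module does support `\K`). A minimal sketch of the same idea for `re` replaces `\K` with a capture group and drops the possessive quantifier. This adaptation, the shortened `js_text`, and the names `QUOTED`, `ARRAY_RE`, `FIELD_RE` and `records` are assumptions, not part of the answer.

```python
import re

# One double-quoted string, allowing backslash escapes inside it.
QUOTED = r'"(?:[^"\\]|\\.)*"'
# Capture-group stand-in for \K: group 1 is the comma-separated run of strings.
ARRAY_RE = re.compile(
    r'\bnew\s+Array\s*[(]\s*(' + QUOTED + r'\s*(?:,\s*' + QUOTED + r')+)'
)
# One field, captured without its surrounding quotes.
FIELD_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')

# Shortened sample of the response from the question (first two rows only).
js_text = (
    'mmSuggestDeliver(0, new Array("Name", "Category", "Keywords", "Bias", "Extension", "IDs"), '
    'new Array(new Array("HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"),'
    'new Array("HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046")), 10, 0);'
)

# Each match is one array of strings; the outer "new Array(new Array(" is
# skipped automatically because it is not followed by a quoted string.
records = [FIELD_RE.findall(m.group(1)) for m in ARRAY_RE.finditer(js_text)]
header, rows = records[0], records[1:]

print(header)
print(rows)
```

Because this pattern matches whole quoted strings rather than scanning for a closing parenthesis, parentheses inside a value such as "(Spons. ADRs)" cannot end a row early. Like the original, it only matches arrays that contain at least two strings.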

A demo of the second approach is available here.

