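javascript - Python regex: create a pandas DataFrame from a JavaScript string that uses new Array()
Question
I am trying to iterate over a JavaScript response to build a pandas DataFrame. I am just getting started with regex and JS, so there are probably obvious improvements that would make the code more robust. The JavaScript response returns the following string: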
mmSuggestDeliver(0, new Array("Name", "Category", "Keywords", "Bias", "Extension", "IDs"), new Array(new Array("HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"),new Array("HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046"),new Array("HSBC Trinkaus & Burkhardt AG", "Stocks", "|DE0008115106|||TUBG", "75", "", "hsbc_trinkausburkhardt||1|3774"),new Array("HSBC Bank Malta Plc Registered Shs", "Stocks", "|MT0000030107|||", "75", "", "hsbc_bank_malta||1|16831644"),new Array("HSBC-D7 SA de CV SIID (A)", "Stocks", "|MX51HS0Q00E8|||", "75", "", "hsbc-d7||1|5125971"),new Array("HSBC US Buy-Out GmbH & Co. KGaA", "Stocks", "|DE000A0MM6H7|||", "75", "", "hsbc_us_buy-out_gmbhco||1|23145"),new Array("HSBC Holdings PLC ADR Cert Deposito Arg Repr 0.5 ADRs", "Stocks", "|ARDEUT112257|||", "75", "", "hsbc_2||1|1399269"),new Array("HSBC Holdings PLC6.2 % Pfd Shs Sponsored American Deposit Repr 1/40th 6.2 % PfdShs Ser -A-", "Stocks", "HSBC.PA|US4042806046|HSBC.PA||", "75", "", "hsbc-pa|HSBC.PA|1|19327"),new Array("HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-Without Fixed Maturity Series -2-", "Stocks", "HSEB|US4042808026|HSEB||", "75", "", "hseb|HSEB|1|5083319"),new Array("HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap Secs 2008-Exch into Non-Cum Dollar Pref Shs", "Stocks", "HSEA|US4042807036|HSEA||", "75", "", "hsea|HSEA|1|3782270")), 10, 0);
I would like to organize the data as the following table, but without the double quotes:
"Name", "Category", "Keywords", "Bias", "Extension", "IDs"
"HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"
"HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046"
......
"HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-Without Fixed Maturity Series -2-", "Stocks", "HSEB|US4042808026|HSEB||", "75", "", "hseb|HSEB|1|5083319"
"HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap Secs 2008-Exch into Non-Cum Dollar Pref Shs", "Stocks", "HSEA|US4042807036|HSEA||", "75", "", "hsea|HSEA|1|3782270"
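Ideally, I would like to end up with the result in a pandas DataFrame.
The code below works, but with a lot of caveats. Any optimizations, improvements, or corrections would be greatly appreciated:
Code to get the column names:
import re as _re
import pandas as _pd

js_text = """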
mmSuggestDeliver(0, new Array("Name", "Category", "Keywords", "Bias", "Extension", "IDs"),
new Array(new Array("HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"),
new Array("HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046"),
new Array("HSBC Trinkaus & Burkhardt AG", "Stocks", "|DE0008115106|||TUBG", "75", "", "hsbc_trinkausburkhardt||1|3774"),
new Array("HSBC Bank Malta Plc Registered Shs", "Stocks", "|MT0000030107|||", "75", "", "hsbc_bank_malta||1|16831644"),
new Array("HSBC-D7 SA de CV SIID (A)", "Stocks", "|MX51HS0Q00E8|||", "75", "", "hsbc-d7||1|5125971"),
new Array("HSBC US Buy-Out GmbH & Co. KGaA", "Stocks", "|DE000A0MM6H7|||", "75", "", "hsbc_us_buy-out_gmbhco||1|23145"),
new Array("HSBC Holdings PLC ADR Cert Deposito Arg Repr 0.5 ADRs", "Stocks", "|ARDEUT112257|||", "75", "", "hsbc_2||1|1399269"),
new Array("HSBC Holdings PLC6.2 % Pfd Shs Sponsored American Deposit Repr 1/40th 6.2 % PfdShs Ser -A-", "Stocks", "HSBC.PA|US4042806046|HSBC.PA||", "75", "", "hsbc-pa|HSBC.PA|1|19327"),new Array("HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-Without Fixed Maturity Series -2-", "Stocks", "HSEB|US4042808026|HSEB||", "75", "", "hseb|HSEB|1|5083319"),new Array("HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap Secs 2008-Exch into Non-Cum Dollar Pref Shs", "Stocks", "HSEA|US4042807036|HSEA||", "75", "", "hsea|HSEA|1|3782270")), 10, 0);
"""
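# Grab everything between the first "new Array(" and the last ")" on the same line (the header array).
regex_text = r"new Array\((.*)\)"
column_header = _re.search(regex_text, js_text, flags=_re.MULTILINE).group(1)
# Split on ", " only where it is followed by an even number of double quotes, i.e. outside a quoted string.
regex_text = ', (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)'
column_header = _re.split(regex_text, column_header, flags=_re.MULTILINE)
print('regex1:', column_header)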
Prints: regex1: ['"Name"', '"Category"', '"Keywords"', '"Bias"', '"Extension"', '"IDs"'] (a mix of single and double quotes...)
Code to get the data:
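# Lazily capture the contents of each "new Array(...)," up to the first "),".
regex_text = r"new Array\(([\s\S]*?)\),"
table_rows = _re.findall(regex_text, js_text, flags=_re.MULTILINE)
# The first match is the header array, so drop it.
table_rows.pop(0)
# The first data row still carries the outer "new Array(" prefix.
table_rows[0] = str(table_rows[0]).replace('new Array(', '')
my_data = []
regex_text = ', (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)'
for my_row in table_rows:
    # Split each row on the commas that sit outside quoted strings.
    my_row = _re.split(regex_text, my_row, flags=_re.MULTILINE)
    print('row is:', my_row)
    my_data.append(my_row)
result_df = _pd.DataFrame(data=my_data, columns=column_header)
print(result_df)
print(result_df.dtypes)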
This prints a huge DataFrame:
"Name" ... "IDs"
0 "HSBC Holdings plc (Spons. ADRs)" ... "hsbc|HSBC|1|4917"
1 "HSBC Holdings plc" ... "hsbc-gb0005405286||1|1046"
2 "HSBC Trinkaus & Burkhardt AG" ... "hsbc_trinkausburkhardt||1|3774"
3 "HSBC Bank Malta Plc Registered Shs" ... "hsbc_bank_malta||1|16831644"
4 "HSBC-D7 SA de CV SIID (A)" ... "hsbc-d7||1|5125971"
5 "HSBC US Buy-Out GmbH & Co. KGaA" ... "hsbc_us_buy-out_gmbhco||1|23145"
6 "HSBC Holdings PLC ADR Cert Deposito Arg Repr ... ... "hsbc_2||1|1399269"
7 "HSBC Holdings PLC6.2 % Pfd Shs Sponsored Amer... ... "hsbc-pa|HSBC.PA|1|19327"
8 "HSBC Holdings PLC 8 % Perp Sub Cap Secs 2010-... ... "hseb|HSEB|1|5083319"
9 "HSBC Holdings PLC 8 1-8 % Perpetual Sub Cap S... ... "hsea|HSEA|1|3782270")
[10 rows x 6 columns]
"Name" object
"Category" object
"Keywords" object
"Bias" object
"Extension" object
"IDs" object
dtype: object
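The data code has a number of caveats, including:
- If the "new Array" part of the JS has more or fewer spaces, the code will not find the row. I originally used the regex start and end anchors, but then another problem came up when there is no space between the "," and "new Array": regex: "new Array(\"(.*))"
- All values are wrapped in double quotes
- It doesn't look very Pythonic...
Thank you for your help!
So far I have looked at the following links: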
Answer
Try searching the input string in a loop (until there are no more matches) for the following regex:
\bnew\s+Array\s*[(]\s*(".*?)(?=[)]\s*,\s*new\s+Array|\s*[)]\s*[)]\s*[^(]+$)
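Then, on each iteration, take the first capture group.
The first iteration should give you the header, and the following iterations should give you the data rows.
There is a demo here (the data highlighted in green is what you will keep in the first capture group).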
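As a rough illustration of that loop in Python, the sketch below applies the regex with `re.finditer` to a shortened copy of the string from the question (three of the ten rows). The helper name `parse_arrays` is hypothetical, and turning each captured group into a list with `json.loads` is an assumption that goes beyond the answer: it only works while every value is a double-quoted literal that is also valid JSON.

```python
import json
import re

# Shortened sample of the response from the question (three of the ten rows).
js_text = (
    'mmSuggestDeliver(0, new Array("Name", "Category", "Keywords", "Bias", "Extension", "IDs"), '
    'new Array(new Array("HSBC Holdings plc (Spons. ADRs)", "Stocks", "HSBC|US4042804066|HSBC||", "75", "", "hsbc|HSBC|1|4917"),'
    'new Array("HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046"),'
    'new Array("HSBC Trinkaus & Burkhardt AG", "Stocks", "|DE0008115106|||TUBG", "75", "", "hsbc_trinkausburkhardt||1|3774")), 10, 0);'
)

# The regex from the answer: group 1 holds the quoted values of one array.
ARRAY_RE = re.compile(
    r'\bnew\s+Array\s*[(]\s*(".*?)'
    r'(?=[)]\s*,\s*new\s+Array|\s*[)]\s*[)]\s*[^(]+$)'
)


def parse_arrays(text):
    """Return one list of strings per innermost new Array(...) call.

    Hypothetical helper: wrapping the captured group in [] and feeding it to
    json.loads assumes the values are JSON-compatible double-quoted strings.
    """
    return [json.loads("[" + m.group(1) + "]") for m in ARRAY_RE.finditer(text)]


# The first match is the header; the remaining matches are the data rows.
header, *rows = parse_arrays(js_text)
```

Because `json.loads` already strips the surrounding double quotes, `header` and `rows` can be handed to `pandas.DataFrame(rows, columns=header)` if pandas is available.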
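If your regex engine supports \K (Python's built-in re module does not; the third-party regex module does), here is another option that might work for you: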
\bnew\s+Array\s*[(]\s*\K"(?:[^"\\]+|\\.)*"\s*(?:,\s*"(?:[^"\\]++|\\.)*")+
In this case you don't need a capture group. All the data will be in the match itself.
There is a demo of the second approach here.
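Since the standard-library `re` module has no `\K`, one possible adaptation of this second pattern is sketched below. It is not the answer's exact regex: a capture group stands in for `\K`, the possessive `++` is dropped so the pattern also compiles on Python versions before 3.11, and a single-element array is allowed. The names `STRING`, `LITERALS_RE`, and `parse_arrays` are made up for the example, and parsing the group with `json.loads` again assumes JSON-compatible string literals.

```python
import json
import re

# Shortened sample from the question: the header plus two data rows.
js_text = (
    'mmSuggestDeliver(0, new Array("Name", "Category", "Keywords", "Bias", "Extension", "IDs"), '
    'new Array(new Array("HSBC Holdings plc", "Stocks", "|GB0005405286|||HSBA", "75", "", "hsbc-gb0005405286||1|1046"),'
    'new Array("HSBC-D7 SA de CV SIID (A)", "Stocks", "|MX51HS0Q00E8|||", "75", "", "hsbc-d7||1|5125971")), 10, 0);'
)

# One double-quoted string literal, allowing backslash escapes.
STRING = r'"(?:[^"\\]|\\.)*"'

# "new Array(" followed by a comma-separated run of string literals.
# The capture group plays the role that \K has in the original pattern.
LITERALS_RE = re.compile(
    r'\bnew\s+Array\s*[(]\s*(' + STRING + r'(?:\s*,\s*' + STRING + r')*)'
)


def parse_arrays(text):
    """Hypothetical helper: one list of strings per array of literals."""
    return [json.loads("[" + m.group(1) + "]") for m in LITERALS_RE.finditer(text)]


header, *rows = parse_arrays(js_text)
```

Matching the string literals themselves, rather than looking ahead for the next `new Array`, means parentheses inside a value such as "HSBC-D7 SA de CV SIID (A)" cannot end a row early.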