python - How do I split a string with multiple word delimiters in Python?
Problem description
I want an efficient way to split a list of strings using a list of words as the delimiters. The output is another list of strings.
I tried chaining multiple .split calls in a single line, which does not work because the first .split returns a list and each succeeding .split requires a string.
Here is the input:
words = ["hello my name is jolloopp", "my jolloopp name is hello"]
splitters = ['my', 'is']
I want the output to be
final_list = ["hello ", " name ", " jolloopp", " jolloopp name ", " hello"]
Note the spaces.
It is also possible to have something like
draft_list = [["hello ", " name ", " jolloopp"], [" jolloopp name ", " hello"]]
which can be flattened using something like numpy reshape(-1,1) to get final_list, but the ideal case is
ideal_list = ["hello", "name", "jolloopp", "jolloopp name", "hello"]
where the spaces have been stripped, similar to using .strip().
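For reference, a nested result like draft_list can be flattened and stripped with the standard library alone. This is a minimal sketch that swaps in itertools.chain.from_iterable instead of numpy; the variable names follow the question.

```python
from itertools import chain

draft_list = [["hello ", " name ", " jolloopp"], [" jolloopp name ", " hello"]]

# Flatten one level of nesting; the sublists may have different lengths.
final_list = list(chain.from_iterable(draft_list))

# Strip surrounding whitespace from each piece.
ideal_list = [piece.strip() for piece in final_list]

print(final_list)
print(ideal_list)
```

chain.from_iterable does not require the sublists to have equal length, which matters here because each input string can yield a different number of pieces.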
EDIT 1:
Using re.split doesn't fully work if the word delimiters are part of other words. If the input is
words = ["hellois my name is myjolloopp", "my isjolloopp name is myhello"]
splitters = ['my', 'is']
then the output would be
['hello', '', 'name', '', 'jolloopp', '', 'jolloopp name', '', 'hello']
when it should be
['hellois', 'name', 'myjolloopp', 'isjolloopp name', 'myhello']
This is a known issue with solutions using re.split.
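The same kind of over-splitting can be seen with a pattern that joins the delimiters as plain alternatives. This is a small sketch, not the exact code that produced the output above; naive_pattern and naive are illustrative names that do not appear in the question.

```python
import re

words = ["hellois my name is myjolloopp", "my isjolloopp name is myhello"]
splitters = ['my', 'is']

# Plain alternation: matches "my" or "is" anywhere, including inside other words.
naive_pattern = '|'.join(re.escape(s) for s in splitters)

# Split every string, strip each piece, and flatten into one list.
naive = [piece.strip() for word in words for piece in re.split(naive_pattern, word)]
print(naive)
```

Because the pattern carries no notion of word boundaries, "hellois" loses its trailing "is" and "myjolloopp" loses its leading "my", and stripping the leftover spaces yields empty strings.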
EDIT 2:
[x.strip() for x in re.split(' | '.join(splitters), ''.join(words))]
does not work properly when the input is
words = ["hello world", "hello my name is jolloopp", "my jolloopp name is hello"]
The output becomes
['hello worldhello', 'name', 'jolloopp', 'jolloopp name', 'hello']
when the output should be
['hello world', 'hello', 'name', 'jolloopp', 'jolloopp name', 'hello']
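The merged first entry comes from ''.join(words), which glues adjacent strings together before splitting, so the boundary between "hello world" and "hello my name is jolloopp" is lost. A short sketch of that effect, with joined and result as illustrative names:

```python
import re

words = ["hello world", "hello my name is jolloopp", "my jolloopp name is hello"]
splitters = ['my', 'is']

# Joining with an empty string removes the boundary between list items.
joined = ''.join(words)
print(joined)

# The pattern 'my | is' then splits the single merged string.
result = [x.strip() for x in re.split(' | '.join(splitters), joined)]
print(result)
```

Splitting each string separately, rather than the joined text, keeps the item boundaries intact.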
Solution
You could use re, like this (updated to use word boundaries, \b, instead of :space:, as suggested by @pault):
>>> import re
>>> words = ['hello world', 'hello my name is jolloopp', 'my jolloopp name is hello']
>>> splitters = ['my', 'is']
>>> # Split each string on whole-word matches of the splitters, then flatten the results.
>>> [z for y in (re.split('|'.join(r'\b{}\b'.format(x) for x in splitters), word) for word in words) for z in y]
['hello world', 'hello ', ' name ', ' jolloopp', '', ' jolloopp name ', ' hello']
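The output above still carries the surrounding spaces and one empty string (from the string that starts with a delimiter). To get closer to the ideal_list asked for in the question, the pieces can be stripped and the empty ones dropped. The following is a minimal sketch; split_on_words is a hypothetical helper name, and it assumes every splitter starts and ends with a word character so that \b behaves as intended.

```python
import re

def split_on_words(words, splitters):
    """Split each string on whole-word delimiters; return stripped, non-empty pieces."""
    if not splitters:
        # Nothing to split on: only strip and drop empty strings.
        return [w.strip() for w in words if w.strip()]
    # re.escape guards against regex metacharacters in the splitters (an added precaution).
    pattern = re.compile('|'.join(r'\b{}\b'.format(re.escape(s)) for s in splitters))
    # Split string by string so that list items are never merged.
    pieces = (piece.strip() for word in words for piece in pattern.split(word))
    return [piece for piece in pieces if piece]

words = ['hello world', 'hello my name is jolloopp', 'my jolloopp name is hello']
splitters = ['my', 'is']
print(split_on_words(words, splitters))
```

Compiling the pattern once avoids rebuilding it for every string, and filtering after stripping removes both genuinely empty pieces and pieces that consisted only of whitespace.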