How do I split a string with multiple word delimiters in Python?

Problem description

I want an efficient way to split a list of strings using a list of words as the delimiters. The output is another list of strings.

I tried chaining multiple .split() calls on a single line, which does not work because the first .split() returns a list and each subsequent .split() requires a string.

Here is the input:

words = ["hello my name is jolloopp", "my jolloopp name is hello"]
splitters = ['my', 'is']

I want the output to be

final_list = ["hello ", " name ", " jolloopp", " jolloopp name ", " hello"]

Note the spaces.

It is also possible to have something like

draft_list = [["hello ", " name ", " jolloopp"], [" jolloopp name ", " hello"]]

which can then be flattened into final_list (for example with itertools.chain.from_iterable), but the ideal case is

ideal_list = ["hello", "name", "jolloopp", "jolloopp name", "hello"]

where the spaces have been stripped, which is similar to using .strip().
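
The flattening and stripping described above can be sketched without numpy. This is a minimal sketch, assuming draft_list is nested exactly one level deep; it uses itertools.chain.from_iterable from the standard library, since the sublists have different lengths and do not form a regular array.

```python
from itertools import chain

draft_list = [["hello ", " name ", " jolloopp"], [" jolloopp name ", " hello"]]

# Flatten one level of nesting, then strip surrounding whitespace from each piece.
ideal_list = [piece.strip() for piece in chain.from_iterable(draft_list)]

print(ideal_list)
# ['hello', 'name', 'jolloopp', 'jolloopp name', 'hello']
```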

EDIT 1:

Using re.split does not fully work when a delimiter also occurs inside another word. For example, if

words = ["hellois my name is myjolloopp", "my isjolloopp name is myhello"]
splitters = ['my', 'is']

then the output would be

['hello', '', 'name', '', 'jolloopp', '', 'jolloopp name', '', 'hello']

when it should be

['hellois', 'name', 'myjolloopp', 'isjolloopp name', 'myhello']

This is a known issue with solutions using re.split.
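
To see where the extra empty strings come from, the sketch below lists the matches of a plain alternation on the first input string. It assumes the pattern was built as 'my|is'; the exact pattern used in the attempt is not shown here, so treat this as an illustration rather than a reproduction.

```python
import re

text = "hellois my name is myjolloopp"

# Assumed pattern: a plain alternation with no word-boundary anchors.
pattern = re.compile("my|is")

# Each match is reported as (start index, matched text).
matches = [(m.start(), m.group()) for m in pattern.finditer(text)]
print(matches)
# [(5, 'is'), (8, 'my'), (16, 'is'), (19, 'my')]

pieces = pattern.split(text)
print(pieces)
# ['hello', ' ', ' name ', ' ', 'jolloopp']
```

The match at index 5 falls inside hellois and the one at index 19 inside myjolloopp, so both words are cut apart, and the whitespace-only pieces become empty strings once stripped.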

EDIT 2:

[x.strip() for x in re.split(' | '.join(splitters), ''.join(words))]

does not work properly when the input is

words = ["hello world", "hello my name is jolloopp", "my jolloopp name is hello"]

The output becomes

['hello worldhello', 'name', 'jolloopp', 'jolloopp name', 'hello']

when the output should be

['hello world', 'hello', 'name', 'jolloopp', 'jolloopp name', 'hello']

Tags: python, string, list, split, delimiter

Solution


You could use the re module for this.

Updated to use word boundaries (\b) instead of :space:, as suggested by @pault:

>>> import re
>>> words = ['hello world', 'hello my name is jolloopp', 'my jolloopp name is hello']
>>> splitters = ['my', 'is']
>>> # Split each string on the whole-word delimiters, then flatten the per-string results.
>>> [z for y in (re.split('|'.join(r'\b{}\b'.format(x) for x in splitters), word) for word in words) for z in y]
['hello world', 'hello ', ' name ', ' jolloopp', '', ' jolloopp name ', ' hello']
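
The result above still contains padded and empty strings. One possible way to reach the ideal list from the question is to strip each piece and discard the empty ones. The sketch below extends the answer and is not part of it: split_on_words is a hypothetical helper name, and re.escape is added on the assumption that a delimiter might contain regex metacharacters.

```python
import re


def split_on_words(strings, splitters):
    """Split each string on whole-word delimiters; strip pieces and drop empty ones."""
    # re.escape guards against delimiters that contain regex metacharacters;
    # \b on both sides restricts each delimiter to whole-word matches.
    pattern = re.compile(
        "|".join(r"\b{}\b".format(re.escape(s)) for s in splitters)
    )
    return [
        piece.strip()
        for text in strings               # split each string separately
        for piece in pattern.split(text)  # flatten the per-string pieces
        if piece.strip()                  # skip empty or whitespace-only pieces
    ]


words = ["hello world", "hello my name is jolloopp", "my jolloopp name is hello"]
splitters = ["my", "is"]

print(split_on_words(words, splitters))
# ['hello world', 'hello', 'name', 'jolloopp', 'jolloopp name', 'hello']
```

Called with the EDIT 1 input, the same helper leaves hellois and myjolloopp intact, because \b only matches where a word starts or ends. Note that \b relies on each delimiter beginning and ending with a word character.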
