Splitting a sequence where a character is repeated exactly twice

Problem description

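I want to split a sequence wherever a character is repeated exactly twice, and keep the separators in the result. Is there a shorter regular expression than the one I am using?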

In [101]: seq='tgtttccgagtgacccgagatagaaacttaccgga'

In [102]: l=[ s for s in re.split(r"(?<!a)(a{2})(?!a)|(?<!g)(g{2})(?!g)|(?<!c)(c{2})(?!c)|(?<!t)(t{2})(?!t)",seq) if s ]

In [103]: l
Out[103]: ['tgttt', 'cc', 'gagtgacccgagatagaaac', 'tt', 'a', 'cc', 'gg', 'a']

In [104]: ''.join(l)==seq
Out[104]: True

Tags: python, regex, python-3.x

Solution


Instead of a regular expression, use itertools.groupby:

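import itertools

def get_combos(groups):
    # groups yields (is_pair, runs); is_pair is True when the runs are exactly two characters long
    for is_pair, runs in groups:
        if is_pair:
            # emit every exact pair as its own piece
            yield from runs
        else:
            # merge all other adjacent runs back into a single piece
            yield ''.join(runs)

seq = 'tgtttccgagtgacccgagatagaaacttaccgga'
# first pass: collapse seq into runs of identical characters ('t', 'g', 'ttt', 'cc', ...)
new_seq = [''.join(b) for _, b in itertools.groupby(seq)]
# second pass: group adjacent runs by whether their length is exactly 2
# (each run holds one repeated character, so the length check is sufficient)
final_result = list(get_combos(itertools.groupby(new_seq, key=lambda x: len(x) == 2)))
print(final_result)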

Output:

['tgttt', 'cc', 'gagtgacccgagatagaaac', 'tt', 'a', 'cc', 'gg', 'a']
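
If you would still like a regular expression involved, one possible sketch is to let a short pattern find the runs of identical characters and do the merging in plain Python. This is not a drop-in replacement for the re.split call in the question. The function name split_exact_pairs is made up for illustration, and re.DOTALL is only there on the assumption that the input might contain newlines.

```python
import re

def split_exact_pairs(seq):
    """Split seq so that every run of exactly two equal characters is its own piece."""
    pieces = []
    buffer = []  # collects consecutive runs whose length is not exactly 2
    # (.)\1* matches one maximal run of a repeated character at a time
    for match in re.finditer(r"(.)\1*", seq, flags=re.DOTALL):
        run = match.group(0)
        if len(run) == 2:
            # flush whatever was collected before this pair, then emit the pair
            if buffer:
                pieces.append("".join(buffer))
                buffer = []
            pieces.append(run)
        else:
            buffer.append(run)
    if buffer:
        pieces.append("".join(buffer))
    return pieces

seq = 'tgtttccgagtgacccgagatagaaacttaccgga'
result = split_exact_pairs(seq)
print(result)
```

The pattern no longer has to list each letter, so the same function should work for alphabets other than a, c, g, t, and joining the pieces gives back the original sequence.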
