Regular expression to match only subheadings

Problem description

I have a dataframe with a Titles column (see the example below):

import numpy as np
import pandas as pd

Fairytales_in = {'Titles': ['Fairy Tales',
                    'Tales.3.2.Dancing Shoes, ballgowns and frogs',
                    'Tales.2.4.6.Red Riding Hood',
                    'Fairies.1Your own Fairy godmother',
                    'Ogres-1.1.The wondrous world of Shrek',
                    'Witches-1-4Maleficient and the malicious curse',
                    'Tales.2.1.The big bad wolf',
                    'Tales.2.Little Red riding Hood',
                    'Tales.2.4.6.1.Why the huntsman is underrated',
                    'Tales.5.f.Cinderella and the pumpkin carriage',
                    'Ogres-1.Best Ogre in town',
                    'No.3.Great Expectations']}

Fairytales_in = pd.DataFrame.from_dict(Fairytales_in)

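I want to create a new column that contains exactly the same string as the Titles column, but only when the title is a subheading, i.e. one with exactly two numbering levels (for example Tales.3.2. or Ogres-1.1. or Witches-1-4 or Tales.5.f).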

This would be my expected output: 


    Fairytales_expected_output = {'Titles': ['Fairy Tales',
                    'Tales.3.2.Dancing Shoes, ballgowns and frogs',
                    'Tales.2.4.6.Red Riding Hood',
                    'Fairies.1Your own Fairy godmother',
                    'Ogres-1.1.The wondrous world of Shrek',
                    'Witches-1-4Maleficient and the malicious curse',
                    'Tales.2.1.The big bad wolf',
                    'Tales.2.Little Red riding Hood',
                    'Tales.2.4.6.1.Why the huntsman is underrated',
                    'Tales.5.f.Cinderella and the pumpkin carriage',
                    'Ogres-1.Best Ogre in town',
                    'No.3.Great Expectations'],
                    'Subheading': ['NaN', 
                                   'Tales.3.2.Dancing Shoes, ballgowns and frogs',
                                   'NaN',
                                   'NaN',
                                   'Ogres-1.1.The wondrous world of Shrek',
                                   'Witches-1-4Maleficient and the malicious curse',
                                   'Tales.2.1.The big bad wolf',
                                   'NaN',
                                   'NaN',
                                   'Tales.5.f.Cinderella and the pumpkin carriage',
                                   'NaN',
                                   'NaN']}

    Fairytales_expected_output = pd.DataFrame.from_dict(Fairytales_expected_output)

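I have been struggling to find a way to make my pattern match only the subheadings. Whatever I try, first-level or third-level headings are still included. Another question deals with more or less the same problem, but it is in C# and I could not adapt it to my use case.

This is what I have tried so far: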

Fairytales_in['Subheading'] = Fairytales_in.Titles.str.extract(r'(^(?:\w+\.|\-\d{1}\.\d{1}\.)\W*(?:\w+\b\W*){1,100})$')

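But as you can see, it does not produce the expected result. I have been experimenting on regex101.com, but I have been stuck for two days now. Any help with fixing my pattern would be much appreciated!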

Tags: python, regex

Solution


You can use

rx = r'^(\w+(?:[.-](?:\d+|[a-zA-Z]\b)){2}(?![.-]?\d).*)'
Fairytales_in['Subheading'] = Fairytales_in['Titles'].str.extract(rx, expand=False)

See the regex demo.

Details

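  • ^ - start of the string
  • \w+ - one or more word characters
  • (?:[.-](?:\d+|[a-zA-Z]\b)){2} - exactly two occurrences of
    • [.-] - a dot or a hyphen
    • (?:\d+|[a-zA-Z]\b) - one or more digits, or a single ASCII letter followed by a word boundary
  • (?![.-]?\d) - a negative lookahead that fails the match if, immediately to the right of the current position, there is an optional . or - followed by a digit
  • .* - any zero or more characters other than line break characters, as many as possible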
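
To make the breakdown above easier to check by hand, here is a small sketch that restates the same pattern with re.VERBOSE and tries it on a few of the sample titles using only the standard library. The names SUBHEADING, samples and results are illustrative choices and not part of the original answer.

```python
import re

# The answer's pattern, spelled out with re.VERBOSE so each part carries a comment.
SUBHEADING = re.compile(r"""
    ^                       # start of the string
    (                       # capture the whole title
      \w+                   # heading prefix such as Tales or Ogres
      (?:                   # one numbering level:
        [.-]                #   a dot or a hyphen
        (?:\d+|[a-zA-Z]\b)  #   digits, or one ASCII letter ending at a word boundary
      ){2}                  # exactly two numbering levels
      (?![.-]?\d)           # refuse a further numeric level right after them
      .*                    # the rest of the title
    )
""", re.VERBOSE)

# A handful of titles from the question (hypothetical selection for illustration).
samples = [
    "Tales.3.2.Dancing Shoes, ballgowns and frogs",   # two levels
    "Tales.2.4.6.Red Riding Hood",                    # three levels
    "Tales.2.Little Red riding Hood",                 # one level
    "Tales.5.f.Cinderella and the pumpkin carriage",  # two levels, letter as second
]

# Map each title to whether the pattern accepts it as a subheading.
results = {title: bool(SUBHEADING.match(title)) for title in samples}

for title, matched in results.items():
    print(matched, title)
```

Only the two-level titles should come back as matches: the three-level title is rejected by the lookahead, and the one-level title fails because the letter L in Little is not followed by a word boundary.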

Pandas test:

>>> Fairytales_in['Titles'].str.extract(rx, expand=False)
0                                                NaN
1       Tales.3.2.Dancing Shoes, ballgowns and frogs
2                                                NaN
3                                                NaN
4              Ogres-1.1.The wondrous world of Shrek
5     Witches-1-4Maleficient and the malicious curse
6                         Tales.2.1.The big bad wolf
7                                                NaN
8                                                NaN
9      Tales.5.f.Cinderella and the pumpkin carriage
10                                               NaN
11                                               NaN
Name: Titles, dtype: object
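
Since the new column only ever holds the full title or a missing value, one possible alternative is to test each title with Series.str.match and keep the original value with Series.where. The following is a minimal sketch of that idea, not the answer's method: rx_prefix is a hypothetical variant of the pattern with the capturing group and trailing .* dropped, and the shortened titles Series is used only for illustration.

```python
import pandas as pd

# A shortened version of the Titles column from the question.
titles = pd.Series([
    'Fairy Tales',
    'Tales.3.2.Dancing Shoes, ballgowns and frogs',
    'Tales.2.4.6.Red Riding Hood',
    'Witches-1-4Maleficient and the malicious curse',
    'Ogres-1.Best Ogre in town',
], name='Titles')

# Same logic as rx, without the capturing group and the trailing .*,
# because str.match only has to test the start of each string.
rx_prefix = r'^\w+(?:[.-](?:\d+|[a-zA-Z]\b)){2}(?![.-]?\d)'

is_subheading = titles.str.match(rx_prefix)  # boolean Series, one flag per title
subheading = titles.where(is_subheading)     # keep matching titles, NaN elsewhere

print(subheading)
```

Keeping the boolean mask around can be convenient if the same test is needed again later, for example to filter rows with titles[is_subheading].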
