Figuring out RegEx search term

Problem description

I'm very new to this. I am using a regex to extract data from an HTML page that contains:

<p class="bold"> Last Statement:</p>
<p>Yes sir. I  would like to thank God, my dad, my Lord Jesus savior for saving me and changing  my life. I want to apologize to my in-laws for causing all this emotional pain.  I love y&rsquo;all and consider y&rsquo;all my sisters I never had. I want to thank you for  forgiving me. Thank you warden. </p>

I am trying to extract the text using

word = re.findall('Last Statement:</p>.*<p>(.+)</p>', x)

But it's giving me an empty list. How can I debug that?

Tags: python, html, regex

Solution


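You were almost there. By default, . does not match a newline, so .* cannot get past the line break between </p> and <p>, and the pattern never matches. Replacing .* with \s* (which matches any whitespace, including newlines) should make it work: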

word = re.findall(r'Last Statement:</p>\s*<p>(.+)</p>', x)

For example:

import re

if __name__ == "__main__":
    s = """
<p class="bold"> Last Statement:</p>
<p>Yes sir. I  would like to thank God, my dad, my Lord Jesus savior for saving me and changing  my life. I want to apologize to my in-laws for causing all this emotional pain.  I love y&rsquo;all and consider y&rsquo;all my sisters I never had. I want to thank you for  forgiving me. Thank you warden. </p>
        """
    # Raw string, so \s reaches the regex engine unchanged
    word = re.findall(r'Last Statement:</p>\s*<p>(.+)</p>', s)
    print(word)
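
To debug an empty result like this, it can help to run the candidate patterns side by side on a small sample. The following is a minimal sketch, assuming (as in the question) that a line break separates the two paragraphs; the shortened sample string and the variable names are made up for illustration.

```python
import re

# Shortened sample, assuming a newline separates the two <p> elements.
html = '<p class="bold"> Last Statement:</p>\n<p>Yes sir. Thank you warden. </p>'

# By default "." does not match a newline, so ".*" cannot bridge the line break.
original = re.findall(r'Last Statement:</p>.*<p>(.+)</p>', html)

# "\s*" matches any whitespace, including the newline.
fixed = re.findall(r'Last Statement:</p>\s*<p>(.+)</p>', html)

# re.DOTALL lets "." match newlines too; the non-greedy quantifiers
# keep the match from running past the first closing </p>.
dotall = re.findall(r'Last Statement:</p>.*?<p>(.+?)</p>', html, re.DOTALL)

print(original)
print(fixed)
print(dotall)
```

The first pattern prints an empty list, while the other two return the statement text. The re.DOTALL variant is an alternative to \s* when something other than whitespace may sit between the two tags.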

Since you are processing HTML, though, it might be better to use an HTML/XML parser with XPath to find the text you are interested in.
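
As a rough illustration of the parser approach, here is a sketch that uses the standard library's html.parser instead of an XML parser with XPath, so that it runs without third-party packages. The class name LastStatementParser and the shortened sample text are hypothetical, and the sketch assumes the statement is the first <p> after the paragraph whose text is "Last Statement:".

```python
from html.parser import HTMLParser


class LastStatementParser(HTMLParser):
    """Collects the text of the first <p> that follows the "Last Statement:" label."""

    def __init__(self):
        # The default convert_charrefs=True decodes entities such as &rsquo;
        super().__init__()
        self.in_p = False
        self.seen_label = False
        self.buffer = []
        self.statement = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "p" or not self.in_p:
            return
        self.in_p = False
        text = "".join(self.buffer).strip()
        if self.seen_label and self.statement is None:
            # First paragraph after the label: this is the statement.
            self.statement = text
        elif text == "Last Statement:":
            self.seen_label = True


# Shortened sample in the same shape as the HTML from the question.
html = """
<p class="bold"> Last Statement:</p>
<p>I love y&rsquo;all. Thank you warden. </p>
"""

parser = LastStatementParser()
parser.feed(html)
parser.close()
print(parser.statement)
```

Unlike the regex, this does not depend on the exact whitespace between the tags, and character references such as &rsquo; are decoded along the way.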

