I need to use RegEx to find a specific word in an HTML page

Problem description

I'm trying to extract a specific word (which may change) that comes after a fixed expression. I want to extract the name Taldor from this HTML:

<h4 class="t-16 t-black t-normal">
    <span class="visually-hidden">Company Name</span>
    <span class="pv-entity__secondary-title">Taldor</span>
</h4>

So far I am able to match what follows <h4 class="t-16 t-black t-normal"> using this regex:

(?<=<h4 class="t-16 t-black t-normal">).*

I would be glad for any kind of advice.

Tags: regex, regex-group, regex-greedy

Solution


I'd suggest using an HTML parsing library such as Jsoup in Java or Beautiful Soup in Python instead of regex, because a regex breaks easily when the markup's whitespace, attribute order, or nesting changes.

The following Java code, using Jsoup, does the job:

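import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExtractCompany {
    public static void main(String[] args) {
        String s = "<h4 class=\"t-16 t-black t-normal\">\r\n" +
                "    <span class=\"visually-hidden\">Company Name</span>\r\n" +
                "    <span class=\"pv-entity__secondary-title\">Taldor</span>\r\n" +
                "  </h4>";
        Document doc = Jsoup.parse(s);
        // Take the first element with the target class; selectFirst returns null if none matches
        Element element = doc.selectFirst(".pv-entity__secondary-title");
        if (element != null) {
            System.out.println(element.text());
        }
    }
}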

This prints:

Taldor

In the worst case, if you are doing quick and dirty work, you can use the following regex as a temporary solution, but it is not recommended.

<span class="pv-entity__secondary-title">(.*?)<\/span>

Use this regex and read your data from capture group 1.
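As a rough sketch of how that regex could be applied in Java with the standard java.util.regex classes: the class name RegexExtract and the method extractCompany are hypothetical names chosen for illustration, and the input is assumed to keep the span on a single line exactly as in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class RegexExtract {
    // Same pattern as above; the '/' does not need escaping in a Java regex.
    private static final Pattern COMPANY = Pattern.compile(
            "<span class=\"pv-entity__secondary-title\">(.*?)</span>");

    // Returns the text captured by group 1, or null when the span is not found.
    public static String extractCompany(String html) {
        Matcher m = COMPANY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<h4 class=\"t-16 t-black t-normal\">\n"
                + "    <span class=\"visually-hidden\">Company Name</span>\n"
                + "    <span class=\"pv-entity__secondary-title\">Taldor</span>\n"
                + "</h4>";
        System.out.println(extractCompany(html)); // Taldor
    }
}
```

Because `.` does not match line breaks by default, this only works while the opening tag, the name, and the closing tag stay on one line; any change to the attribute order or spacing will also break the match, which is why the parser approach is preferred.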
