I want to use a Python regular expression to find specific pieces of data in a string and exclude specific parts

Problem description

((http(s?):)./([a-z]).*/) is the regular expression I tried.

But from this string I want the directory, like this: /wp-content/uploads/2021/09/

and the image name, like this: VideoHive-Happy-Kids-Slideshow-Premiere-Pro-MOGRT-Free-Download-GetintoPC.com_-300x169.jpg

Tags: regex, web-scraping, regular-language

Solution

You can use 2 capture groups:

https?:\/\/[^/]*(\/wp-content\/uploads\/\d{4}\/\d{2}\/)([^\/\s]+)
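  • https?:\/\/[^/]* matches the protocol and everything up to the first / that follows it
  • ( opens capture group 1
    • \/wp-content\/uploads\/\d{4}\/\d{2}\/ matches /wp-content/uploads/ followed by 4 digits, /, 2 digits, and /
  • ) closes group 1
  • ([^\/\s]+) is capture group 2, matching 1 or more characters other than / or a whitespace character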


const s = `https://getintopc.com/wp-content/uploads/2021/09/VideoHive-Happy-Kids-Slideshow-Premiere-Pro-MOGRT-Free-Download-GetintoPC.com_-300x169.jpg https://getintopc.com/wp-content/uploads/2021/09/VideoHive-Happy-Kids-Slideshow-Premiere-Pro-MOGRT-Direct-Link-Free-Download-GetintoPC.com_-300x169.jpg https://getintopc.com/wp-content/uploads/2021/09/VideoHive-Happy-Kids-Slideshow-Premiere-Pro-MOGRT-Full-Offline-Installer-Free-Download-GetintoPC.com_-300x169.jpg https://getintopc.com/wp-content/uploads/2021/09/VideoHive-Happy-Kids-Slideshow-Premiere-Pro-MOGRT-Latest-Version-Free-Download-GetintoPC.com_-300x169.jpg`;
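// The g flag is required so that matchAll returns every URL in the string
const regex = /https?:\/\/[^/]*(\/wp-content\/uploads\/\d{4}\/\d{2}\/)([^\/\s]+)/g;
// Keep only the two capture groups: [directory, file name]
const res = Array.from(s.matchAll(regex), m => [m[1], m[2]]);
console.log(res);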
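
Since the question asks about Python, the same two-group pattern can be sketched with the standard `re` module. This is a minimal sketch and not part of the original answer: the variable names are chosen for illustration, and the sample string `s` is shortened to two of the URLs from the demo. In a Python raw string the forward slashes do not need escaping.

```python
import re

# Two of the sample URLs from the question, separated by a space
s = (
    "https://getintopc.com/wp-content/uploads/2021/09/"
    "VideoHive-Happy-Kids-Slideshow-Premiere-Pro-MOGRT-Free-Download-GetintoPC.com_-300x169.jpg "
    "https://getintopc.com/wp-content/uploads/2021/09/"
    "VideoHive-Happy-Kids-Slideshow-Premiere-Pro-MOGRT-Direct-Link-Free-Download-GetintoPC.com_-300x169.jpg"
)

# Group 1 captures the upload directory, group 2 the file name
pattern = re.compile(r"https?://[^/]*(/wp-content/uploads/\d{4}/\d{2}/)([^/\s]+)")

# With two capture groups, findall returns a list of (group 1, group 2) tuples
res = pattern.findall(s)
for directory, filename in res:
    print(directory, filename)
```

`re.finditer` would work equally well if the match positions are also needed.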

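Or a broader version that, for example, first matches the folders starting with a character in [a-z], then the folders made up of digits, and finally a last part ending in .jpg: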

https?:\/\/[^/]*((?:\/[a-z][^/]*)+(?:\/\d+)+\/)([^\/]+\.jpg)
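
As a rough Python sketch of this broader pattern (again using the standard `re` module), the snippet below applies it to one URL from the question and to a second URL that is purely hypothetical, made up here to show that other folder names also match. The names `broad`, `urls`, and `results` are illustrative.

```python
import re

urls = [
    # URL taken from the question
    "https://getintopc.com/wp-content/uploads/2021/09/"
    "VideoHive-Happy-Kids-Slideshow-Premiere-Pro-MOGRT-Free-Download-GetintoPC.com_-300x169.jpg",
    # Hypothetical URL, invented for illustration only
    "https://example.com/media/img/2020/1/photo.jpg",
]

# Group 1: one or more folders starting with a-z, then one or more all-digit folders
# Group 2: the last path part, which must end in .jpg
broad = re.compile(r"https?://[^/]*((?:/[a-z][^/]*)+(?:/\d+)+/)([^/]+\.jpg)")

# Collect (directory, file name) for every URL that matches; non-matching URLs are skipped
results = [m.groups() for u in urls if (m := broad.search(u))]
for directory, filename in results:
    print(directory, filename)
```

Note that `[a-z]` is lowercase only, so a folder that starts with an uppercase letter would not match unless `re.IGNORECASE` is passed to `re.compile`.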
