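python - How do I split, match, and output strings for specific patterns?
Problem description
I am trying to solve a problem that I already solved in PHP, and I don't know how to do the same thing in Python.
In the three rows below, we would like to match based on these two rules:
Only vine.co and twitter.com URLs (other domains should be ignored)
Only URLs that come before a comma (the last URL in each row should be ignored)
Input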
Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1
The output should be an array (list) in Python; the output shown here comes from PHP:
array(3) {
[0]=>
string(30) "https://vine.co/v/5W2Dg3XPX7a
"
[1]=>
string(64) "https://twitter.com/dog_rates/status/836677758902222849/photo/1
"
[2]=>
string(63) "https://twitter.com/dog_rates/status/835264098648616962/photo/1"
}
PHP code:
$input = 'Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1';
// Split the input into rows on the "Row N: " prefix.
$array = preg_split('/Row\s\d:\s/s', $input);
$output = array();
foreach ($array as $key => $value) {
    if (strlen($value) > 1) {
        $URL_arrays = explode(',', $value);
        foreach ($URL_arrays as $key => $value) {
            // Note: `=` assigns instead of comparing; `==` was probably intended.
            if ($key = sizeof($URL_arrays) - 1) {
                unset($URL_arrays[sizeof($URL_arrays) - 1]);
            } else {
                $match = preg_match('/twitter\.com|vine\.co/s', $value);
                if ($match) {
                    array_push($output, $value);
                }
            }
        }
    }
}
var_dump($output);
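This question is based on this RegEx question; you can answer either one.
Solution
You can use this regular expression to capture every URL on the vine.co or twitter.com domain that is followed by a comma: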
https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)
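The key is the positive lookahead (?=,), which ensures that the URL is immediately followed by a comma without making the comma part of the match. The last URL in each row has no comma after it, so it is skipped, as you wanted.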
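To see what the lookahead contributes, here is a minimal sketch that runs the same domain filter with and without (?=,). The sample row and its short paths are made up for illustration and are not part of the question's data.

```python
import re

# Hypothetical sample row (not from the question's input).
row = "https://vine.co/a,https://twitter.com/b,https://vine.co/c"

# Same domain filter, once with the trailing lookahead and once without it.
with_lookahead = re.findall(r"https://(?:vine\.co|twitter\.com)[^,\s]*(?=,)", row)
without_lookahead = re.findall(r"https://(?:vine\.co|twitter\.com)[^,\s]*", row)

print(with_lookahead)     # ['https://vine.co/a', 'https://twitter.com/b']
print(without_lookahead)  # all three URLs, including the last one
```

Because a lookahead only asserts and does not consume, the matched strings contain no trailing comma.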
Python code that extracts the URLs with re.findall:
import re
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''
print(re.findall(r'https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)', s))
Output:
['https://vine.co/v/5W2Dg3XPX7a', 'https://twitter.com/dog_rates/status/836677758902222849/photo/1', 'https://twitter.com/dog_rates/status/835264098648616962/photo/1']
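If you would rather mirror the split-based structure of the PHP attempt than use a single regex, something like the following may work. It is only a sketch: the function name `urls_before_last` and the stricter domain pattern (which requires a `/` right after the host) are my own assumptions, not part of the original answer.

```python
import re

# Assumed helper pattern: the host must be vine.co or twitter.com, optionally
# prefixed with "www.", and must be followed directly by "/".
WANTED = re.compile(r"^https://(?:www\.)?(?:vine\.co|twitter\.com)/")

def urls_before_last(text):
    """Return vine.co / twitter.com URLs, skipping the last URL of each row."""
    result = []
    for line in text.splitlines():
        # Drop the "Row N: " prefix if the line has one.
        line = re.sub(r"^Row\s\d+:\s", "", line.strip())
        if not line:
            continue
        urls = line.split(",")
        # Every URL except the last one in a row is followed by a comma.
        for url in urls[:-1]:
            if WANTED.match(url):
                result.append(url)
    return result

s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''

print(urls_before_last(s))
```

For this input it returns the same three URLs as the re.findall version. The slice `urls[:-1]` drops the final URL of each row, which plays the role of the (?=,) lookahead.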