Remove punctuation from a string with PHP, but not inside contractions

Problem description

I'm writing code that breaks text into words and does things like calculating word lengths.

I came up with this (after some searching):

$text = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);
$words = mb_split( ' +', $text );

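However, contractions don't survive this, because an apostrophe and a single quote look the same (they are the same character), so the apostrophe inside a contraction is stripped along with the rest of the punctuation.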
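To make the failure concrete, here is a minimal sketch of what the snippet above does to a contraction. The helper name `naive_split` and the sample sentence are illustrative assumptions, and `preg_split` stands in for `mb_split` so the sketch does not depend on the mbstring extension.

```php
<?php

// Hypothetical helper wrapping the approach from the question.
function naive_split(string $text): array
{
    // Replace everything that is not alphanumeric or whitespace with a space.
    // The apostrophe in "wasn't" is punctuation too, so it is replaced as well.
    $clean = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);

    // Split on runs of whitespace and drop empty pieces.
    return preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
}

// "wasn't" comes back as two separate words, "wasn" and "t".
print_r(naive_split("It wasn't a dream."));
```

This is why stripping all punctuation first and splitting afterwards cannot keep contractions: by the time the text is split, the apostrophe is already gone.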

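I need a way to separate the words while keeping contractions intact. For now I have added every contraction I can think of to my stop-word list, but that is most unsatisfactory. I'm not good with regular expressions and need some advice.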

Although I have posted my own inelegant solution, I am leaving this question open in the hope of encouraging a better answer.

Tags: php, regex

Solution


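I found a better way: by combining word boundaries with the set of characters allowed inside a word, the words can be matched and counted directly: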

<?php

$text = "One morning, when Gregor Samsa woke from troubled dreams, 
he found himself transformed in his bed into a horrible vermin. 
'He lay on his armour-like back', and if he lifted his head a 
little he could see his brown belly, slightly domed and divided by arches
into stiff sections. The bedding was hardly able to cover it and 
seemed ready to slide off any moment. His many legs, pitifully thin 
compared with the size of the rest of him, waved about helplessly as he 
looked. \"What's happened to me?\" he thought. It wasn't a dream. His 
room, a proper human room although a little too small, lay peacefully
between its four familiar walls. A collection of textile samples lay 
spread out on the table - Samsa was a travelling salesman - and 
above it there hung a picture that he had recently cut out of an 
illustrated magazine and housed in a nice, gilded frame. It showed 
a lady fitted out with a fur hat and fur boa who sat upright, 
raising a heavy fur muff that covered the whole of her lower arm 
towards the viewer. Gregor then turned to look out the window at the 
dull weather";

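// Match runs of word characters, apostrophes and hyphens that begin and end
// on a word boundary, so surrounding quotes and stray dashes are left out.
preg_match_all("/\b[\w'-]+\b/", $text, $words);

// $words[0] holds the full matches; print how many words were found.
print_r(count($words[0]));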

Note: I allow - and ' to appear inside words, so something like "armour-like" counts as one word.
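Since the question was also about measuring words, here is a small sketch that reuses the same pattern to list each word with its length. The helper name `extract_words` and the shortened sample text are assumptions made for illustration.

```php
<?php

// Hypothetical helper: return every word matched by the answer's pattern.
function extract_words(string $text): array
{
    preg_match_all("/\b[\w'-]+\b/", $text, $matches);
    return $matches[0];
}

$sample = "\"What's happened to me?\" he thought. It wasn't a dream. "
        . "'He lay on his armour-like back'";

$words = extract_words($sample);

// Print each word with its length in bytes.
// For multibyte text, mb_strlen() (mbstring extension) would count characters instead.
foreach ($words as $word) {
    printf("%-12s %d\n", $word, strlen($word));
}

echo count($words), " words\n";
```

The quotes around 'He ... back' are dropped because there is no word boundary between a space and a quote, while the apostrophes in "What's" and "wasn't" sit between two letters and stay inside the match.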

Regex test: regexr.com/4ego6
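One caveat with the pattern `\b[\w'-]+\b`: it accepts any run of the allowed characters, so a string such as "well--almost" is matched as a single word, and without the `u` modifier accented letters may not be treated as word characters. The following is a possible stricter variant, not part of the original answer: it only allows an apostrophe or hyphen between two letter or digit runs, and it also accepts the typographic apostrophe ’ (U+2019). The function name and the sample text are assumptions, and the source file is assumed to be saved as UTF-8.

```php
<?php

// Hypothetical stricter variant: letters/digits, optionally joined by a
// single apostrophe (straight or typographic) or hyphen.
function extract_words_strict(string $text): array
{
    // \p{L} = any Unicode letter, \p{N} = any Unicode number; /u enables UTF-8 mode.
    $pattern = "/[\p{L}\p{N}]+(?:['’-][\p{L}\p{N}]+)*/u";
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$sample = "It wasn’t a dream -- well--almost. 'Café-au-lait', naïve";

print_r(extract_words_strict($sample));
```

With this variant "well--almost" is split into "well" and "almost", while "wasn’t" and "Café-au-lait" each remain one word. The trade-off is that a trailing apostrophe, as in a plural possessive, is not kept as part of the word.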

