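php - Remove punctuation from a string with PHP, but not inside contractions
Question
I am writing code that breaks text into words and then does things such as computing word lengths.
After some searching, I came up with this: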
$text = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);
$words = mb_split( ' +', $text );
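However, contractions do not survive this, because an apostrophe and a single quote look the same (they are the same character). The pattern replaces both with a space, so "wasn't" is split into "wasn" and "t".
I need a way to separate the words while keeping contractions intact. For now I have added every contraction I could think of to my stop words, but that is far from satisfactory. I am not good with regular expressions and would appreciate some advice.
Although I have posted my own inelegant solution, I am leaving this question open in the hope of encouraging a better answer.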
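Answer
I found a better way. Using word boundaries together with the set of characters allowed inside a word, the words can be matched and counted directly: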
<?php
$text = "One morning, when Gregor Samsa woke from troubled dreams,
he found himself transformed in his bed into a horrible vermin.
'He lay on his armour-like back', and if he lifted his head a
little he could see his brown belly, slightly domed and divided by arches
into stiff sections. The bedding was hardly able to cover it and
seemed ready to slide off any moment. His many legs, pitifully thin
compared with the size of the rest of him, waved about helplessly as he
looked. \"What's happened to me?\" he thought. It wasn't a dream. His
room, a proper human room although a little too small, lay peacefully
between its four familiar walls. A collection of textile samples lay
spread out on the table - Samsa was a travelling salesman - and
above it there hung a picture that he had recently cut out of an
illustrated magazine and housed in a nice, gilded frame. It showed
a lady fitted out with a fur hat and fur boa who sat upright,
raising a heavy fur muff that covered the whole of her lower arm
towards the viewer. Gregor then turned to look out the window at the
dull weather";
preg_match_all("/\b[\w'-]+\b/", $text, $words);
print_r(count($words[0]));
Note: I allow - and ' inside words, so something like "armour-like" counts as one word.
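To see how the pattern treats quotes versus contractions, here is a minimal sketch on a short sample sentence. The sample is made up for illustration and is not from the original post. It also computes word lengths, since that was the stated goal in the question, and assumes the mbstring extension is available.

```php
<?php
// Hypothetical sample sentence, chosen to contain a quoted phrase,
// a hyphenated word and a contraction.
$sample = "'He lay on his armour-like back', and it wasn't a dream.";

// Same pattern as above: runs of word characters, apostrophes and hyphens
// that begin and end at a word boundary.
preg_match_all("/\b[\w'-]+\b/", $sample, $matches);
$words = $matches[0];

// Length of each matched word, in characters.
$lengths = array_map('mb_strlen', $words);

print_r($words);   // He, lay, on, his, armour-like, back, and, it, wasn't, a, dream
print_r($lengths); // 2, 3, 2, 3, 11, 4, 3, 2, 6, 1, 5
```

The quote marks around the phrase are dropped because \b needs a word character on one side, so a match cannot start or end on an apostrophe. The same rule means a word that legitimately starts with an apostrophe, such as 'tis, is matched as "tis".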
Regex test: regexr.com/4ego6
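The pattern in the question used the /u modifier, while the one above does not. Without /u, \w normally matches only ASCII letters, digits and underscore, so accented words may be cut short. The following is one possible Unicode-aware variant, offered as a sketch rather than as part of the original answer. It assumes UTF-8 input, and the choice to also accept the typographic apostrophe (’) is my own assumption.

```php
<?php
// Hypothetical sample with a typographic apostrophe and accented letters.
$sample = "It wasn’t Zoë's café-au-lait.";

// One or more letters/digits, optionally followed by groups of
// (apostrophe or hyphen + letters/digits). The /u modifier makes PCRE
// treat the subject and the pattern as UTF-8.
$pattern = "/[\p{L}\p{N}]+(?:['’-][\p{L}\p{N}]+)*/u";

preg_match_all($pattern, $sample, $matches);
$words = $matches[0];

// Character counts, not byte counts.
$lengths = array_map(function ($w) {
    return mb_strlen($w, 'UTF-8');
}, $words);

print_r($words);   // It, wasn’t, Zoë's, café-au-lait
print_r($lengths); // 2, 6, 5, 12
```

Unlike [\w'-], this variant only accepts apostrophes and hyphens between letters or digits, and it does not treat the underscore as a word character. Whether that is desirable depends on the text being processed.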