Extract Decimal and Integer numbers from sentences using Perl

Problem description

I have sentences including letters, integers, and decimals.

Example:

There are 1.6mm, 2.1mmcycst.
There are many about 3mm cysts.
There are 2 cysts about 4~5mm.

("2.1mm cyst" or "2.1 mm cysts" would be the correct wording, but our data contains "2.1mmcycst".)

From these sentences, I want to extract the numbers. For example:

1.6 and 2.1
3
4~5

I'm not familiar with regular expressions, and I cannot extract only the numbers, including decimals and related symbols (e.g., "~").

Here is the code:

#!/usr/bin/perl

my $qwe = "There are 1.6mm, 2.1mmcycst.";
print "$qwe\n";
if($qwe =~ /\d+(\.\d)?\d*/){
    print "$&\n";
}

From the script, I got the output below:

1.6

I am expecting 1.6 and 2.1.

How can I change my regex to match multiple occurrences in a single line?

I use macOS 10.14.5 and perl v5.18.4.

Tags: regex, perl

Solution


Do not reinvent the wheel. If a task seems common, there is probably a Perl module for it. Regexp::Common provides regular expressions for many common patterns, including numbers of various kinds. For example, your sample input can be extended with more complex examples of numbers, all of which can be parsed as shown below:

Create the input:

cat > in.txt <<EOF
There are 1.6mm, 2.1mmcycst.
There are many about 3mm cysts.
There are 2 cysts about 4~5mm.
The collection has 1.23E6 frozen cysts, stored at -70.5C, with cysts ranging in size from 1e-3m to 5.12E-3
EOF

Parse and print the real numbers:

perl -MRegexp::Common -lne 'print join " ", /($RE{num}{real})/g;' in.txt

Output:

1.6 2.1
3
2 4 5
1.23E6 -70.5 1e-3 5.12E-3

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-MRegexp::Common : same as BEGIN { use Regexp::Common; }.

/($RE{num}{real})/g : Capture all real numbers in the input line $_. The parentheses capture the matched text, and /.../g makes the match repeat across the whole line. In list context, imposed here by join, the match returns the list of all captured matches, which are then printed.
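The same list-context /g idea can be applied directly to the script from the question, without any module. The following is only a minimal sketch: the helper name extract_numbers is made up for illustration, and the pattern covers plain integers and decimals only (no signs or exponents). The group is written as (?:...) because a capturing group would make a list-context /g match return the group contents instead of the whole number.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every integer or decimal found in a string.
# (?:...) is non-capturing, so in list context /g returns the whole matches
# rather than the contents of a capture group.
sub extract_numbers {
    my ($text) = @_;
    return $text =~ /\d+(?:\.\d+)?/g;
}

my $qwe = "There are 1.6mm, 2.1mmcycst.";
my @numbers = extract_numbers($qwe);
print join(" and ", @numbers), "\n";    # prints "1.6 and 2.1"
```

The original if(...) tests the match once in scalar context, which is why only the first number was printed.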

SEE ALSO:

perldoc perlrun: how to execute the Perl interpreter: command line switches


Note: you need to install the Regexp::Common Perl module; it is not part of the standard Perl library.
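If installing a module is not an option, a hand-written pattern in core Perl may be enough for data like this. The sketch below is an assumption-laden alternative, not a replacement for Regexp::Common: the $num pattern and the extract_tokens helper are invented for illustration, and the pattern only handles an optional sign, an optional fraction, and an optional exponent. Unlike the one-liner above, it keeps a range such as 4~5 together as one token, which is what the question asked for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One number: optional sign, digits, optional fraction, optional exponent.
# Assumption: a "-" or "+" directly before a digit is always a sign.
my $num = qr/[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?/;

# Hypothetical helper: return the numbers in a line, keeping "a~b" as one token.
sub extract_tokens {
    my ($line) = @_;
    return $line =~ /$num(?:~$num)?/g;
}

my @lines = (
    "There are 1.6mm, 2.1mmcycst.",
    "There are many about 3mm cysts.",
    "There are 2 cysts about 4~5mm.",
    "The collection has 1.23E6 frozen cysts, stored at -70.5C, with cysts ranging in size from 1e-3m to 5.12E-3",
);

# Print one line of output per input sentence.
print join(" ", extract_tokens($_)), "\n" for @lines;
```

For the four sample sentences this prints "1.6 2.1", "3", "2 4~5" and "1.23E6 -70.5 1e-3 5.12E-3". Because a hyphen before a digit is read as a sign, a hyphenated range like 4-5 would come out as 4 and -5, so the sign part of the pattern would need adjusting for such data.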

