How to handle a tilde / swung dash (~) in a regular expression in order to exclude temporary MS Office files?

Problem description

I have a batch job defined in XML that is scheduled by a job scheduling engine. This engine can watch directories for changes to their content. My task is to monitor directories on a file exchange server running Windows, where customers and clients upload files we need to process.

We need to know about the arrival of new files as soon as possible.

I have to put a regular expression into that XML job so that subdirectories and temporary files are not matched.

In most cases, customers and clients upload files formatted as text/csv/pdf, which don't cause any problems. Some upload MS Office files, which become a problem as soon as someone opens them in the directory: a hidden temporary file whose name begins with ~$ is then created.

According to the documentation of the scheduling engine, the regex follows the POSIX 1003.2 standard. However, I am not able to prevent notifications being sent when someone opens an MS Office file in a monitored directory.

The regular expressions I have tried so far are:

First try before even noticing temporary office files:

^[a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$

Second try, intention was excluding a leading ~:

^[^~][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$

Third try, intention was excluding a leading ~ by its character code:

^[^\x7e][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$

Fourth try, intention was excluding a leading ~ by its character code with a capital E:

^[^\x7E][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$

None of them stops notifications from being sent when a file is opened…

Does anyone have any idea what to do? All suggestions and alternatives are welcome.

I even checked them at regex101, regexplanet.com, regexr.com and regextester.com, where the second try matched exactly as desired. I also made sure to select POSIX compilation wherever a site offered it (not all do).

How can I exclude the ~ character from matching the regular expression (at the beginning of a file name)?

Short version:

How can I create a regular expression that matches any file with any extension apart from .part, but matches neither the file thumbs.db nor any file whose name begins with a ~?

Requirements: What should not be matched:

Subfolders (my approach was files without a .),

Thumbs.db (Windows thumbnails db),

*.part (filezilla partial uploads),

~$. (temporary files starting with ~ or ~$, MS Office tmp files)
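The four requirements above can be restated as a plain predicate. The following is only a sketch in Python, outside the scheduling engine, to pin down the intended behaviour; the function name `should_notify` is hypothetical, and comparing thumbs.db case-insensitively is an assumption based on the server running Windows.

```python
def should_notify(name: str) -> bool:
    """Return True if a directory entry should trigger a notification.

    Hypothetical helper that mirrors the requirements listed above.
    """
    lower = name.lower()  # assumption: Windows file names compare case-insensitively
    if "." not in name:
        return False      # subfolders (approximated as names without a dot)
    if lower == "thumbs.db":
        return False      # Windows thumbnail database
    if lower.endswith(".part"):
        return False      # FileZilla partial uploads
    if name.startswith("~"):
        return False      # temporary files such as ~$report.docx
    return True


if __name__ == "__main__":
    for candidate in ["report.csv", "Thumbs.db", "upload.zip.part",
                      "~$report.docx", "subfolder"]:
        print(candidate, should_notify(candidate))
```

The sample names are made up for illustration. Any regular expression for the job should agree with this predicate on the same inputs.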

The following list provides some files and folders that must be matched or not matched by the regex:

New Problems occurred while trying to find the regex

A few problems came up after the creation of this question when I tried to apply the actually correct regex stated in the answer given by @Bohemian. I wasn't aware of those problems, so I just add them here for completeness.

The first one occurred because certain characters in the regex are not allowed in XML. The XML file is parsed by a Java class that throws an exception when it encounters < and > inside an attribute value; they are forbidden in XML documents unless they belong directly to XML nodes (valid: <xml-node>...</xml-node>, invalid: attribute="<ome_on, why isn't this VALI|>").

This can be avoided by using the entity references &lt; instead of < and &gt; instead of >.

The second (and currently unresolved) issue is that the engine rejects an operand in the actually correct regular expression ^(?=.*\.)(?!thumbs.db$)[^~].*(?&lt;!\.part)$. The engine says:
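The escaping does not have to be done by hand. As a minimal sketch, assuming the XML line is generated or checked with Python's standard library (the helper name `build_element` is hypothetical), `xml.sax.saxutils.quoteattr` produces an attribute value with the special characters already replaced:

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

# The regex from the question, written with a literal '<'.
PATTERN = r"^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$"


def build_element(directory: str, regex: str) -> str:
    """Build the job's XML element with properly escaped attribute values."""
    return (f"<start_when_directory_changed directory={quoteattr(directory)} "
            f"regex={quoteattr(regex)} />")


xml_line = build_element(r"F:\someDirectory", PATTERN)
print(xml_line)

# Round trip: an XML parser turns the entity back into the original character.
parsed = ET.fromstring(xml_line)
print(parsed.get("regex") == PATTERN)
```

The round trip shows that the entity only exists in the file: the regex the engine receives after parsing contains the plain < again.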

Error: 2018-08-17T06:05:46Z REGEX-13

[repetition-operator operand invalid, ^(?=.*\.)(?!thumbs.db$)[^~].*(?&lt;!\.part)$]


The corresponding line in the xml file looks like this:

<start_when_directory_changed directory="F:\someDirectory" regex="^(?=.*\.)(?!thumbs.db$)[^~].*(?&lt;!\.part)$" />

Now I am stuck again, because my knowledge of regular expressions is pretty low. It is so low that I don't even have any idea which character could be the criticized operand in the regex.

Research has brought me to this question whose accepted answer states "POSIX regexes don't support using the question mark ? as a non-greedy (lazy) modifier to the star and plus quantifiers (…)", which gives me an idea about what is wrong with the great regex. Still, I am not able to provide a working regex, more research will have to follow…
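One plausible reading of the error, not confirmed by the engine's documentation: POSIX regular expressions have no lookahead or lookbehind, so in (?=, (?! and (?<! the ? is read as a repetition operator that directly follows ( and therefore has nothing to repeat. The sketch below (Python, with the hypothetical helper `find_perl_extensions`) lists the places in a pattern where such Perl-style groups begin; it ignores corner cases such as an escaped backslash directly in front of the parenthesis.

```python
import re

# An unescaped "(?" followed by the usual Perl-style group markers.
PERL_GROUP = re.compile(r"(?<!\\)\(\?[=!<:]*")


def find_perl_extensions(pattern: str) -> list[tuple[int, str]]:
    """Return (offset, construct) for every Perl-style group opener found."""
    return [(m.start(), m.group()) for m in PERL_GROUP.finditer(pattern)]


if __name__ == "__main__":
    rejected = r"^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$"
    for offset, construct in find_perl_extensions(rejected):
        print(offset, construct)
```

Each reported construct would have to be replaced by something POSIX can express before the engine can accept the pattern.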

Tags: regex, posix, regex-negation, job-scheduling

Solution


POSIX ERE doesn't offer a simple way to exclude a particular string from matching. You can disallow a particular character (with [^.part] you are matching a single character which is not (newline or) dot or p or a or r or t, not the string ".part"), and you can specify alternations, but those are very cumbersome to combine into an expression which excludes some particular patterns.

Here's how to do it with alternations alone, but as you can see, it's not very readable.
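To see the bracket-expression behaviour in isolation, here is a small sketch using Python's `re` module as a stand-in (it is not a POSIX engine, but it treats this particular bracket expression the same way). The helper name and the sample extensions are made up for illustration.

```python
import re

# [^.part] is a set of five excluded characters, not the string ".part".
NOT_DOT_P_A_R_T = re.compile(r"[^.part]*")


def passes_bracket(text: str) -> bool:
    """True if every character of text is outside the set { . p a r t }."""
    return NOT_DOT_P_A_R_T.fullmatch(text) is not None


if __name__ == "__main__":
    for ext in ["csv", "docx", "txt", "pdf", "rar", "part"]:
        print(ext, passes_bracket(ext))
```

Here csv and docx pass, while txt, pdf and rar are rejected together with part, because each of them contains at least one of the excluded letters.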

^([^~t.]|t($|[^h])|th($|[^u])|thu($|[^m])|thum($|[^b])|thumb($|[^s])|thumbs($|[^.])|thumbs\.($|[^d])|thumbs\.d($|[^b])|\.($|[^p])|\.p($|[^a])|\.pa($|[^r])|\.par($|[^t]))+$

... and it still probably doesn't do exactly what you want.
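A quick way to find out where it falls short is to run it against a handful of names. The sketch below uses Python's `re` module as a stand-in for the scheduler's POSIX engine, which is an assumption: the expression only uses constructs both understand, but the real engine may still differ. The file names are invented for the test.

```python
import re

# The expression from above, split over several lines for readability.
ANSWER_ERE = re.compile(
    r"^([^~t.]"
    r"|t($|[^h])|th($|[^u])|thu($|[^m])|thum($|[^b])|thumb($|[^s])"
    r"|thumbs($|[^.])|thumbs\.($|[^d])|thumbs\.d($|[^b])"
    r"|\.($|[^p])|\.p($|[^a])|\.pa($|[^r])|\.par($|[^t]))+$"
)


def matches(name: str) -> bool:
    """True if the alternation-based expression accepts the file name."""
    return ANSWER_ERE.search(name) is not None


if __name__ == "__main__":
    samples = ["report.csv", "thumbs.db", "Thumbs.db", "upload.part",
               "~$report.docx", "subfolder", "report.part"]
    for name in samples:
        print(f"{name!r:18} {matches(name)}")
```

In this test the expression rejects thumbs.db, upload.part and ~$report.docx as intended. It still accepts Thumbs.db (the match is case-sensitive), subfolder (no dot is required) and report.part, where the alternative t($|[^h]) consumes the "t." pair so that the .part check never sees the dot.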

