PHP preg_replace: Latin characters only

Problem description

I need to filter strings when I build the XML for Italian electronic invoices.

Only characters from a specific type are accepted:

String1000LatinType
"[\p{IsBasicLatin}\p{IsLatin-1Supplement}]{1,1000}"

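I am not comfortable with this range definition, but I believe it covers:

a-z, A-Z, 0-9, accented letters such as à ò ù è é ì ç, symbols such as , . _ - : ; ' and the space character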

I want to exclude every other symbol that can be typed directly on the keyboard, for example: "£$%&/()=?^°§*+\|/<> and tab
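For reference, the XSD pattern quoted above can be approximated in PHP. PCRE does not support block names such as \p{IsBasicLatin}, so this sketch writes the two blocks as the code point range U+0000 to U+00FF. The function name is_string1000_latin is hypothetical, and the anchors are added because XSD patterns always match the whole value.

```php
<?php
// Hypothetical helper: checks a value against the String1000LatinType pattern.
// Basic Latin is U+0000-U+007F and Latin-1 Supplement is U+0080-U+00FF,
// so both blocks together are the range U+0000-U+00FF.
function is_string1000_latin(string $value): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false for invalid UTF-8, which also fails the check.
    return preg_match('/\A[\x{0000}-\x{00FF}]{1,1000}\z/u', $value) === 1;
}

var_dump(is_string1000_latin("Citt\xC3\xA0 12345"));       // bool(true)
var_dump(is_string1000_latin("Price: 10 \xE2\x82\xAC"));   // bool(false), U+20AC is out of range
var_dump(is_string1000_latin(""));                         // bool(false), one character minimum
```

This only validates; it does not remove anything. Note that the range is wider than the letters, digits and punctuation listed above, since it also contains symbols such as £ and °.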

I tried to do the conversion with this function, but I am no regex expert:

function sanitize($tag) {

$newtag = preg_replace ("/[\p{Latin}A-Z0-9a-z\-\_\.\,\:\;' ]/", "", $tag);

return $newtag;

}

$tag = "Qwerty 12345 £$%&/()=?^ èéòàùì +*°ç.,-_<>\/l'èok .,;:";

var_dump(sanitize($tag));

Can anyone help me?

I want to get:

Qwerty 12345  èéòàùì ç.,-_l'èok .,;:
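One likely reason the attempt above does the opposite of what is wanted: the character class is not negated, so preg_replace deletes the allowed characters and keeps the rest. A minimal sketch of a corrected version follows, assuming the input is UTF-8. The name sanitize_latin is hypothetical. Be aware that \p{Latin} is a script property, so it also accepts Latin letters outside Latin-1 (for example Š), which the XSD type would still reject.

```php
<?php
// Hypothetical corrected version of the question's sanitize() function.
function sanitize_latin($tag)
{
    // [^...] negates the class: everything that is NOT a Latin letter,
    // a digit, one of - _ . , : ; ' or a space is removed.
    // /u makes PCRE read the pattern and the subject as UTF-8.
    return preg_replace("/[^\p{Latin}0-9\-_.,:;' ]/u", "", $tag);
}

$tag = "Qwerty 12345 £$%&/()=?^ èéòàùì +*°ç.,-_<>\/l'èok .,;:";

echo sanitize_latin($tag), "\n";
// Qwerty 12345  èéòàùì ç.,-_l'èok .,;:
```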

Tags: php, regex

Solution


After some testing, I wrote this function, which does what I need:

function sanitize_string_xml($string, $opzioni = array()) {

    $chr_map = array(
       // Windows codepage 1252
       "\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
       "\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
       "\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
       "\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
       "\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
       "\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
       "\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
       "\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

       // Regular Unicode     // U+0022 quotation mark (")
                              // U+0027 apostrophe     (')
       "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
       "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
       "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
       "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
       "\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
       "\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
       "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
       "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
       "\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
       "\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
       "\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
       "\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
    );

    $type = isset($opzioni['Type']) ? $opzioni['Type'] : "";    // IsBasicLatin /IsLatin

    $lunghezzaMax = isset($opzioni['LunghezzaMax']) ? $opzioni['LunghezzaMax'] : "";

    if ( $type == "IsBasicLatin" ) {

        $unwanted_array = array(    'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                            'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                            'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                            'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                            'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', "ü" => "u", 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );

        $string = strtr( $string, $unwanted_array );

        $string = preg_replace('/[^\x{0020}-\x{007E}]+/u', '', $string);

    }

    if ( $type == "IsLatin" ) {

        $unwanted_array = array(  'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z' );

        $string = strtr( $string, $unwanted_array );

        $string = preg_replace('/[^\x{0020}-\x{007E}\x{00A0}-\x{00FF}]+/u', '', $string);

    }

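    // Convert quotation marks outside the allowed range into permitted quotes.

    $chr = array_keys  ($chr_map); // for efficiency, these two arrays
    $rpl = array_values($chr_map); // could be pre-calculated once

    $string = str_replace($chr, $rpl, html_entity_decode($string, ENT_QUOTES, "UTF-8"));

    $string = str_replace(PHP_EOL, " ", $string);

    // Truncate before escaping, and by characters rather than bytes, so that
    // neither a multibyte character nor an entity such as &amp; is cut in half.
    // mb_substr() requires the mbstring extension.
    if ( $lunghezzaMax != "" ) {
        $string = mb_substr($string, 0, $lunghezzaMax, "UTF-8");
    }

    $string = htmlspecialchars($string);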

    return $string;

}

Usage example:

$clear_string = sanitize_string_xml($dirty_string, array("Type" => "IsLatin", "LunghezzaMax" => 60));
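One thing to watch in the function above: the range filter runs before html_entity_decode, so an entity such as &euro; passes the filter as plain ASCII and only afterwards turns into a character outside the range. The following is a sketch of a possible reordering, not the answer's own function. The name sanitize_latin1_xml, its parameters and the shortened quote map are assumptions for illustration, and mb_substr assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical variant: same building blocks as the answer, different order
// (decode, map quotes, filter, truncate, escape).
function sanitize_latin1_xml(string $value, int $maxLength = 0): string
{
    // Typographic quotes with no Latin-1 code point, mapped to ASCII quotes.
    // Shortened map for illustration only.
    $quotes = array(
        "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
        "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
        "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
        "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
    );

    // 1. Decode entities first so the filter sees the real characters.
    $value = html_entity_decode($value, ENT_QUOTES, 'UTF-8');
    // 2. Map quotes before the filter would delete them.
    $value = strtr($value, $quotes);
    // 3. Turn line breaks into spaces, then drop everything outside
    //    U+0020-U+007E and U+00A0-U+00FF. preg_replace() returns null
    //    on invalid UTF-8, hence the fallback to an empty string.
    $value = str_replace(array("\r\n", "\r", "\n"), ' ', $value);
    $value = preg_replace('/[^\x{0020}-\x{007E}\x{00A0}-\x{00FF}]+/u', '', $value) ?? '';
    // 4. Truncate by characters, before escaping.
    if ($maxLength > 0) {
        $value = mb_substr($value, 0, $maxLength, 'UTF-8');
    }
    // 5. Escape for XML as the last step.
    return htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo sanitize_latin1_xml("Caff\xC3\xA8 &euro; \xE2\x80\x9CRoma\xE2\x80\x9D & C.", 20), "\n";
// Caffè  &quot;Roma&quot; &amp; C.
```

With this order the length limit applies to the unescaped text, which is how an XSD length restriction counts characters.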
