首页 > 解决方案 > 通过 PHP 将西里尔 HTML 标签转换为拉丁文

问题描述

这不是删除 HTML 标签的正则表达式!仔细阅读问题!

我创建了一个脚本来将西里尔文转换为拉丁文,将拉丁文转换为西里尔文。

将拉丁文转换为西里尔文会给 HTML 带来很多问题,因为脚本也会转换 HTML 元素。我创建了一个算法,将西里尔 HTML 转换为拉丁文,并将所有内容都保存在西里尔文的标签中。

该脚本几乎运行良好,但要么我有内存问题,要么 while 循环开始无限期地旋转。基本上,问题在于脚本的速度。

public function html_tags() {
    $tags = explode(',', '!DOCTYPE,a,abbr,acronym,address,applet,area,article,aside,audio,b,base,basefont,bdi,bdo,big,blockquote,body,br,button,canvas,caption,center,cite,code,col,colgroup,data,details,dd,del,details,dfn,dialog,dir,div,dl,dt,em,embed,fieldset,figcaption,figure,font,footer,form,frame,frameset,h1,h2,h3,h4,h5,h6,head,header,hr,html,i,iframe,img,input,ins,kbd,label,legend,li,link,main,map,mark,meta,master,nav,noframes,noscript,object,ol,optgroup,option,output,p,param,picture,pre,progress,q,rp,rt,ruby,s,samp,script,section,select,small,source,span,strike,strong,style,sub,summary,sup,svg,table,tbody,td,template,textarea,tfoot,th,thead,time,title,tr,track,tt,u,ul,var,video,wbr');
    $tags = array_map('trim', $tags);
    $tags = array_filter($tags);
    return apply_filters('serbian_transliteration_html_tags', $tags);
}

public function fix_cyr_html($content){
    $content = htmlspecialchars_decode($content);

    $tags = $this->html_tags();
    
    $tags_cyr = $tags_lat = array();
    foreach($tags as $tag){
        $tags_cyr[]='<' . str_replace($this->lat(), $this->cyr(), $tag);
        $tags_cyr[]='</' . str_replace($this->lat(), $this->cyr(), $tag) . '>';
        
        $tags_lat[]= '<' . $tag;
        $tags_lat[]= '</' . $tag . '>';
    }
    
    $tags_cyr = array_merge($tags_cyr, array('&нбсп;','&лт;','&гт;','&ндасх;','&мдасх;','хреф','срц','&лдqуо;','&бдqуо;','&лсqуо;','&рсqуо;','&сцарон;','&Сцарон;','&тилде;'));
    $tags_lat = array_merge($tags_lat, array('&nbsp;','&lt;','&gt;','&ndash;','&mdash;','href','src','&ldquo;','&bdquo;','&lsquo;','&rsquo;','ш','Ш','&tilde;'));
    
    $content = str_replace($tags_cyr, $tags_lat, $content);
    
    $lastPos = 0;
    $positions = [];

    while (($lastPos = mb_strpos($content, '<', $lastPos, 'UTF-8')) !== false) {
        $positions[] = $lastPos;
        $lastPos = $lastPos + mb_strlen('<', 'UTF-8');
    }

    foreach ($positions as $position) {
        if(mb_strpos($content, '>', 0, 'UTF-8') !== false) {
            $end   = mb_strpos($content, ">", $position, 'UTF-8') - $position;
            $tag  = mb_substr($content, $position, $end, 'UTF-8');
            $tag_lat = $this->cyr_to_lat($tag);
            $content = str_replace($tag, $tag_lat, $content);
        }
    }
    
    // Fix open tags
    $content = preg_replace_callback ('/(<[\x{0400}-\x{04FF}0-9a-zA-Z\/\=\"\'_\-\s\.\;\,\!\?\*\:\#\$\%\&\(\)\[\]\+\@\€]+>)/iu', function($m){
        return $this->cyr_to_lat($m[1]);
    }, $content);
    
    // FIx closed tags
    $content = preg_replace_callback ('/(<\/[\x{0400}-\x{04FF}0-9a-zA-Z]+>)/iu', function($m){
        return $this->cyr_to_lat($m[1]);
    }, $content);
    
    // Fix HTML entities
    $content = preg_replace_callback ('/\&([\x{0400}-\x{04FF}0-9]+)\;/iu', function($m){
        return '&' . $this->cyr_to_lat($m[1]) . ';';
    }, $content);
    
    // Fix JavaScript
    $content = preg_replace_callback('/(?=<script(.*?)>)(.*?)(?<=<\/script>)/s', function($matches) {
            return $this->cyr_to_lat($m[2]);
    }, $content);
    
    // Fix CSS
    $content = preg_replace_callback('/(?=<style(.*?)>)(.*?)(?<=<\/style>)/s', function($matches) {
            return $this->cyr_to_lat($m[2]);
    }, $content);
    
    // Fix email
    $content = preg_replace_callback ('/(([\x{0400}-\x{04FF}0-9\_\-\.]+)@([\x{0400}-\x{04FF}0-9\_\-\.]+)\.([\x{0400}-\x{04FF}0-9]{3,10}))/iu', function($m){
        return $this->cyr_to_lat($m[1]);
    }, $content);

    // Fix URL
    $content = preg_replace_callback ('/(([\x{0400}-\x{04FF}]{4,5}):\/{2}([\x{0400}-\x{04FF}0-9\_\-\.]+)\.([\x{0400}-\x{04FF}0-9]{3,10})(.*?)($|\n|\s|\r|\"\'\.\;\,\:\)\]\>))/iu', function($m){
        return $this->cyr_to_lat($m[1]);
    }, $content);
    
    // Fix attributes with doublequote
    $content = preg_replace_callback ('/(title|alt|data-(title|alt))\s?=\s?"(.*?)"/iu', function($m){
        return sprintf('%1$s="%2$s"', $m[1], esc_attr($this->lat_to_cyr($m[3])));
    }, $content);
    
    // Fix attributes with single quote
    $content = preg_replace_callback ('/(title|alt|data-(title|alt))\s?=\s?\'(.*?)\'/iu', function($m){
        return sprintf('%1$s=\'%2$s\'', $m[1], esc_attr($this->lat_to_cyr($m[3])));
    }, $content);

    return $content;
}

主要问题是我是否以及如何将坏的 HTML 标记(西里尔 HTML 标签和属性)转换为拉丁文并以西里尔文保存所有其他文本?

<див цласс="цонтент">Мама воли бебу</див>   --->   <div class="content">Мама воли бебу</div>

可以通过正则表达式实现还是有更快/更好的解决方案?

标签: phpregexcyrillictransliterationlatin

解决方案


推荐阅读