How to convert doc and docx files to plain text in PHP?

Problem description

Code:

<?php
    function parseWord($userDoc) 
    {
        // Read the whole file as raw bytes; give up if it cannot be read
        $word_text = @file_get_contents($userDoc);
        if( $word_text === false )
        {
            return "";
        }
        $line = "";
        $tam = filesize($userDoc);
        $nulos = 0;
        $caracteres = 0;
        for($i=1536; $i<$tam; $i++)
        {
            $line .= $word_text[$i];
            // Compare against the NUL byte itself; == 0 would compare the character with the integer 0
            if( $word_text[$i] === "\0" )
            {
                $nulos++;
            }
            else
            {
                $nulos=0;
                $caracteres++;
            }

            if( $nulos>1996)
            {   
                break;  
            }
        }
        $lines = explode(chr(0x0D),$line);
        $outtext = "";
        foreach($lines as $thisline)
        {
            $tam = strlen($thisline);
            if( !$tam )
            {
                continue;
            }
            $new_line = ""; 
            for($i=0; $i<$tam; $i++)
            {
                $onechar = $thisline[$i];
                if( $onechar > chr(240) )
                {
                    continue;
                }

                if( $onechar >= chr(0x20) )
                {
                    $caracteres++;
                    $new_line .= $onechar;
                }

                if( $onechar == chr(0x14) )
                {
                    $new_line .= "</a>";
                }
                if( $onechar == chr(0x07) )
                {
                    $new_line .= "\t";
                    if( isset($thisline[$i+1]) )
                    {
                        if( $thisline[$i+1] == chr(0x07) )
                        {
                            $new_line .= "\n";
                        }
                    }
                }
            }
            $new_line = str_replace("HYPERLINK" ,"<a href=",$new_line); 
            $new_line = str_replace("\o" ,">",$new_line); 
            $new_line .= "\n";
            $new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line); 
            $new_line = str_replace("\*" ,"><br>",$new_line); 
            $new_line = str_replace("MERGEFORMATINET" ,"",$new_line); 
            $outtext .= nl2br($new_line);
        }
        return $outtext;
    } 
    $userDoc = "upload_resume/".$upload_resume;
    $text = parseWord($userDoc);
    echo $text;
?>

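I only upload doc and docx files to my upload_resume folder. Now I want to display a doc or docx file, or convert it to plain text, using this function (parseWord). I simply read the file and print the text, but it is not converted to plain text. My output looks like this: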

���JA����a�}7�"����H�w"넙�w̤ھ�� �P�^���O֛����;�<�aYՠ؛`G�kxm�� PY �[��g Gΰino�/<���<�1��ⳆA$>"f3��\���T��IS��̌����W����Y

I don't know where the problem is. How can I fix this? Please help me.

Thank you.

Tags: php

Solution


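A Word file is not a simple text document that you can open and read byte by byte. A .doc file is a binary container and a .docx file is a ZIP archive of XML parts, which is why printing the raw bytes produces garbage. There are PHP libraries that can parse these formats, for example PHPWord: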

https://github.com/PHPOffice/PHPWord
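
If pulling in a full library is not an option, the .docx case can also be handled with PHP's bundled zip and dom extensions, because a .docx file is a ZIP archive whose main text lives in word/document.xml. The following is only a minimal sketch under that assumption: the function name docxToText is hypothetical, it is not part of PHPWord, and it ignores headers, footers, footnotes and most formatting.

```php
<?php
// Hypothetical helper: returns the paragraph text of a .docx file, one paragraph per line.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    // The main document body is stored in this archive entry
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("word/document.xml not found in $path");
    }

    $dom = new DOMDocument();
    if (!$dom->loadXML($xml)) {
        throw new RuntimeException("word/document.xml is not valid XML");
    }
    $xpath = new DOMXPath($dom);
    // WordprocessingML namespace used by the w: prefix
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $lines = [];
    // Each w:p is a paragraph; its visible text sits in the w:t descendants
    foreach ($xpath->query('//w:p') as $paragraph) {
        $line = '';
        foreach ($xpath->query('.//w:t', $paragraph) as $textNode) {
            $line .= $textNode->nodeValue;
        }
        $lines[] = $line;
    }
    return implode("\n", $lines);
}
```

It could be called as echo nl2br(htmlspecialchars(docxToText($userDoc))); in place of parseWord($userDoc). This sketch covers only .docx; the legacy binary .doc format needs a dedicated parser such as the library linked above.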

