Get an optional substring from a CSV file

Problem description

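I need to parse a CSV file to get some information from each line (company code, company description, country). I'm using preg_match in PHP to parse the file, but I'm having trouble with some lines.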

Below are some lines from the CSV file:

"ASTA","Aerospace Technologies of Australia Pty Ltd (Australia)"
"ATAC"," American Tactical Aircraft Consultants (United States)"
"ATEC"," ATEC vos (Czech Republic)"
"ATG","Aviation Technology Group Inc (United States)"
"ATLAS","Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)"
"ATR","GIE Avions de Transport Régional (France/Italy)"
"AUSTER","Auster Aircraft Ltd (United Kingdom)"
"AUSTFLIGHT","Austflight ULA Pty Ltd (Australia)"
"AUSTRALIAN AEROSPACE","Australian Aerospace Pty Ltd (Australia)"
"AUSTRALITE","Australite Inc (United States)"
"AUTOGYRO","AutoGyro Europe GmbH (Germany)"
"AVANTAGE","OOO Samoletstroitelynyi Kompaniya Avantazh (Russia)"
"AVCRAFT","AvCraft Aviation LLC (United States)"
"AVEKO","Aveko sro (Czech Republic)"
"AVIA (1)","Azionari Vercellese Industrie Aeronautiche (Italy)"
"AVIA (2)","Avia-Zavody Jirího Dimitrova (Czech Republic)"

The PHP preg_match code is as follows:

preg_match('#^(.+?)\s\((.+?)\)$#',$string,$matches);

This code works for lines such as:

"ASSO AEREI","Asso Aerei Srl (Italy)"

In the example above I successfully get the three pieces of data into the matches array, but with the following line

"ATLAS","Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)"

I get, as the company description:

Atlas Aircraft Corporation of South Africa

and as the country:

Pty) Ltd (South Africa

Instead, they should be:

Atlas Aircraft Corporation of South Africa (Pty) Ltd

South Africa

Another problem that is driving me crazy: when a line does not include a country, as in the following line

"AERFER-AERMACCHI","see AERFER and AERMACCHI"

I get an empty matches array, so there is no company description at all.
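Both failures can be reproduced in isolation. The sketch below assumes that $string holds only the second CSV field (the question does not show how it is filled); the $samples and $results arrays are illustrative names, not part of the original code.

```php
<?php
// Assumption: the question's pattern is applied to the second CSV field only.
$pattern = '#^(.+?)\s\((.+?)\)$#';

// Illustrative sample fields taken from the lines quoted in the question.
$samples = [
    'Asso Aerei Srl (Italy)',
    'Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)',
    'see AERFER and AERMACCHI',
];

$results = [];
foreach ($samples as $string) {
    $matches = [];
    $found = preg_match($pattern, $string, $matches);
    $results[$string] = $matches;

    if ($found === 1) {
        printf("description: %s | country: %s\n", $matches[1], $matches[2]);
    } else {
        // preg_match leaves $matches empty when the pattern does not match.
        printf("no match: %s\n", $string);
    }
}
```

The lazy (.+?) stops at the first space followed by an opening parenthesis, so everything from there up to the final closing parenthesis ends up in the second group. The line without parentheses cannot satisfy the mandatory \((.+?)\) part, so nothing matches.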

Any help with fixing the regex pattern would be much appreciated.

Tags: php, regex, preg-match

Solution


$csv = <<<'EOD'
"ASTA","Aerospace Technologies of Australia Pty Ltd (Australia)"
"ATAC"," American Tactical Aircraft Consultants (United States)"
"ATEC"," ATEC vos (Czech Republic)"
"ATG","Aviation Technology Group Inc (United States)"
"ATLAS","Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)"
"ATR","GIE Avions de Transport Régional (France/Italy)"
"AUSTER","Auster Aircraft Ltd (United Kingdom)"
"AUSTFLIGHT","Austflight ULA Pty Ltd (Australia)"
"AUSTRALIAN AEROSPACE","Australian Aerospace Pty Ltd (Australia)"
"AUSTRALITE","Australite Inc (United States)"
"AUTOGYRO","AutoGyro Europe GmbH (Germany)"
"AVANTAGE","OOO Samoletstroitelynyi Kompaniya Avantazh (Russia)"
"AVCRAFT","AvCraft Aviation LLC (United States)"
"AVEKO","Aveko sro (Czech Republic)"
"AVIA (1)","Azionari Vercellese Industrie Aeronautiche (Italy)"
"AVIA (2)","Avia-Zavody Jirího Dimitrova (Czech Republic)"
"AERFER-AERMACCHI","see AERFER and AERMACCHI"
EOD;

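// Expose the sample data as a stream so that fgetcsv() can read it like a file.
$url = 'data:text/plain,' . urlencode($csv);

if ( false !== $handle = fopen($url, "r") ) {
    while ( false !== $data = fgetcsv($handle) ) {
        // $data[1] is the description field; skip rows that lack it (e.g. blank lines).
        if ( isset($data[1]) && preg_match('~(\S.*?)(?|\h*\(([^)]*)\)|())\h*$~', $data[1], $m) ) {
            // $m[1]: description, $m[2]: country (empty string when absent)
            printf("%-70s\t%s\n", $m[1], $m[2]);
        }
    }
    fclose($handle);
}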


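Pattern explanation:

There are two important things in your question:

  • the country can be optional
  • the description can contain parentheses too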

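That is why I use a non-greedy quantifier for the description part, (\S.*?). Because the rest of the pattern is anchored with $, the description keeps growing until what follows can reach the end of the string. When a country is present, the description is therefore forced to stop at the opening parenthesis, but only at the one whose parenthesized part sits at the end of the string.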
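To see why the quantifier has to be lazy, here is a small comparison. The $greedy pattern is a hypothetical variant written only for this illustration; it differs from the answer's pattern solely in using .* instead of .*?.

```php
<?php
// The answer's pattern: lazy description, then an optional country.
$lazy = '~(\S.*?)(?|\h*\(([^)]*)\)|())\h*$~';
// Hypothetical variant for comparison only: greedy description.
$greedy = '~(\S.*)(?|\h*\(([^)]*)\)|())\h*$~';

$field = 'Atlas Aircraft Corporation of South Africa (Pty) Ltd (South Africa)';

preg_match($lazy, $field, $lazyMatch);
preg_match($greedy, $field, $greedyMatch);

// Lazy: the description ends right before the final parenthesized part.
printf("lazy   => [%s] [%s]\n", $lazyMatch[1], $lazyMatch[2]);
// Greedy: the description swallows the whole field and the empty branch wins.
printf("greedy => [%s] [%s]\n", $greedyMatch[1], $greedyMatch[2]);
```

With the greedy version the first attempt already consumes the entire field, the empty alternative succeeds at the end of the string, and the country is never captured.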

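The \S at the beginning is only there to trim the description on the left, which is also why the pattern has no ^ anchor: the match simply starts at the first non-whitespace character. On the right, one of the \h* takes care of the trimming (again thanks to the non-greedy quantifier, which leaves the trailing horizontal whitespace to \h*).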
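The trimming behaviour can be checked with deliberately padded input. This is a minimal sketch: splitDescription() is a hypothetical helper name introduced here, and the padded strings are made up for the test rather than taken from the CSV file.

```php
<?php
/**
 * Hypothetical helper: split a description field into [description, country].
 * Returns null when the field contains no non-whitespace character.
 */
function splitDescription(string $field): ?array
{
    if (preg_match('~(\S.*?)(?|\h*\(([^)]*)\)|())\h*$~', $field, $m)) {
        return [$m[1], $m[2]];
    }
    return null;
}

// Made-up inputs with extra spaces on both sides.
$withCountry    = splitDescription('  ATEC vos  (Czech Republic)  ');
$withoutCountry = splitDescription('  see AERFER and AERMACCHI  ');

printf("[%s] [%s]\n", $withCountry[0], $withCountry[1]);
printf("[%s] [%s]\n", $withoutCountry[0], $withoutCountry[1]);
```

Neither captured description carries leading or trailing spaces, so no separate trim() call is needed for horizontal whitespace.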

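About the country part: instead of using an optional non-capturing group (?:\h*\(([^)]*)\))?, I chose a branch reset group (?|... (...) ... | ... (...) ...) to ensure that capture group 2 exists even when the country is absent. In this kind of group, the capture groups have the same number in each branch: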

(?|
    \h* \( ([^)]*) \) # the country name is present and captured in group 2
  |   # OR
    () # the capture group 2 contains an empty string
)
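The practical difference between the two approaches shows up in the matches array. In the sketch below, $optional is the alternative the answer decided against, spelled out here for comparison; the variable names are illustrative.

```php
<?php
// Alternative mentioned in the answer: optional non-capturing group.
$optional = '~(\S.*?)(?:\h*\(([^)]*)\))?\h*$~';
// The answer's choice: branch reset group with an empty second branch.
$reset = '~(\S.*?)(?|\h*\(([^)]*)\)|())\h*$~';

$field = 'see AERFER and AERMACCHI';

preg_match($optional, $field, $optionalMatch);
preg_match($reset, $field, $resetMatch);

// By default preg_match drops trailing groups that did not take part in the match,
// so index 2 is missing here and reading it would raise a warning.
var_dump(array_key_exists(2, $optionalMatch));

// With the branch reset group, group 2 always participates and is an empty string.
var_dump($resetMatch[2]);
```

With the branch reset group, the printf call in the loop can read $m[2] unconditionally, without an isset() check or the PREG_UNMATCHED_AS_NULL flag.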
