How to scrape between a pattern and the nth occurrence of a comma

Problem description

I am trying to scrape a text file between a pattern and the 12th comma that follows it.
All I get is an empty page.

My expected result is:

WHEAT-SRW - CHICAGO BOARD OF TRADE",200114,2020-01-14,001602,CBT ,00,001 , 476764, 146061, 107856, 162340, 136922

Here is the code:

$scrape = scrape_between($scraped_page, 
                         'WHEAT-SRW - CHICAGO BOARD OF TRADE', 
                         '/[.*^,]+,[.*^,]+,[.*^;]+,[.*^,]+,[.*^,]+/'
                         );

If I use a different pattern, such as fghi, everything works and I get my result.

What is wrong?

The full script is:

<?php
    function scrape_between($data, $start, $end){
        $data = stristr($data, $start);
        $data = substr($data, strlen($start));
        $stop = stripos($data, $end);
        $data = substr($data, 0, $stop);
        return $data;
    }
    function curl($url) {
        $options = Array(
            CURLOPT_RETURNTRANSFER => TRUE,
            CURLOPT_FOLLOWLOCATION => TRUE,
            CURLOPT_AUTOREFERER => TRUE,
            CURLOPT_CONNECTTIMEOUT => 120,
            CURLOPT_TIMEOUT => 120,
            CURLOPT_MAXREDIRS => 10,
            CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",
            CURLOPT_URL => $url,
        );

        $ch = curl_init();
        curl_setopt_array($ch, $options);
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }
    $scraped_page = curl("https://www.cftc.gov/dea/newcot/deafut.txt");
    $scraped_wheat = scrape_between($scraped_page, 'WHEAT-SRW - CHICAGO BOARD OF TRADE', '/(?:,[^,]+){11}/');

    echo ($scraped_wheat."<br>");
?>

Tags: php, regex, file, web-scraping

Solution


I wouldn't use a regular expression for this. Try something like the following and see whether it works:

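// Split the page into lines; the wheat row is assumed to be the first line.
$pieces = explode("\n", $scraped_page);
// Split that line on commas and print the first 12 fields.
$items = explode(",", $pieces[0]);
$tmp = 0;
foreach ($items as $value) {
    if ($tmp++ < 12) {
        echo ($value.",");
    }
}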

The output I get is:

"WHEAT-SRW - CHICAGO BOARD OF TRADE",200114,2020-01-14,001602,CBT ,00,001 , 476764, 146061, 107856, 162340, 136922,

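As for why the original attempt comes back empty: `scrape_between` passes its end marker to `stripos`, which searches for literal text and does not evaluate regular expressions, so a regex-style end marker is never found and the final `substr` returns an empty string. The snippet above also relies on the wanted row being the first line of the file. The following is a minimal sketch of the same split-on-commas idea that first locates the start pattern. The helper name `scrape_fields` and the sample data are hypothetical, made up for illustration, and are not part of the original script.

```php
<?php
// Hypothetical helper (not from the original post): returns the first $n
// comma-separated fields of the line, beginning at the first case-insensitive
// occurrence of $start, or null when $start does not occur in $page.
function scrape_fields(string $page, string $start, int $n): ?string
{
    $pos = stripos($page, $start);
    if ($pos === false) {
        return null; // start marker not present
    }
    // Keep only the remainder of the line that contains the marker.
    $rest = substr($page, $pos);
    $line = substr($rest, 0, strcspn($rest, "\r\n"));
    // Split on commas and keep the first $n fields.
    $fields = array_slice(explode(',', $line), 0, $n);
    return implode(',', $fields);
}

// Made-up sample in the same shape as the question's expected result,
// with an extra leading row and an extra trailing field.
$sample = "\"OTHER - EXAMPLE\",1,2,3\n"
        . "\"WHEAT-SRW - CHICAGO BOARD OF TRADE\",200114,2020-01-14,001602,"
        . "CBT ,00,001 , 476764, 146061, 107856, 162340, 136922, 99999\n";

echo scrape_fields($sample, 'WHEAT-SRW - CHICAGO BOARD OF TRADE', 12), "\n";
```

On this sample the call prints the row from the start pattern up to, but not including, the 12th comma, which matches the expected result given in the question. Against the real page, the call would presumably be `scrape_fields($scraped_page, 'WHEAT-SRW - CHICAGO BOARD OF TRADE', 12)`. Note that the sketch assumes no field contains a comma of its own.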
