首页 > 解决方案 > 使用python抓取时获取javascript变量值

问题描述

我知道以前也有人问过这个问题,但我是爬虫和 python 的新手。请帮助我,这对我的学习路径非常有帮助。

我正在使用带有Beautiful Soup等软件包的 python 抓取一个新闻网站。

java script我在获取标签中声明的变量的值时遇到了困难,script而且它也在那里得到更新。

这是我正在抓取的 HTML 页面的一部分:(仅包含脚本部分)

<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

  <script type="text/javascript" src="/dist/scripts/index.js"></script>
  <script type="text/javascript" src="/dist/scripts/read.js"></script>
  <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
  <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      $("#load-more-btn").hide();
      $("#load-more-gif").show();
      $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
          data = JSON.parse(data);
          min_news_id = data.min_news_id||min_news_id; // line 2
          $(".card-stack").append(data.html);
      })
      .fail(function(){alert("Error : unable to load more news");})
      .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
  </script>

从上面的部分,我想得到min_news_idpython中的值。如果从第 2 行更新,我也应该得到相同变量的值。

这是我的做法:

    self.pattern = re.compile('var min_news_id = (.+?);') // or self.pattern = re.compile('min_news_id = (.+?);')
    page = bs(htmlPage, "html.parser")
    //find all the scripts tag
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                scriptString.replace('"', '\\"')
                print(scriptString)
                if(self.pattern.match(str(scriptString))):
                    print("matched")
                    data = self.pattern.match(scriptString)
                    jsVariable = json.loads(data.groups()[0])
                    InShortsScraper.newsOffset = jsVariable
                    print(InShortsScraper.newsOffset)

但我从来没有得到变量的值。是我的正则表达式或其他任何问题吗?请帮我。先感谢您。

标签: pythonweb-scrapingbeautifulsouppython-3.6

解决方案


html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

  <script type="text/javascript" src="/dist/scripts/index.js"></script>
  <script type="text/javascript" src="/dist/scripts/read.js"></script>
  <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
  <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      $("#load-more-btn").hide();
      $("#load-more-gif").show();
      $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
          data = JSON.parse(data);
          min_news_id = data.min_news_id||min_news_id; // line 2
          $(".card-stack").append(data.html);
      })
      .fail(function(){alert("Error : unable to load more news");})
      .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
  </script>'''

finder = re.findall(r'min_news_id = .*;', html)
print(finder)

Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']

#2 或者你可以使用

print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())

Output:
d7zlgjdu-1

#3 或者你可以使用

finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)   

Output:
['d7zlgjdu-1'] 

推荐阅读