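python - Scraping the code shown by "View Page Source"
Problem description
I am trying to scrape the code that a website shows when you right-click and select "View Page Source". I think the code below instead scrapes the output you see when you right-click and select "Inspect". I am also getting an error that says "File was loaded in the wrong encoding: 'UTF-8'". I am doing data mining based on the original page source, but I can't figure out how to pull it in.
See below: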
from bs4 import BeautifulSoup
import sys
import urllib.request

# Append everything that is printed to a file, written explicitly as UTF-8.
sys.stdout = open('scrapingoutput', 'a', encoding='utf-8')

url = "https://www.geodatadirect.com/SearchResults/SuffolkSearchResults.aspx?state=NY&id=Suffolk&type=Sales"

# urlopen() returns the raw HTML sent by the server, roughly what "View Page Source" shows.
content = urllib.request.urlopen(url).read()

# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(content, 'html.parser')
print(soup.prettify())
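Solution
Try the selenium library to download the page. Selenium drives a real browser, so it can also retrieve content that is loaded dynamically.
For the Chrome browser, download the driver from:
http://chromedriver.chromium.org/downloads
Install the web driver for Chrome: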
unzip ~/Downloads/chromedriver_linux64.zip -d ~/Downloads
chmod +x ~/Downloads/chromedriver
sudo mv -f ~/Downloads/chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
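Before running any Selenium code, it can help to confirm that the driver is actually reachable after these steps. The following is a minimal sketch using only the standard library. The helper name `find_driver` is made up for this example, and it only checks that an executable with that name is on `PATH`, not that its version matches the installed Chrome.

```python
import shutil
from typing import Optional


def find_driver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver was not found on PATH; re-check the install steps")
    else:
        print(f"chromedriver found at {path}")
```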
Selenium tutorial:
https://selenium-python.readthedocs.io/
Replace your code with this:
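from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service('/usr/bin/chromedriver'))
try:
    driver.get('https://www.geodatadirect.com/SearchResults/SuffolkSearchResults.aspx?state=NY&id=Suffolk&type=Sales')
    time.sleep(3)  # give the page's scripts time to finish loading
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.prettify())
finally:
    driver.quit()  # always close the browser, even if an error occurs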
Output:
<html>
<head>
<title>
Nationwide Property Data, Reports, Sales Comps
</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no" name="viewport"/>
<meta content="--XLC17oYuQE6UEhT-9_rC13L639t4C53w40_nWSbDM" name="google-site-verification"/>
<meta content="GeoData Plus provides nationwide property reports, sales comparables, foreclosures, mortgages. Property data for residential and commercial real estate." name="description"/>
<meta content="GeoData Plus" property="og:title"/>
<meta content="https://www.geodataplus.com" property="og:url"/>
<link href="https://www.geodataplus.com" rel="canonical"/>
<meta content="website" property="og:type"/>
<link href="/favicon.ico?v=Kx5JMIU84bo6i-lOxVlIH29IO5Qc9QPT6ENpVMaN-JE" rel="shortcut icon"/>
<link href="/css/master.css?v=Liu7xdmA3BH167YXbnG76LfxA58TPHQR1J4L4ZzM5Qk" rel="stylesheet"/>
<link href="/fonts/stylesheet.css?v=3NqqVyD10iq4848EK3FrA0HOaygo2MyDfL49n8ftRB0" rel="stylesheet"/>
<link href="/css/Jquery-ui-auto.css?v=Nul8_ltyyt4O0iNe5la8BhlJ-Z84SOdeInfup2plryA" media="all" onload="if(media!='all')media='all'" rel="stylesheet"/>
<noscript>
<link href="/css/Jquery-ui-auto.css?v=Nul8_ltyyt4O0iNe5la8BhlJ-Z84SOdeInfup2plryA" rel="stylesheet"/>
</noscript>
<link href="theme/default/style.css" rel="stylesheet" type="text/css"/>
</head>
<body data-offset="200" data-spy="scroll" data-target=".navbar">
<div class="" id="mainDiv">
<div class="load-complete" id="site-loader">
.........
..........
</div>
</div>
</body>
</html>
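The `time.sleep(3)` in the code above is a fixed delay: it may be too short on a slow connection and wastes time on a fast one. A common alternative is to poll until a condition holds. Selenium provides `WebDriverWait` for this; the helper below is a library-independent sketch of the same idea, and the name `wait_until` and its default values are assumptions made for this example.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 10.0,
               interval: float = 0.25) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical use with the driver from the answer. It assumes that the
# "load-complete" class seen in the output marks a fully loaded page:
#   loaded = wait_until(lambda: "load-complete" in driver.page_source)
```

With this in place, the call to `time.sleep(3)` could be replaced by the `wait_until(...)` line, and the return value tells you whether the page finished loading within the timeout.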
'/usr/bin/chromedriver' is the path to the Chrome driver.
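The question also mentions a "wrong encoding" message, which the answer does not address. One possible cause (an assumption, not confirmed by the source) is that the output file was written with the platform's default encoding and later opened as UTF-8. The following is a minimal sketch that writes the page text explicitly as UTF-8; the helper name `save_html` and the file name are illustrative only.

```python
from pathlib import Path


def save_html(html: str, path: str) -> int:
    """Write `html` to `path` as UTF-8 and return the number of bytes written."""
    data = html.encode("utf-8")
    Path(path).write_bytes(data)
    return len(data)


if __name__ == "__main__":
    # Stand-in for driver.page_source or soup.prettify().
    sample = "<html><head><meta charset='utf-8'/></head><body>café</body></html>"
    written = save_html(sample, "scrapingoutput.html")
    print(f"wrote {written} bytes")
```

In the Selenium code, this would be called as `save_html(soup.prettify(), "scrapingoutput.html")` instead of printing to a redirected `sys.stdout`.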