Getting the value of a meta tag with lxml in Python

Question

[Image: the site in the browser inspector]

I want to extract the postalCode value from content="------".

import requests
import lxml.html

html = requests.get("https://www.craispesaonline.it/provincia/lucca")
doc = lxml.html.fromstring(html.content)

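# An XPath ending in /@content returns a list of strings, not elements,
# so there is no text_content() to call on the result.
x = doc.xpath('//meta[@itemprop="postalCode"]/@content')
print(x)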

Tags: python, selenium, web-scraping, python-requests, lxml

Solution


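If you run print(b"itemprop" in html.content) (html.content is bytes, so the search term must be bytes too; "itemprop" in html.text also works), you will see that the tag is not in the HTML source at all. That means it is added by JavaScript running on the page. lxml, like BeautifulSoup, does not execute JavaScript, so you would need a headless browser engine to render the page before parsing it.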
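The check above can be sketched as a small helper. This is only an illustration: the function name meta_itemprop_values and the two sample HTML strings are made up for the example and are not taken from the site. It shows that the XPath query returns an empty list when the tag is missing from the source, and a list of plain strings when it is present.

```python
import lxml.html


def meta_itemprop_values(html_bytes, itemprop):
    """Return the content attribute of every <meta> tag with the given itemprop."""
    doc = lxml.html.fromstring(html_bytes)
    # The itemprop value is passed as an XPath variable to avoid quoting issues.
    return doc.xpath("//meta[@itemprop=$name]/@content", name=itemprop)


# Hypothetical stand-ins: what the server might send vs. what a browser renders.
static_html = b"<html><head><title>Stores</title></head><body><p>Loading</p></body></html>"
rendered_html = (
    b'<html><head><meta itemprop="postalCode" content="55100"></head>'
    b"<body><p>Store</p></body></html>"
)

print(meta_itemprop_values(static_html, "postalCode"))    # nothing to find in the raw source
print(meta_itemprop_values(rendered_html, "postalCode"))  # found once the tag exists
```

An empty result for the HTML fetched with requests is a strong hint that the tag is injected client-side.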

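For this particular site, however, you do not need to scrape the postal code from the page source. If you watch the browser inspector while the page loads, you can see that the address information is fetched from https://www.craispesaonline.it/showcase/rest/api/public/province/lucca, and each entry carries the postal code in its zipCode field: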

[
  {
    id: 256,
    name: "CRAI Lucca",
    alias: "lucca-via-prov-salessio-1609",
    address: "Via di Sant'Alessio, 1609",
    city: "Lucca",
    zipCode: "55100",
    servedZipCodes: [],
    latitude: 43.8611631,
    longitude: 10.487961799999994,
    groceryCode: "005",
    email: "tetsrllucca@gmail.com",
    telephone: "0583/341251",
    media: [
      { url: "694/694_1", altText: "Crai Lucca", title: "Crai Lucca" },
      { url: "694/694_2", altText: "Crai Lucca", title: "Crai Lucca" },
    ],
    fullCity: { id: 4484, name: "Lucca", latitude: 43.84432282, longitude: 10.50151366 },
    province: { id: 49, name: "Lucca", code: "LU", istatCode: "046", alias: null, region: "Toscana", regionIstat: "09", temp_alias: "lucca" },
    shippingEnabled: true,
    disabled: false,
    indexable: true,
  },
  ...
]
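A minimal sketch of reading that endpoint directly with requests, assuming the response is a JSON array shaped like the sample above. The helper names fetch_stores and zip_codes are illustrative, and the endpoint may change or require headers not shown here.

```python
import requests

# URL as given in the answer; the response shape is assumed to match the sample.
API_URL = "https://www.craispesaonline.it/showcase/rest/api/public/province/lucca"


def fetch_stores(url=API_URL, timeout=10):
    """Download the store list and decode it from JSON."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


def zip_codes(stores):
    """Collect the zipCode of every store entry that has one."""
    return [store["zipCode"] for store in stores if "zipCode" in store]
```

Calling zip_codes(fetch_stores()) would then give the postal codes without any HTML parsing or JavaScript rendering.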
