Reading a gz/gzip XML Sitemap in Python

Problem description

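I am trying to read a gzipped XML sitemap into pandas. requests should handle gzip automatically, and gzip is detected in the response headers, but decompression does not seem to happen: parsing fails with "not well-formed (invalid token): line 1, column 0", even though the sitemap looks fine to me.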

import requests
import pandas as pd
import xmltodict
import numpy as np

url = "https://www.blick.ch/article.xml"
res = requests.get(url)
raw = xmltodict.parse(res.text)

dfAllLocs = pd.DataFrame({'loc': []})

for r in raw["sitemapindex"]["sitemap"]:
    #try: 
        print(r["loc"])
        resSingle = requests.get(r["loc"])
        #print(resSingle.headers)

        rawSingle = xmltodict.parse(resSingle.text, encoding='utf-8')
        dataSingle = [[rSingle["loc"]] for rSingle in rawSingle["urlset"]["url"]]
        dfSingle = pd.DataFrame(dataSingle, columns=["loc"])
        dfAllLocs = pd.concat([dfAllLocs,dfSingle])
        print(len(dfAllLocs))
    #except:
    #    print("something went wrong at: " + r["loc"])
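The error message is what an XML parser reports when it is handed bytes that are still gzip-compressed. requests only decompresses transparently when the server sends a Content-Encoding: gzip header; a .gz file served as the payload itself arrives compressed. The sketch below reproduces this offline. It uses synthetic data with a placeholder example.com URL, and it swaps in the standard library's xml.etree.ElementTree for xmltodict so that it runs without extra dependencies (both are built on the expat parser).

```python
import gzip
import xml.etree.ElementTree as ET

# Synthetic stand-in for a sub-sitemap (hypothetical URL, not real data)
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<urlset><url><loc>https://example.com/a</loc></url></urlset>"
)
gz_bytes = gzip.compress(xml_bytes)

# A gzip stream always begins with the magic bytes 0x1f 0x8b
is_gzip = gz_bytes[:2] == b"\x1f\x8b"

# Handing the still-compressed bytes to the XML parser fails immediately
try:
    ET.fromstring(gz_bytes)
    parse_error = None
except ET.ParseError as exc:
    parse_error = str(exc)

# Decompressing first yields parseable XML
root = ET.fromstring(gzip.decompress(gz_bytes))
first_loc = root.find("url/loc").text
```

Here parse_error holds the parser's complaint about the compressed input, while first_loc is read normally once the bytes have been decompressed.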

Tags: python, pandas, python-requests, gzip

Solution


Thanks to Ionut Ticus. This link was very helpful: "Can't get requests==2.7.0 to automatically decompress gzip".

This works now:

import gzip

import pandas as pd
import requests
import xmltodict

# Get the sitemap index
url = 'https://www.watson.ch/sitemap.xml'
pattern = r'(.*?)\/'  # not used in this snippet
maxSubsitemapsToCrawl = 10

res = requests.get(url)
raw = xmltodict.parse(res.text)

dfSitemap = pd.DataFrame({'loc': []})

breakcounter = 0
for r in raw["sitemapindex"]["sitemap"]:
    try:
        print(r["loc"])
        # stream=True leaves the body unread so that resSingle.raw can be used
        resSingle = requests.get(r["loc"], stream=True)
        if resSingle.status_code == 200:
            if resSingle.headers.get('Content-Type') == 'application/x-gzip':
                # The payload itself is a .gz file: let urllib3 undo any
                # Content-Encoding first, then gunzip the file contents
                resSingle.raw.decode_content = True
                xmlSingle = gzip.GzipFile(fileobj=resSingle.raw)
            else:
                xmlSingle = resSingle.text

            # xmltodict.parse accepts either a string or a file-like object
            rawSingle = xmltodict.parse(xmlSingle)
            dataSingle = [[rSingle["loc"]] for rSingle in rawSingle["urlset"]["url"]]
            dfSingle = pd.DataFrame(dataSingle, columns=["loc"])
            dfSitemap = pd.concat([dfSitemap, dfSingle])
            print(len(dfSitemap))
    except Exception as e:
        print("something went wrong at: " + r["loc"] + " (" + str(e) + ")")

    # Stop after a fixed number of sub-sitemaps
    breakcounter += 1
    if breakcounter == maxSubsitemapsToCrawl:
        break
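
Servers do not always label gzipped sitemaps with the same Content-Type, so one possible variation is to look at the payload itself rather than the header. The following is a minimal sketch under that assumption, not part of the original answer: the helper names maybe_gunzip and extract_locs are hypothetical, the sample data is synthetic (example.com), and the standard library's xml.etree.ElementTree stands in for xmltodict so the block runs on its own.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body: bytes) -> bytes:
    """Decompress body if it starts with the gzip magic bytes, else return it unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


def extract_locs(body: bytes) -> list:
    """Return the <loc> values of a urlset sitemap, gzipped or not."""
    root = ET.fromstring(maybe_gunzip(body))
    return [el.text for el in root.findall("sm:url/sm:loc", SITEMAP_NS)]


# Synthetic sitemap used only for illustration
plain = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/a</loc></url>"
    b"<url><loc>https://example.com/b</loc></url>"
    b"</urlset>"
)
locs_plain = extract_locs(plain)
locs_gz = extract_locs(gzip.compress(plain))
```

With requests, the bytes to pass in would be the response's content attribute, for example extract_locs(resSingle.content) on a non-streamed request. requests has already removed any Content-Encoding layer from content, so the magic-byte check only triggers when the file itself is a .gz archive.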
