How to get the redirected URL?

Problem description

I am trying to get the URL (a PDF file) that https://trade.ec.europa.eu/doclib/html/153814.htm redirects to.

So far I have tried:

import requests

r = requests.get('https://trade.ec.europa.eu/doclib/html/153814.htm', allow_redirects=True)
print(r.url)

It prints the same original URL. I need the redirected URL, which is https://trade.ec.europa.eu/doclib/docs/2015/september/tradoc_153814.pdf

Tags: python, web-scraping, python-requests

Solution


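Try this code and see whether it works for you. Rather than following an HTTP redirect, it extracts the links from the page, takes the first one as the PDF URL, and reads the PDF.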

import io
import re
from urllib.parse import urljoin

import requests
from PyPDF2 import PdfReader  # PyPDF2 >= 2.0; older releases use PdfFileReader
from requests_html import HTMLSession

url = "https://trade.ec.europa.eu/doclib/html/153814.htm"

# Fetch the HTML page
session = HTMLSession()
r = session.get(url)

# Extract the href of every <a> element
links = r.html.xpath('//a/@href')

# Drop mailto:, javascript: and tel: links, and resolve
# relative paths against the page URL
updated_links = []
for link in links:
    if re.search(r"@|javascript:|tel:", link):
        continue
    updated_links.append(urljoin(url, link))

# The first remaining link is assumed to be the PDF
pdf_url = updated_links[0]
print(pdf_url)

# Download the PDF and print the text of its first page
r = requests.get(pdf_url)
reader = PdfReader(io.BytesIO(r.content))
contents = reader.pages[0].extract_text()
print(contents)
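
If only the PDF URL is needed, the link-extraction step can also be sketched with the standard library alone. This is a minimal sketch, not the answer's method: the names LinkCollector and first_pdf_link are hypothetical, the sample markup is made up rather than copied from the real page, and it assumes the page exposes the PDF as an ordinary <a href> link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def first_pdf_link(html, base_url):
    """Return the first link whose path ends in .pdf as an absolute URL, or None."""
    collector = LinkCollector()
    collector.feed(html)
    for href in collector.hrefs:
        # Resolve relative links against the URL of the page they came from
        absolute = urljoin(base_url, href)
        # Ignore any query string when checking the extension
        if absolute.lower().split("?", 1)[0].endswith(".pdf"):
            return absolute
    return None


# Hypothetical sample markup, not the real page content
sample = (
    '<p><a href="mailto:x@example.com">mail</a> '
    '<a href="/doclib/docs/2015/september/tradoc_153814.pdf">PDF</a></p>'
)
print(first_pdf_link(sample, "https://trade.ec.europa.eu/doclib/html/153814.htm"))
```

To use it on the live page, pass r.text and r.url from a requests response instead of the sample string. Filtering on the .pdf extension avoids relying on the PDF being the first link on the page.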
