python - Requests / BeautifulSoup Facebook language error
Problem description
I want to scrape Facebook company pages for their date (if they have one). The problem is that when I try to retrieve the HTML, I get what looks like the Hebrew version of the page (I'm located in Israel).
This is part of the result:
�1u�9X�/.������~�O+$B\^����y�����e�;�+
Code:
import requests
from bs4 import BeautifulSoup

headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}

url = 'https://www.facebook.com/pg/google/about/'

def fetch(URL):
    try:
        response = requests.get(url=URL, headers=headers).text
        print(response)
    except:
        print('Could not retrieve data, or connect')

fetch(url)
Is there a way to request the English version of the site? Is there a subdomain for it, or should I use a proxy in the request?
Solution
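What you are seeing isn't the Hebrew version of the site but a compressed response from the server. The request advertises br (Brotli) in accept-encoding; requests decodes gzip and deflate on its own, but Brotli only when a Brotli library is installed, so the body most likely arrives Brotli-compressed and .text shows the raw bytes. As a quick solution, you can remove the accept-encoding header from the request: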
import requests
from bs4 import BeautifulSoup

headers = {
    'accept': '*/*',
    # Commented out so that requests sends its own default accept-encoding
    # 'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
}

url = 'https://www.facebook.com/pg/google/about/'

def fetch(URL):
    try:
        # .text is the response body decoded to a str
        response = requests.get(url=URL, headers=headers).text
        print(response)
    except requests.RequestException:
        # Covers connection errors, timeouts and other request failures
        print('Could not retrieve data, or connect')

fetch(url)
Prints the uncompressed page:
<!DOCTYPE html>
<html lang="en" id="facebook" class="no_js">
<head><meta charset="utf-8" /><meta name="referrer" content="origin-when-crossorigin" id="meta_referrer" /><script>window._cstart=+new Date();</script><script>function envFlush(a){function b(b){for(var c in a)b[
...and so on.
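If you would rather keep compression than drop the header entirely, one option is to advertise only the codings your client can decode. The sketch below is an illustration, not part of the original answer: restrict_accept_encoding and SUPPORTED are hypothetical names, and it assumes the header key is spelled in lowercase as in the code above and that no Brotli library is installed. The gzip round trip at the end only mimics, offline, why a compressed body printed as text looks like garbage.

```python
import gzip

# Content codings the HTTP client is assumed to decode on its own.
# (Assumption: no Brotli library is installed, so "br" is left out.)
SUPPORTED = ("gzip", "deflate")


def restrict_accept_encoding(headers, supported=SUPPORTED):
    """Return a copy of headers whose accept-encoding lists only supported codings."""
    cleaned = dict(headers)
    requested = cleaned.get("accept-encoding", "")
    # Compare the coding name only, ignoring any ";q=..." weight.
    kept = [
        token.strip()
        for token in requested.split(",")
        if token.split(";")[0].strip().lower() in supported
    ]
    if kept:
        cleaned["accept-encoding"] = ", ".join(kept)
    else:
        # Nothing usable was requested: drop the header and let the client choose.
        cleaned.pop("accept-encoding", None)
    return cleaned


# Offline illustration: a compressed body decoded as text is unreadable,
# while decompressing it first gives the markup back.
html = '<!DOCTYPE html><html lang="en" id="facebook"></html>'
compressed = gzip.compress(html.encode("utf-8"))
garbled = compressed.decode("utf-8", errors="replace")
restored = gzip.decompress(compressed).decode("utf-8")

print(restrict_accept_encoding({"accept-encoding": "gzip, deflate, br"}))
print(repr(garbled[:20]))
print(restored)
```

The dict returned by restrict_accept_encoding(headers) can be passed as headers= to requests.get in place of the original one.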