Requests / BeautifulSoup Facebook language error

Problem description

I want to scrape Facebook company pages for their date (if they have one). The problem is that when I try to retrieve the HTML, I get the Hebrew version of it (I'm located in Israel).

This is part of the result:

�1u�9X�/.������~�O+$B\^����y�����e�;�+

Code:

import requests
from bs4 import BeautifulSoup

headers = {'accept': '*/*',
           'accept-encoding': 'gzip, deflate, br',
           'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
           'cache-control': 'no-cache',
           'dnt': '1',
           'pragma': 'no-cache',
           'referer': 'https',
           'sec-fetch-mode': 'no-cors',
           'sec-fetch-site': 'cross-site',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
           }

url = 'https://www.facebook.com/pg/google/about/'

def fetch(URL):
    try:
        response = requests.get(url=URL, headers=headers).text
        print(response)
    except:
        print('Could not retrieve data, or connect')

fetch(url)

Is there a way to request the English version of the site? Is there a subdomain for it, or should I use a proxy for the request?

Tags: python, python-3.x, beautifulsoup, python-requests

Solution


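What you are seeing isn't the Hebrew version of the site but a compressed response from the server. Your accept-encoding header offers 'br' (Brotli), and requests only decodes Brotli automatically when a Brotli library is installed, so .text ends up decoding the still-compressed bytes as if they were text. As a quick solution, you can remove the accept-encoding header from the request: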

import requests
from bs4 import BeautifulSoup

headers = {
            'accept': '*/*',
           # 'accept-encoding': 'gzip, deflate, br',
           'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
           'cache-control': 'no-cache',
           'dnt': '1',
           'pragma': 'no-cache',
           'referer': 'https',
           'sec-fetch-mode': 'no-cors',
           'sec-fetch-site': 'cross-site',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
           }

url = 'https://www.facebook.com/pg/google/about/'

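def fetch(URL):
    try:
        # requests transparently decompresses gzip/deflate bodies before .text decodes them
        response = requests.get(url=URL, headers=headers).text
        print(response)
    except requests.RequestException:
        # Catch only network/HTTP-level failures instead of using a bare except
        print('Could not retrieve data, or connect')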

fetch(url)

Prints the uncompressed page:

<!DOCTYPE html>
<html lang="en" id="facebook" class="no_js">
<head><meta charset="utf-8" /><meta name="referrer" content="origin-when-crossorigin" id="meta_referrer" /><script>window._cstart=+new Date();</script><script>function envFlush(a){function b(b){for(var c in a)b[

...and so on.
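
A less drastic variant of the same fix, as a minimal sketch: instead of dropping the accept-encoding header entirely, keep it but stop offering 'br'. This assumes the unreadable body really is Brotli data that requests could not decode on your machine; the helper name without_brotli is hypothetical and not part of requests.

```python
def without_brotli(headers):
    """Return a copy of `headers` whose Accept-Encoding no longer offers 'br'.

    Hypothetical helper for illustration; it is not part of requests.
    """
    cleaned = dict(headers)  # work on a copy so the caller's dict is untouched
    for name in list(cleaned):
        if name.lower() != 'accept-encoding':
            continue
        codings = [c.strip() for c in cleaned[name].split(',')]
        # Drop 'br', including forms that carry a quality value such as 'br;q=0.8'
        kept = [c for c in codings
                if c and c.split(';')[0].strip().lower() != 'br']
        if kept:
            cleaned[name] = ', '.join(kept)
        else:
            # Nothing left to offer: remove the header and let requests use its default
            del cleaned[name]
    return cleaned


example = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
}
print(without_brotli(example))
# {'accept': '*/*', 'accept-encoding': 'gzip, deflate'}
```

The result can be passed to the request in place of the original dictionary, e.g. requests.get(url, headers=without_brotli(headers)). Installing a Brotli library so that requests can decode 'br' itself should also work, but I have not tested that against this page.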
