Is there a (simple) way to calculate the percentage (physical) space occupied by an ad in a webpage using Python?

Problem description

The problem statement goes this way: Find the % physical occupancy of ads on a webpage.

E.g., say I have a URL which, when opened, shows its content and 3 ads: one is an image ad and the other 2 are 'image and text' ads. (I have been given many such URLs with an unknown number of ads.) I count the ads based on the div class that has 'ad' or 'sponsored' in it, so I know there are 3 ads on the page. Now I need to find the occupancy of these ads as a percentage of the entire web page, i.e., say all three ads together occupy 20% of the page. How do I do it?

I understand that elements don't render the same in different browsers and I actually do not care about that. I just need a rough percentage based on Chrome (or Firefox - anything is okay).

A similar question asked back in 2013, "How to programmatically measure the elements' sizes in HTML source code using python?", has only 2 solutions and not much information. I found the API for the suggested package Ghost (the one the asker accepted as helpful) pretty difficult to understand.

I was asked to 'render a website' using a headless browser, first without ads and then with ads, and find the difference. The problem is, I don't know how. I am also hoping that in the last 8 years someone has come up with a simpler solution to this problem.

Since I am new to using Python for "scraping" in this manner - if it can even be called "scraping" - I could use any resources/ideas/documentations that you might know of.

Tags: python, html, selenium, web-scraping

Solution


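We can get the width and height of any element with the .size attribute.

XPath that matches every element:

//*

The ads are web elements too, so the same .size attribute gives their width and height.

Demo: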

from selenium import webdriver  # the remaining imports are listed after this snippet

driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://stackoverflow.com/questions/68453828/is-there-a-simple-way-to-calculate-the-percentage-physical-space-occupied-by?noredirect=1#comment120979267_68453828")
wait = WebDriverWait(driver, 10)

# Collect the rendered width and height of every element on the page.
widths = []
heights = []
for element in driver.find_elements(By.XPATH, "//*"):
    size = element.size
    widths.append(size['width'])
    heights.append(size['height'])

# The largest box (normally the root <html> element) approximates the page size.
# Summing instead would count nested elements many times over.
page_width = max(widths)
page_height = max(heights)

print(page_width, page_height)

# Now measure the ad; the first <img> stands in for an ad here.

first_ad = wait.until(EC.visibility_of_element_located((By.XPATH, "//img")))
first_ad_size = first_ad.size
first_ad_w, first_ad_h = first_ad_size['width'], first_ad_size['height']

print(first_ad_w, first_ad_h)

total_page_area = page_width * page_height
print(total_page_area)

image_area = first_ad_w * first_ad_h
print(image_area)

# Share of the page occupied by the ad, in percent.
percentage = (image_area * 100) / total_page_area
print(percentage)

driver.quit()

Imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
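
Asking Selenium for the size of every element costs one round trip per element, so it can be slow on large pages. A possible alternative is to read the full document size in a single JavaScript call. The following is only a minimal sketch, assuming an already created Selenium driver; the helper name page_dimensions is hypothetical, while execute_script, scrollWidth and scrollHeight are standard Selenium and DOM features.

```python
def page_dimensions(driver):
    """Return (width, height) of the whole document in CSS pixels.

    `driver` is assumed to be a Selenium WebDriver (anything exposing
    execute_script works). scrollWidth/scrollHeight cover the full
    scrollable document, not just the visible viewport.
    """
    width, height = driver.execute_script(
        "const d = document.documentElement;"
        "return [d.scrollWidth, d.scrollHeight];"
    )
    return width, height
```

Called as `page_width, page_height = page_dimensions(driver)`, the product of the two values gives a rough page area to use as the denominator.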

PS: I have treated the first image as the ad (I know this is not ideal; it is only meant to show the OP one way to do this).

If you can locate all the ads with a generic locator (XPath or CSS), this becomes much easier.
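
With such a locator, the areas of all matched ads can be summed and divided by the page area. The sketch below is an illustration under stated assumptions, not a definitive implementation: ads_percentage is a hypothetical helper, it expects objects that behave like Selenium WebElements (a .size dict and an is_displayed() method), and it does not correct for ads that overlap or are nested inside one another, so the result is only a rough estimate.

```python
def ads_percentage(ad_elements, page_area):
    """Return the share of the page (in percent) covered by the given ads.

    ad_elements: iterable of WebElement-like objects (hypothetical input,
                 e.g. the result of driver.find_elements(...)).
    page_area:   page width * page height, in the same pixel units.
    """
    if page_area <= 0:
        raise ValueError("page_area must be positive")

    total_ad_area = 0
    for element in ad_elements:
        # Hidden ads take up no visible space, so skip them.
        if not element.is_displayed():
            continue
        size = element.size
        total_ad_area += size["width"] * size["height"]

    return 100 * total_ad_area / page_area
```

A possible call, where the selector is an assumption modelled on the class-name heuristic from the question: `ads = driver.find_elements(By.CSS_SELECTOR, "[class*='sponsored']")` followed by `ads_percentage(ads, page_width * page_height)`. A broad pattern such as `[class*='ad']` would also match unrelated classes like `header`, so the selector usually needs tuning per site.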

