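python - Scraping a website with Beautiful Soup produces a lot of whitespace because of ads in the page
Problem description
Here is the link I am trying to scrape, as an example: Livemint news
Here is the function that tries to do it: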
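# The function header was not included in the post; the name and the
# parameter (bs, the parsed BeautifulSoup object) are placeholders.
def get_paragraph_texts(bs):
    t = []
    try:
        temp = []
        data = bs.find_all(class_=['contentSec'])
        # logging.info(data)
        for i in data:
            # collect every <p> inside each contentSec block
            temp = temp + i.find_all('p')
        for i in temp:
            t.append(i.get_text())
    except Exception as e:
        print(e)
    return t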
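What happens is that if I include text=True in the find_all arguments, it skips the paragraphs that contain links (a tags with an href). Without it, I get huge blank gaps in the content field, probably because the ads on the site are also inside p tags. I have attached sample output.
What am I missing?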
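Solution
The data you are looking for (the content of the article) is available directly in the page source, in the div with class mainArea. All you have to do is get the text of that div and clean it up. For the data you need, I don't think there is any need to find the script tag and use the json module. But if you need data such as datePublished, @chitown88's answer is more comprehensive.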
from bs4 import BeautifulSoup
import requests

url = 'https://www.livemint.com/Companies/Ot1UTmQ8EMe0DTWSiJCgfJ/Google-teams-with-HDFC-Bank-ICICI-others-for-instant-loans.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# the article body sits in the div with class "mainArea"
data = soup.find('div', class_='mainArea')

# let's just clean up the data: keep the text before the first blank line
cleaned_data = data.text.split('\n\n')[0].strip()
print(cleaned_data)
Output
New Delhi: Google India on Tuesday said it has rebranded its Indian payments app Tez as Google Pay and is partnering four banks to provide instant loans for the app’s users. In the coming weeks, Google Pay users will be able to access customised loans from HDFC Bank Ltd, ICICI Bank Ltd, Federal Bank and Kotak Mahindra Bank Ltd with minimal paperwork, said Caesar Sengupta, vice-president of Google’s Next Billion Users Initiative and Payments, at the Google for India event in New Delhi. Once users holding accounts with these banks accept the bank’s terms, the money will be transferred to their accounts.“We have learnt that when we build for India, we build for the world, and we believe that many of the innovations and features we have pioneered with Tez will work globally," Caesar Sengupta said.Google Tez, which was launched in September, will also expand services for merchants and retailers. About 15,000 retail stores in India will have Google Pay enabled by Diwali 2018, Caesar Sengupta said.Google claims that over 1.2 million small businesses in India are already using Google Pay. In a bid to help their business grow further, Google is building a dedicated merchant experience where they will be discovered through Google Search and Maps, and communicate with their customers through messages and offers.“We are testing these features with merchants in Bangalore and Delhi, and on-boarding more neighbourhoods in the following months," said Sengupta.Google Pay has rivals in Paytm and Facebook Inc.’s WhatsApp targeting the Indian payments market. On Tuesday, Mint reported that Warren Buffett’s Berkshire Hathaway Inc. has sealed a deal with Paytm, marking the legendary investor’s first investment in the country. A string of other big-name players are also expanding in India’s digital payments market including its banks, India Post Payments Bank, and Mukesh Ambani’s Reliance Jio.“The real competition is actually user habits and cash," said Sengupta. 
“So, all of us (referring to the other players) are all in many ways brothers-in-arms who are trying to move people’s habits away from cash to digital so that we can move India to a digital economy. At Google, we focus on the users so we don’t think so much of the competition."Since its launch, over 55 million people have downloaded Google Tez and more than 22 million people and businesses actively use the app for digital transactions every month, according to company data and figures quoted by Sengupta. Collectively, they have made more than 750 million transactions, with an annual run rate of over $30 billion.The search giant also announced other initiatives including expanding its Google Station internet access programme to 12,000 villages and cities across Andhra Pradesh, potentially reaching 10 million people; the launch of Project Navlekha, where Google will work with Indian publishers to bring more relevant content online; and a new feature in Google Go app that can pull up any webpage and let users listen to it with each word lighting up as it is read.
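As a side note on the two symptoms described in the question, here is a minimal sketch on made-up HTML. The markup and the variable names are illustrative assumptions, not Livemint's actual page. It shows that the string=True filter (the newer spelling of text=True) only matches tags whose .string is not None, which excludes a paragraph that mixes text with an a tag, and that empty paragraphs can instead be dropped after calling get_text.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for an article body (an assumption for this
# sketch): a plain paragraph, a paragraph with a link, and two empty
# paragraphs of the kind an ad slot might leave behind.
html = """
<div class="contentSec">
  <p>Plain paragraph.</p>
  <p>Paragraph with a <a href="https://example.com">link</a> inside.</p>
  <p></p>
  <p>   </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# string=True keeps only tags whose .string is not None. A <p> that mixes
# text with an <a> child has .string == None, so it is skipped.
only_simple = soup.find_all("p", string=True)
print([p.get_text() for p in only_simple])

# Alternative: take every <p>, strip its text, and drop the empty results.
texts = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
non_empty = [t for t in texts if t]
print(non_empty)
```

Filtering after get_text keeps the paragraphs that contain links while discarding the blank ones, which may be preferable to splitting on the first blank line if the page layout changes.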