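python - How to scrape links to images, text, and audio file URLs with Python
Problem description
I am trying to scrape data from the following URL ( http://www.ancient-hebrew.org/m/dictionary/1000.html ).
Each Hebrew word section starts with image URLs, followed by two pieces of text: the actual Hebrew word and its pronunciation. For example, the first entry on the page is "img1 img2 img3 אֶלֶף e-leph". The Hebrew word is Unicode after downloading the HTML with wget.
I am trying to collect this information so that I get the image files first, then the Hebrew word, then the pronunciation. Finally, I want to find the URL of the audio file.
Also, the line for each word seems to start with an < A tag.
I am new to web scraping, so the following is all I have managed to do.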
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = '1000.html'
try:
    page = urlopen(url)
except:
    print("Error opening the URL")
soup = BeautifulSoup(page, 'html.parser')
content = soup.find('<!--501-1000-->', {"<A Name= "})
images = ''
for i in content.findAll('*.jpg'):
    images = images + ' ' + i.text
with open('scraped_text.txt', 'w') as file:
    file.write(images)
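As you can see, my code does not really do the job. In the end, I want to get the information for every word at the URL and save it as a text file or a JSON file, whichever is easier.
For example: Images: URLsOfImages, Hebrew: txt, Pronunciation: txt, URLtoAudio: txt
and so on for the next word.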
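For illustration, one record in the layout described above might look like the sketch below when written as JSON. The field names, the image file names, and the audio file name are placeholders chosen for this example and are not taken from the page; only the Hebrew word and its pronunciation come from the first entry quoted earlier.

```python
import json

# One record per word; the field names are illustrative assumptions.
record = {
    "images": ["img1.jpg", "img2.jpg", "img3.jpg"],  # placeholder file names
    "hebrew": "אֶלֶף",
    "pronunciation": "e-leph",
    "audio": "audio.mp3",  # placeholder file name
}

# ensure_ascii=False keeps the Hebrew text readable instead of \uXXXX escapes.
as_json = json.dumps([record], ensure_ascii=False, indent=2)
print(as_json)

# Reading the JSON back yields the same data.
restored = json.loads(as_json)
```

When writing such output to disk, opening the file with encoding='utf-8' keeps the Hebrew characters intact regardless of the platform default encoding.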
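Solution
I wrote a script that should help you. It contains all the information you asked for. I saved the output as a text file rather than JSON, because with its default settings the json module writes the Hebrew letters as \uXXXX escape sequences instead of readable characters. I know you posted this question a while ago, but I found it today and decided to give it a try. Anyway, here it is: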
import re

import requests
from bs4 import BeautifulSoup

url = 'http://www.ancient-hebrew.org/m/dictionary/1000.html'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')


def images():
    # Gathers all the images (this includes unwanted gifs)
    imgs = soup.find_all('img')
    # Gets the src attribute of each image
    srcs = [img['src'] for img in imgs]
    base_url = 'https://www.ancient-hebrew.org/files/'
    imgs = {}
    section = 0
    # Goes through each source of all the images
    for src in srcs:
        # Checks if it is a gif, these act as a separator
        if src.endswith('.gif'):
            # If it is a gif, change sections (acts as separator)
            section += 1
        else:
            # If it is a letter image, use regex to extract the part of src we want and form full url
            actual_link = re.search(r'files/(.+\.jpg)', src)
            # Skips any image whose src does not match the expected pattern
            if actual_link:
                imgs.setdefault(section, []).append(base_url + actual_link.group(1))
    return imgs


def hebrew_letters():
    # Gets hebrew letters and strips surrounding whitespace
    h_letters = [h_letter.text.strip() for h_letter in soup.find_all('font', attrs={'face': 'arial'})]
    return h_letters


def english_letters():
    # Gets english letters by regex, this part was difficult because these letters are not surrounded by tags in the html
    letters = ''.join(str(content) for content in soup.find('table', attrs={'width': '90%'}).td.contents)
    search_text = re.finditer(r'/font>\s+(.+?)\s+<br/>', letters)
    e_letters = [letter.group(1) for letter in search_text]
    return e_letters


def get_audio_urls():
    # Gets all the mp3 hrefs for the audio part
    base_url = 'https://www.ancient-hebrew.org/m/dictionary/'
    links = soup.find_all('a', href=re.compile(r'\d+\s*\.mp3$'))
    # Removes stray tab characters from each href before forming the full url
    audio_urls = [base_url + link['href'].replace('\t', '') for link in links]
    return audio_urls


def main():
    # Gathers scraped data
    imgs = images()
    h_letters = hebrew_letters()
    e_letters = english_letters()
    audio_urls = get_audio_urls()
    # Opens the text file as utf-8 (due to hebrew letters) and writes one block per word
    with open('scraped_hebrew.txt', 'w', encoding='utf-8') as text_file:
        for img, h_letter, e_letter, audio_url in zip(imgs.values(), h_letters, e_letters, audio_urls):
            text_file.write('Image Urls: ' + ' - '.join(img) + '\n')
            text_file.write('Hebrew Letters: ' + h_letter + '\n')
            text_file.write('English Letters: ' + e_letter + '\n')
            text_file.write('Audio Urls: ' + audio_url + '\n\n')


if __name__ == '__main__':
    main()
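The script above builds full URLs by concatenating a hard-coded base URL with part of each src or href. As a possible alternative, the standard library's urllib.parse.urljoin can resolve a relative link against the page URL. The sketch below only illustrates that idea: the two relative links are hypothetical examples and are not copied from the page, so the real attribute values may look different.

```python
from urllib.parse import urljoin

page_url = "http://www.ancient-hebrew.org/m/dictionary/1000.html"

# Hypothetical attribute values, standing in for real src/href attributes.
relative_links = ["../../files/example.jpg", "audio/1.mp3"]

# urljoin resolves each link relative to the directory of page_url.
absolute_links = [urljoin(page_url, link) for link in relative_links]

for link in absolute_links:
    print(link)
```

This avoids keeping a separate base URL per link type, at the cost of relying on the links in the page being well-formed relative paths.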