amazon-web-services - Scheduling notebooks within EMR and issues installing libraries on AWS
Problem description
We are having some issues with AWS EMR while attempting to create a very simple data pipeline. Our process is usually to make a few API calls, parse the JSON responses, and determine whether additional calls are required. The data would be saved to S3 buckets, and a PySpark job would then combine the data pulled from the multiple APIs into one final joined and cleaned view.
We are facing two challenges with AWS EMR:
1) Is it possible to schedule the notebooks to run periodically, for example once a day? We envision that the EMR cluster would start, a Python job and then a PySpark job would run, and the cluster would terminate once they complete.
2) We could not pip install libraries, and when we attempted an HTTP GET request using the requests library (in a Python notebook, not a PySpark notebook), nothing was returned. It seems as though the notebook did not have an internet connection or was otherwise unable to make outbound requests.
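import requests
# A timeout makes a blocked connection raise an error instead of hanging
r = requests.get('http://www.google.com', timeout=10)
print(r.status_code)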
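Both `pip install` and `requests.get` need outbound network access, so a quick way to narrow down the second problem is to test connectivity directly from the notebook. The sketch below uses only the standard library; the helper name `can_reach` is hypothetical. One possible cause (an assumption, not something confirmed by the question) is that the cluster sits in a private subnet without a NAT gateway, or behind a security group that blocks outbound traffic.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket;
        # DNS failures, refusals and timeouts all surface as OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_reach("pypi.org")` and `can_reach("www.google.com", 80)` from a notebook cell shows whether the hosts used by pip and by the test request are reachable at all. If both return `False`, the problem is likely in the network configuration rather than in the notebook code.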
Solution
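One way to get the "start, run a Python job, run a PySpark job, terminate" behaviour described in the question is a transient cluster launched with boto3's `run_job_flow`, with the notebook logic moved into scripts that run as steps. The following is a minimal sketch under stated assumptions, not a definitive setup: the bucket layout, the script names (`install_libs.sh`, `fetch_api_data.py`, `join_and_clean.py`), the instance types, the release label and the default IAM role names are all placeholders to adapt.

```python
from typing import Any, Dict, List


def command_step(name: str, args: List[str]) -> Dict[str, Any]:
    """Build one EMR step that runs a command through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def build_cluster_request(bucket: str, subnet_id: str) -> Dict[str, Any]:
    """Assemble the keyword arguments for the EMR run_job_flow call."""
    scripts = f"s3://{bucket}/scripts"  # assumed location of the scripts
    fetch_cmd = f"aws s3 cp {scripts}/fetch_api_data.py . && python3 fetch_api_data.py"
    return {
        "Name": "daily-api-pipeline",
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick the one you use
        "Applications": [{"Name": "Spark"}],
        "LogUri": f"s3://{bucket}/emr-logs/",
        "ServiceRole": "EMR_DefaultRole",      # assumed default roles
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "Ec2SubnetId": subnet_id,
            # False makes the cluster terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # The bootstrap script runs on every node before the steps start,
        # e.g. a shell script containing "sudo python3 -m pip install requests".
        "BootstrapActions": [{
            "Name": "install-python-libraries",
            "ScriptBootstrapAction": {"Path": f"{scripts}/install_libs.sh"},
        }],
        "Steps": [
            # Step 1: plain Python job that calls the APIs and writes to S3.
            command_step("fetch-api-data", ["bash", "-c", fetch_cmd]),
            # Step 2: PySpark job that joins and cleans the saved data.
            command_step("join-and-clean", [
                "spark-submit", "--deploy-mode", "cluster",
                f"{scripts}/join_and_clean.py",
            ]),
        ],
    }


def launch(bucket: str, subnet_id: str, region: str = "us-east-1") -> str:
    """Start the transient cluster and return its cluster (job flow) ID."""
    import boto3  # imported here so the request can be built without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**build_cluster_request(bucket, subnet_id))["JobFlowId"]
```

To run this once a day, a common option is an Amazon EventBridge schedule that triggers an AWS Lambda function calling `launch`. Note that the bootstrap script's `pip install` still needs outbound internet access from the cluster's subnet, so the connectivity issue from point 2 has to be resolved first.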