Scheduling notebooks within EMR and issues installing libraries (AWS)

Problem description

We are having some issues with AWS EMR while trying to build a very simple data pipeline. Our process is usually to make a few API calls, parse the JSON responses, and determine whether additional calls are required. The data would be saved to S3 buckets, and a PySpark job would then combine the data pulled from the various APIs into one final joined and cleaned view.

Challenges we are facing with AWS EMR:

1) Is it possible to schedule the notebooks to run periodically, for example once a day? We envision that the EMR cluster would start, a Python job and a PySpark job would run, and the cluster would terminate once they complete.

2) We could not pip install libraries, and when we attempted an HTTP GET request using the requests library (in a Python notebook, not a PySpark notebook), nothing was returned. It seems the notebook either had no internet connection or had trouble making the request.

import requests
r = requests.get('http://www.google.com')
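
A request made without a timeout can look like it "returns nothing" when it is actually hanging on a blocked connection. The following is a minimal sketch for telling those cases apart, assuming only the Python standard library is available (so it does not depend on pip install working). `check_http` is a hypothetical helper name, not part of any library.

```python
import urllib.error
import urllib.request


def check_http(url, timeout=5.0):
    """Return (ok, detail) describing whether `url` is reachable.

    ok is True when any HTTP response came back, including error
    statuses, because that still proves the network path works.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered with an error status, so it was reached.
        return True, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError (DNS failure, refused connection) and timeouts.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    print(check_http("http://www.google.com"))
```

If this reports a timeout or a URLError, one possible cause is that the machine running the notebook kernel sits in a subnet without an outbound route to the internet, which would also explain pip install failing.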

Tags: amazon-web-services, amazon-emr

Solution
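
The start, run, terminate flow described in the question could be approximated with a transient cluster launched through boto3's `run_job_flow`, as sketched below. Note that this swaps scheduled notebooks for plain scripts stored in S3: one step runs the Python fetch script and one runs the PySpark job. Everything specific here is an assumption for illustration: the helper names `build_job_flow` and `launch`, the release label, the instance types and counts, the step names, and the use of the default EMR roles. Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the cluster shut down after the last step.

```python
def build_job_flow(name, log_uri, python_script_uri, pyspark_script_uri,
                   release_label="emr-6.15.0", instance_type="m5.xlarge",
                   subnet_id=None):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All default values are illustrative assumptions, not recommendations.
    """
    def step(step_name, args):
        # command-runner.jar runs an arbitrary command on the primary node.
        return {
            "Name": step_name,
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
        }

    instances = {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": instance_type, "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": instance_type, "InstanceCount": 2},
        ],
        # Terminate the cluster once every step has finished.
        "KeepJobFlowAliveWhenNoSteps": False,
    }
    if subnet_id:
        # The subnet needs an outbound route if the steps call external APIs.
        instances["Ec2SubnetId"] = subnet_id

    fetch_cmd = (f"aws s3 cp {python_script_uri} /tmp/fetch.py "
                 "&& python3 /tmp/fetch.py")
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": instances,
        "Steps": [
            step("fetch-api-data", ["bash", "-c", fetch_cmd]),
            step("join-and-clean",
                 ["spark-submit", "--deploy-mode", "cluster",
                  pyspark_script_uri]),
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(config, region_name):
    """Start the transient cluster and return its cluster (job flow) ID."""
    import boto3  # imported lazily so the builder works without boto3
    emr = boto3.client("emr", region_name=region_name)
    return emr.run_job_flow(**config)["JobFlowId"]
```

To run this once a day, `launch` could be called from something that fires on a schedule, for instance a Lambda function triggered by an EventBridge schedule rule. That wiring is not shown here and would need IAM permissions for `elasticmapreduce:RunJobFlow` and for passing the two roles.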

