amazon-s3 - EMR JupyterHub: S3 persistence for notebooks is not working
Question
I am trying to set up an EMR cluster with JupyterHub and S3 persistence. I have the following classification:
{
"Classification": "jupyter-s3-conf",
"Properties": {
"s3.persistence.enabled": "true",
"s3.persistence.bucket": "my-persistence-bucket"
}
}
I am installing dask with the following step (otherwise, opening a notebook results in a 500 error):
command-runner.jar
- Arguments:
/usr/bin/sudo /usr/bin/docker exec jupyterhub conda install dask
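However, when I open a new notebook, it is not persisted. The bucket stays empty. The cluster does have access to S3: a Spark job run with the same configuration can read from and write to the same bucket.
However, when I look at the jupyter log on my master node, I see this: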
[E 2019-08-07 12:27:14.609 SingleUserNotebookApp application:574] Exception while loading config file /etc/jupyter/jupyter_notebook_config.py
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 562, in _load_config_files
config = loader.load_config()
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/loader.py", line 457, in load_config
self._read_file_as_dict()
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/loader.py", line 489, in _read_file_as_dict
py3compat.execfile(conf_filename, namespace)
File "/opt/conda/lib/python3.6/site-packages/ipython_genutils/py3compat.py", line 198, in execfile
exec(compiler(f.read(), fname, 'exec'), glob, loc)
File "/etc/jupyter/jupyter_notebook_config.py", line 5, in <module>
from s3contents import S3ContentsManager
File "/opt/conda/lib/python3.6/site-packages/s3contents/__init__.py", line 15, in <module>
from .gcsmanager import GCSContentsManager
File "/opt/conda/lib/python3.6/site-packages/s3contents/gcsmanager.py", line 8, in <module>
from s3contents.gcs_fs import GCSFS
File "/opt/conda/lib/python3.6/site-packages/s3contents/gcs_fs.py", line 3, in <module>
import gcsfs
File "/opt/conda/lib/python3.6/site-packages/gcsfs/__init__.py", line 4, in <module>
from .dask_link import register as register_dask
File "/opt/conda/lib/python3.6/site-packages/gcsfs/dask_link.py", line 56, in <module>
register()
File "/opt/conda/lib/python3.6/site-packages/gcsfs/dask_link.py", line 51, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
What am I missing, and what is going wrong?
Solution
It turned out to be a chain reaction: upgrading and installing custom packages broke compatibility. I install additional packages on my cluster with command-runner.jar, and I had some issues there. I could only run one conda install command; the second one failed with "no module named 'conda'".
So I first updated conda by running /usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -n base conda with command-runner.jar. This left jinja2 unable to find markupsafe. Installing markupsafe pulled jupyterhub up to 1.0.0, which broke even more things.
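The traceback in the question shows gcsfs failing because dask.bytes.core has no _filesystems attribute. A minimal sketch of a check that could be run with the container's Python to see whether an installed package still exposes the attribute another package expects; the helper name has_module_attr is hypothetical and not part of any library:

```python
import importlib


def has_module_attr(module_name, attr_name):
    """Return True or False if the module imports, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package (or one of its own imports) is missing in this environment.
        return None
    return hasattr(module, attr_name)


if __name__ == "__main__":
    # gcsfs/dask_link.py in the traceback assigns to dask.bytes.core._filesystems.
    print(has_module_attr("dask.bytes.core", "_filesystems"))
```

A result of False would match the AttributeError in the log, while None would mean dask itself is not importable.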
So here is how I got it to work (executed in order with command-runner.jar):
1. /usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -n base conda
   updates conda.
2. /usr/bin/sudo /usr/bin/docker exec jupyterhub conda install --freeze-installed markupsafe
   installs markupsafe, which is needed after step 1.
3. Install the desired additional packages into the container, always with the --freeze-installed option so that nothing installed by EMR breaks.
4. A custom bootstrap action that runs a script from S3 also installs the packages from step 3 with pip-3.6, so that they work for PySpark (for that, they have to be installed directly on all nodes).
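The command-runner.jar steps above can be sketched as step definitions built in Python. This is only an illustration under assumptions: the function names conda_step and build_steps, the step names, and the ActionOnFailure value are hypothetical choices, and dask is used as the example package because it is the one mentioned in the question.

```python
# Prefix used by every command in the answer to run inside the jupyterhub container.
CONTAINER_PREFIX = ["/usr/bin/sudo", "/usr/bin/docker", "exec", "jupyterhub"]


def conda_step(name, conda_args):
    """Build one EMR step that runs a conda command inside the container."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # assumption: choose what suits your cluster
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": CONTAINER_PREFIX + ["conda"] + list(conda_args),
        },
    }


def build_steps(extra_packages):
    """Return the steps in the order described above: update, markupsafe, extras."""
    steps = [
        conda_step("Update conda", ["update", "-n", "base", "conda"]),
        conda_step("Install markupsafe",
                   ["install", "--freeze-installed", "markupsafe"]),
    ]
    for package in extra_packages:
        # Every extra package is installed with --freeze-installed as well.
        steps.append(conda_step("Install " + package,
                                ["install", "--freeze-installed", package]))
    return steps


if __name__ == "__main__":
    for step in build_steps(["dask"]):
        print(" ".join(step["HadoopJarStep"]["Args"]))
```

A list in this shape could be passed as the Steps argument of boto3's EMR add_job_flow_steps call. The pip-3.6 installation from the bootstrap action is not covered here, since it runs on every node rather than as a step.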