EMR JupyterHub: S3 persistence for notebooks not working

Problem description

I am trying to set up an EMR cluster with JupyterHub and S3 persistence. I have the following classification:

    {
        "Classification": "jupyter-s3-conf",
        "Properties": {
            "s3.persistence.enabled": "true",
            "s3.persistence.bucket": "my-persistence-bucket"
        }
    }
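For context, a classification like this is supplied at cluster creation time. A minimal sketch of the configurations file an `aws emr create-cluster --configurations file://…` call would take (the bucket name is the one from the question; the surrounding `create-cluster` flags are assumptions and not executed here):

```shell
# Write the configurations file create-cluster expects: a JSON array
# of classification objects, same properties as in the question.
cat > configurations.json <<'EOF'
[
  {
    "Classification": "jupyter-s3-conf",
    "Properties": {
      "s3.persistence.enabled": "true",
      "s3.persistence.bucket": "my-persistence-bucket"
    }
  }
]
EOF

# The cluster would then be created along these lines (not executed here):
# aws emr create-cluster --release-label emr-5.x.x \
#   --applications Name=JupyterHub \
#   --configurations file://configurations.json ...
echo "wrote configurations.json"
```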

I am installing dask with the following step (otherwise, opening a notebook results in a 500 error):

However, when I open a new notebook, it is not persisted; the bucket stays empty. The cluster does have access to S3, because a Spark job with the same configuration can read from and write to the same bucket.

Looking at the jupyter log on my master node, I see this:

[E 2019-08-07 12:27:14.609 SingleUserNotebookApp application:574] Exception while loading config file /etc/jupyter/jupyter_notebook_config.py
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 562, in _load_config_files
        config = loader.load_config()
      File "/opt/conda/lib/python3.6/site-packages/traitlets/config/loader.py", line 457, in load_config
        self._read_file_as_dict()
      File "/opt/conda/lib/python3.6/site-packages/traitlets/config/loader.py", line 489, in _read_file_as_dict
        py3compat.execfile(conf_filename, namespace)
      File "/opt/conda/lib/python3.6/site-packages/ipython_genutils/py3compat.py", line 198, in execfile
        exec(compiler(f.read(), fname, 'exec'), glob, loc)
      File "/etc/jupyter/jupyter_notebook_config.py", line 5, in <module>
        from s3contents import S3ContentsManager
      File "/opt/conda/lib/python3.6/site-packages/s3contents/__init__.py", line 15, in <module>
        from .gcsmanager import GCSContentsManager
      File "/opt/conda/lib/python3.6/site-packages/s3contents/gcsmanager.py", line 8, in <module>
        from s3contents.gcs_fs import GCSFS
      File "/opt/conda/lib/python3.6/site-packages/s3contents/gcs_fs.py", line 3, in <module>
        import gcsfs
      File "/opt/conda/lib/python3.6/site-packages/gcsfs/__init__.py", line 4, in <module>
        from .dask_link import register as register_dask
      File "/opt/conda/lib/python3.6/site-packages/gcsfs/dask_link.py", line 56, in <module>
        register()
      File "/opt/conda/lib/python3.6/site-packages/gcsfs/dask_link.py", line 51, in register
        dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
    AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'

What am I missing, and what is going wrong?

Tags: amazon-s3, jupyter-notebook, amazon-emr

Solution


It turned out to be a chain reaction: upgrading and installing custom packages broke compatibility. I install additional packages on my cluster with the command-runner, and that is where the trouble started - I could only run one conda install command; the second one failed with "no module named 'conda'".

So I first updated conda by running /usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -n base conda via the command-runner. That left jinja2 unable to find markupsafe. Installing markupsafe in turn pulled jupyterhub up to 1.0.0, which broke even more things.
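When untangling a chain like this, it helps to see which versions of the packages from the traceback actually ended up inside the container. A hedged one-liner for the EMR master (assumes the jupyterhub container is running, as in the commands above; not part of the original fix):

```shell
# List the versions of the packages involved in the import chain from the
# traceback (s3contents -> gcsfs -> dask) plus the ones the fix touches.
sudo /usr/bin/docker exec jupyterhub conda list | grep -Ei 'dask|gcsfs|s3contents|markupsafe|jinja2|jupyterhub'
```

If dask was upgraded past what gcsfs's dask_link module expects, that mismatch is exactly what surfaces as the `AttributeError` on `dask.bytes.core._filesystems`.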

So here is how I got it to work (executed in order with command-runner.jar):

  1. /usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -n base conda updates conda itself.
  2. /usr/bin/sudo /usr/bin/docker exec jupyterhub conda install --freeze-installed markupsafe installs markupsafe, which is needed after step 1.
  3. Installed the additional packages I wanted into the container, always with the --freeze-installed option to avoid breaking anything installed by EMR.
  4. A custom bootstrap action runs a script from S3 that installs the packages from step 3 with pip-3.6 as well, so they work for PySpark (for that to work, they have to be installed directly on every node).
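The container-side part of the steps above can be collected into one script for reference (a sketch: on a live cluster each line would instead be submitted as a separate command-runner.jar step, and "dask" in step 3 stands in for whatever packages you actually need):

```shell
# Write the three container-side commands to a script and sanity-check it.
cat > emr-jupyterhub-packages.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# 1. Update conda itself inside the jupyterhub container.
/usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -y -n base conda
# 2. Reinstall markupsafe, which jinja2 can no longer find after step 1.
/usr/bin/sudo /usr/bin/docker exec jupyterhub conda install -y --freeze-installed markupsafe
# 3. Extra packages, always with --freeze-installed so EMR's pinned
#    versions survive ("dask" is an example, not from the original post).
/usr/bin/sudo /usr/bin/docker exec jupyterhub conda install -y --freeze-installed dask
EOF

# Validate the script's syntax without executing it.
bash -n emr-jupyterhub-packages.sh && echo "syntax OK"
```

Step 4 then repeats the step-3 package list with pip-3.6 in a bootstrap action, since bootstrap actions run on every node.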
