How to run Hadoop as part of a Spring application's test suite?

Problem Description

I want to set up a simple "Hello, World!" example to learn how to use basic Hadoop functionality, such as storing and reading files with HDFS.

Is it possible to do this?

I want to set up a minimal Spring Boot application for this. What is the minimal Spring configuration required? There are plenty of examples showing how to read/write files with HDFS, but I still can't work out the Spring configuration I need. It is hard to figure out which libraries one actually needs, because the Spring Hadoop samples seem to be outdated. Any help would be much appreciated.

Tags: spring, hadoop

Solution


You can easily use the Hadoop FileSystem API with any local POSIX filesystem, without a Hadoop cluster. The Hadoop API is very generic and provides many concrete implementations for different storage systems such as HDFS, S3, Azure Data Lake Store, etc.
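
As a quick illustration, here is a minimal sketch (my own example, not from the original answer; the class name and file path are made up) that exercises the same FileSystem API against the plain local filesystem, with no cluster at all:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalFsHelloWorld {
    public static void main(String[] args) throws IOException {
        // "file:///" selects the local filesystem implementation of the API.
        FileSystem fs = FileSystem.get(URI.create("file:///"), new Configuration());

        // Hypothetical target file under the system temp directory.
        Path file = new Path(System.getProperty("java.io.tmpdir"), "hello.txt");

        // Write a small text file (overwrite if it already exists).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, World!");
        }

        // Read it back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF()); // prints "Hello, World!"
        }
    }
}

Pointing the URI at an hdfs:// address instead would run the identical code against a real cluster, which is exactly the genericity the answer describes.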

You can embed HDFS within your application (i.e. run the NameNode and DataNodes within a single JVM process), but this is only reasonable for tests. There is the Hadoop Minicluster, which you can start from the command line (CLI MiniCluster) or via the Java API in your unit tests with the MiniDFSCluster class found in the hadoop-minicluster package.

You can start the mini-cluster with Spring by making a separate configuration class for it and using it via @ContextConfiguration in your unit tests.

import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.rules.TemporaryFolder;
import org.springframework.beans.factory.config.BeanDefinition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Primary;
import org.springframework.context.annotation.Scope;

// Spring's @Configuration is written fully qualified because the simple name
// clashes with org.apache.hadoop.conf.Configuration imported above.
@org.springframework.context.annotation.Configuration
public class MiniClusterConfiguration {

    // JUnit's TemporaryFolder keeps the mini-cluster's data directory out of
    // the project tree and cleans it up when the context is closed.
    @Bean(name = "temp-folder", initMethod = "create", destroyMethod = "delete")
    public TemporaryFolder temporaryFolder() {
        return new TemporaryFolder();
    }

    // Hadoop configuration that points the mini-cluster at the temp folder.
    @Bean
    public Configuration configuration(final TemporaryFolder temporaryFolder) {
        final Configuration conf = new Configuration();
        conf.set(
            MiniDFSCluster.HDFS_MINIDFS_BASEDIR,
            temporaryFolder.getRoot().getAbsolutePath()
        );
        return conf;
    }

    // The embedded HDFS cluster itself: NameNode and DataNode in this JVM.
    @Bean(destroyMethod = "shutdown")
    public MiniDFSCluster cluster(final Configuration conf) throws IOException {
        final MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
            .clusterId(String.valueOf(this.hashCode()))
            .build();
        cluster.waitClusterUp();
        return cluster;
    }

    // FileSystem handle backed by the mini-cluster's HDFS.
    @Bean
    public FileSystem fileSystem(final MiniDFSCluster cluster) throws IOException {
        return cluster.getFileSystem();
    }

    // A fresh temporary HDFS directory per injection point (prototype scope).
    @Bean
    @Primary
    @Scope(BeanDefinition.SCOPE_PROTOTYPE)
    public Path temp(final FileSystem fs) throws IOException {
        final Path path = new Path("/tmp", UUID.randomUUID().toString());
        fs.mkdirs(path);
        return path;
    }
}

You will inject the FileSystem and a temporary Path into your tests, and, as mentioned above, from an API standpoint there is no difference whether it is a real cluster, a mini-cluster, or the local filesystem. Note that starting the cluster has a cost, so you likely want to annotate your tests with @DirtiesContext(classMode = DirtiesContext.ClassMode.AFTER_CLASS), so that the cluster is kept alive across the test methods of a class rather than being restarted for each test.
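
For illustration, here is a minimal test sketch (the test class and file names are hypothetical, not from the original answer) that wires in the configuration above and round-trips a file through the embedded HDFS:

import static org.junit.Assert.assertEquals;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.annotation.DirtiesContext;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(classes = MiniClusterConfiguration.class)
@DirtiesContext(classMode = DirtiesContext.ClassMode.AFTER_CLASS)
public class HdfsHelloWorldTest {

    @Autowired
    private FileSystem fs;   // backed by the mini-cluster

    @Autowired
    private Path temp;       // fresh HDFS directory (prototype-scoped bean)

    @Test
    public void writesAndReadsAFile() throws IOException {
        Path file = new Path(temp, "hello.txt");

        // Write a small text file into the embedded HDFS.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, World!");
        }

        // Read it back and verify the contents.
        try (FSDataInputStream in = fs.open(file)) {
            assertEquals("Hello, World!", in.readUTF());
        }
    }
}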

If you want this code to run on Windows, you will need a compatibility layer called winutils (which makes it possible to access the Windows filesystem in a POSIX-like way). You have to point the HADOOP_HOME environment variable to it and, depending on the version, load its shared library:

// Tell Hadoop where the winutils binaries live.
String HADOOP_HOME = System.getenv("HADOOP_HOME");
System.setProperty("hadoop.home.dir", HADOOP_HOME);
System.setProperty("hadoop.tmp.dir", System.getProperty("java.io.tmpdir"));
// Load the native Hadoop library shipped with winutils.
final String lib = String.format("%s/lib/hadoop.dll", HADOOP_HOME);
System.load(lib);
