Setting Up a Hadoop Cluster

Hadoop architecture

HDFS + MapReduce = Hadoop
MapReduce = Mapper + Reducer

The Hadoop ecosystem

Prepare four nodes running CentOS 7.3, with the following roles:
192.168.135.170 NameNode,SecondaryNameNode,ResourceManager
192.168.135.171 DataNode,NodeManager
192.168.135.169 DataNode,NodeManager
192.168.135.172 DataNode,NodeManager

1. Edit /etc/hosts on every node

# vim /etc/hosts
192.168.135.170     node1 master
192.168.135.171     node2
192.168.135.169     node3
192.168.135.172     node4
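
Every later step refers to the nodes by name, so it may be worth confirming that the hosts file on each node is complete. The following is a minimal sketch, assuming the four entries above; the function name `check_hosts` is illustrative and not part of Hadoop or CentOS.

```shell
#!/usr/bin/env bash
# check_hosts FILE: verify that FILE maps each cluster name to its expected IP.
# The name/IP pairs below mirror the hosts entries in this guide.
check_hosts() {
    local file=$1 rc=0 ip name found
    while read -r ip name; do
        # Find the first non-comment line that lists NAME as a hostname or alias.
        found=$(awk -v n="$name" '
            $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
        ' "$file")
        if [ "$found" != "$ip" ]; then
            echo "MISMATCH: $name -> ${found:-missing} (expected $ip)"
            rc=1
        fi
    done <<'EOF'
192.168.135.170 node1
192.168.135.170 master
192.168.135.171 node2
192.168.135.169 node3
192.168.135.172 node4
EOF
    return $rc
}
```

Running `check_hosts /etc/hosts && echo "hosts OK"` on each node should report any name that is missing or points at the wrong address.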

2. Synchronize the clocks

# yum install -y ntp ntpdate && ntpdate pool.ntp.org

3. Install Java

# yum install -y java java-1.8.0-openjdk-devel
# vim /etc/profile.d/java.sh
export JAVA_HOME=/usr
# source /etc/profile.d/java.sh

4. Set the environment variables on every node

# vim /etc/profile.d/hadoop.sh
export HADOOP_PREFIX=/bdapps/hadoop
export PATH=$PATH:${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin
export HADOOP_YARN_HOME=${HADOOP_PREFIX}
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
# source /etc/profile.d/hadoop.sh

# scp /etc/profile.d/hadoop.sh node2:/etc/profile.d/hadoop.sh
# scp /etc/profile.d/hadoop.sh node3:/etc/profile.d/hadoop.sh
# scp /etc/profile.d/hadoop.sh node4:/etc/profile.d/hadoop.sh

5. Create the hadoop user (on every node)

# useradd hadoop
# echo 'hadoop' | passwd --stdin hadoop

6. Set up passwordless SSH (as the hadoop user on node1)

# su - hadoop
$ ssh-keygen
$ ssh-copy-id node1
$ ssh-copy-id node2
$ ssh-copy-id node3
$ ssh-copy-id node4
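
The four `ssh-copy-id` calls above can be folded into a loop that also confirms key-based login afterwards. This is only a sketch: the helper names `run` and `setup_trust` and the `DRY_RUN` switch are assumptions made for illustration. With the default `DRY_RUN=1` the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# Nodes as named in /etc/hosts in this guide.
NODES="node1 node2 node3 node4"

# run CMD...: print the command when DRY_RUN=1 (the default here), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Copy the hadoop user's public key to every node, then confirm that a
# non-interactive login works (BatchMode makes ssh fail instead of prompting).
setup_trust() {
    local node
    for node in $NODES; do
        run ssh-copy-id "$node"
        run ssh -o BatchMode=yes "$node" true
    done
}
```

After reviewing the printed commands, `DRY_RUN=0 setup_trust` would perform the real key distribution as the hadoop user.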

7. Configure the master node (node1)

a. Create the directories

# mkdir -pv /bdapps
# mkdir -pv /data/hadoop/hdfs/{nn,snn,dn}
# chown hadoop.hadoop -R /data/hadoop/hdfs

b. Download and unpack the release

# wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
# tar xvf hadoop-2.6.5.tar.gz -C /bdapps/
# cd /bdapps/
# ln -sv hadoop-2.6.5/ hadoop
# cd hadoop
# mkdir logs
# chmod g+w logs
# chown -R hadoop.hadoop /bdapps/hadoop-2.6.5

c. Configure the NameNode

# cd etc/hadoop/
# vim core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.135.170:8020</value>
        <final>true</final>
    </property>
</configuration>

d. Configure YARN

# vim yarn-site.xml
<configuration>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>192.168.135.170:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>192.168.135.170:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>192.168.135.170:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>192.168.135.170:8033</value>
    </property> 
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>192.168.135.170:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
</configuration>

e. Configure HDFS

# vim hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hadoop/hdfs/nn</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hadoop/hdfs/dn</value>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.edits.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
</configuration>
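
A typo in a property name is silently ignored by Hadoop, so it can help to read values back out of the file after editing. On a working install, `hdfs getconf -confKey dfs.replication` prints the effective value. The sketch below is a standalone alternative that needs only awk; the function name `get_prop` is hypothetical, and it assumes the one-tag-per-line layout used in this guide rather than parsing XML in general.

```shell
#!/usr/bin/env bash
# get_prop FILE NAME: print the <value> of the property whose <name> is NAME.
# Assumes <name> and <value> each sit on their own line, as in this guide.
get_prop() {
    awk -v want="$2" '
        /<name>/  { n = $0; sub(/.*<name>/, "", n);  sub(/<\/name>.*/, "", n) }
        /<value>/ { v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
                    if (n == want) { print v; exit } }
    ' "$1"
}
```

For example, `get_prop hdfs-site.xml dfs.replication` should print the replication factor set above, and an empty result suggests the property name is misspelled or absent.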

f. Configure the MapReduce framework

# cp mapred-site.xml.template mapred-site.xml
# vim mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

g. Define the slaves

# vim slaves 
192.168.135.171
192.168.135.169
192.168.135.172

8. Configure node2, node3, and node4

a. Create the directories

# mkdir -pv /bdapps
# mkdir -pv /data/hadoop/hdfs/{nn,snn,dn}
# chown hadoop.hadoop -R /data/hadoop/hdfs

b. Download and unpack the release

# wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
# tar xvf hadoop-2.6.5.tar.gz -C /bdapps/
# cd /bdapps/
# ln -sv hadoop-2.6.5/ hadoop
# cd hadoop
# mkdir logs
# chmod g+w logs
# chown -R hadoop.hadoop /bdapps/hadoop-2.6.5

c. Copy the configuration files from node1 (run these commands on node1)

# su - hadoop
$ scp /bdapps/hadoop/etc/hadoop/* node2:/bdapps/hadoop/etc/hadoop/
$ scp /bdapps/hadoop/etc/hadoop/* node3:/bdapps/hadoop/etc/hadoop/
$ scp /bdapps/hadoop/etc/hadoop/* node4:/bdapps/hadoop/etc/hadoop/

9. Format HDFS (run as the hadoop user on the master node)

# su - hadoop
$ hdfs --help
http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
$ hdfs namenode -format
common.Storage: Storage directory /data/hadoop/hdfs/nn has been successfully formatted.
$ ll /data/hadoop/hdfs/nn/current/

10. Start Hadoop (there are two ways)

a. Start each service individually on each node

The master node needs to run the HDFS NameNode service and the YARN ResourceManager service (and, in this layout, the SecondaryNameNode as well).

$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start secondarynamenode
$ yarn-daemon.sh start resourcemanager

Each slave node needs to run the HDFS DataNode service and the YARN NodeManager service.

$ hadoop-daemon.sh start datanode
$ yarn-daemon.sh start nodemanager

b. Start every node in the cluster from the master node using the control scripts

$ start-dfs.sh
Starting namenodes on [node1]
node1: starting namenode, logging to /bdapps/hadoop/logs/hadoop-hadoop-namenode-node1.out
192.168.135.172: starting datanode, logging to /bdapps/hadoop/logs/hadoop-hadoop-datanode-node4.out
192.168.135.171: starting datanode, logging to /bdapps/hadoop/logs/hadoop-hadoop-datanode-node2.out
192.168.135.169: starting datanode, logging to /bdapps/hadoop/logs/hadoop-hadoop-datanode-node3.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 38:28:13:e9:f0:e7:06:37:b9:3e:96:b5:ce:b9:06:fb.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /bdapps/hadoop/logs/hadoop-hadoop-secondarynamenode-node1.out

Try uploading a file

$ hdfs dfs -ls /
$ hdfs dfs -mkdir /test
$ hdfs dfs -put /etc/fstab /test/
$ hdfs dfs -ls -R /
drwxr-xr-x   - hadoop supergroup          0 2017-04-06 02:19 /test
-rw-r--r--   2 hadoop supergroup        541 2017-04-06 02:19 /test/fstab
$ hdfs dfs -cat /test/fstab

Check HDFS status

http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfsadmin

-report [-live] [-dead] [-decommissioning]: Reports basic filesystem information and statistics. Optional flags may be used to filter the list of displayed DataNodes.

$ hdfs dfsadmin -report

Check YARN status
Hadoop 2 introduced the YARN framework. Each slave node is managed through its NodeManager, and a node joins the cluster as soon as its NodeManager process starts. YARN must be running before the nodes can be listed, so start it first.

$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /bdapps/hadoop/logs/yarn-hadoop-resourcemanager-node1.out
192.168.135.172: starting nodemanager, logging to /bdapps/hadoop/logs/yarn-hadoop-nodemanager-node4.out
192.168.135.171: starting nodemanager, logging to /bdapps/hadoop/logs/yarn-hadoop-nodemanager-node2.out
192.168.135.169: starting nodemanager, logging to /bdapps/hadoop/logs/yarn-hadoop-nodemanager-node3.out

$ yarn node -list
17/04/07 03:33:33 INFO client.RMProxy: Connecting to ResourceManager at /192.168.135.170:8032
Total Nodes:3
         Node-Id       Node-State Node-Http-Address Number-of-Running-Containers
     node4:46842          RUNNING        node4:8042                            0
     node2:35812          RUNNING        node2:8042                            0
     node3:33280          RUNNING        node3:8042                            0

Processes on the master node
$ jps
2272 NameNode
2849 ResourceManager
2454 SecondaryNameNode
3112 Jps

Processes on a slave node
$ jps
12192 Jps
12086 NodeManager
11935 DataNode
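
Comparing `jps` output by eye gets tedious across four nodes. A minimal sketch of an automated check follows; the function name `missing_daemons` is an assumption for illustration, and it relies only on `jps` printing one "pid class-name" pair per line, as in the listings above.

```shell
#!/usr/bin/env bash
# missing_daemons NAME...: read `jps` output on stdin and print every expected
# daemon name that does not appear in it; succeeds only when none are missing.
missing_daemons() {
    local out name rc=0
    out=$(cat)
    for name in "$@"; do
        # jps prints "<pid> <MainClass>", so match the second column exactly
        # (an exact match keeps NameNode distinct from SecondaryNameNode).
        if ! printf '%s\n' "$out" | awk -v n="$name" '$2 == n { ok = 1 } END { exit !ok }'; then
            echo "$name"
            rc=1
        fi
    done
    return $rc
}
```

On the master this could be run as `jps | missing_daemons NameNode SecondaryNameNode ResourceManager`, and on a slave as `jps | missing_daemons DataNode NodeManager`; no output means all expected daemons are up.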

11. Check the web UIs

$ netstat -tnlp

a. HDFS web UI
http://192.168.135.170:50070

b. YARN web UI
http://192.168.135.170:8088
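
Both web servers also expose machine-readable endpoints on the same ports, which allows a quick check from the command line without a browser. The sketch below assumes the master address and default ports used in this guide and that `curl` is installed; the function names `webui_urls` and `probe_webuis` and the `MASTER` variable are illustrative.

```shell
#!/usr/bin/env bash
# Master address as used throughout this guide; override to probe another host.
MASTER=${MASTER:-192.168.135.170}

# webui_urls: print one status URL per service, using the ports shown above.
webui_urls() {
    echo "http://${MASTER}:50070/jmx"                  # NameNode metrics as JSON
    echo "http://${MASTER}:8088/ws/v1/cluster/info"    # ResourceManager REST API
}

# probe_webuis: request each URL and print its HTTP status code next to it.
probe_webuis() {
    local url
    webui_urls | while read -r url; do
        curl -s -o /dev/null -w "%{http_code} $url\n" --max-time 5 "$url"
    done
}
```

Calling `probe_webuis` should print a 200 status for each service that is up; a code of 000 usually means the port is closed or blocked by a firewall.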

12. Run a test job

# su - hadoop
$ cd /bdapps/hadoop/share/hadoop/mapreduce
$ yarn jar hadoop-mapreduce-examples-2.6.5.jar 
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

$ yarn jar hadoop-mapreduce-examples-2.6.5.jar wordcount /test/fstab /test/fstab.out
17/04/06 02:40:06 INFO client.RMProxy: Connecting to ResourceManager at /192.168.135.170:8032
17/04/06 02:40:12 INFO input.FileInputFormat: Total input paths to process : 1
17/04/06 02:40:12 INFO mapreduce.JobSubmitter: number of splits:1
17/04/06 02:40:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1491416651117_0001
17/04/06 02:40:14 INFO impl.YarnClientImpl: Submitted application application_1491416651117_0001
17/04/06 02:40:17 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1491416651117_0001/
17/04/06 02:40:17 INFO mapreduce.Job: Running job: job_1491416651117_0001
17/04/06 02:40:47 INFO mapreduce.Job: Job job_1491416651117_0001 running in uber mode : false
17/04/06 02:40:47 INFO mapreduce.Job:  map 0% reduce 0%
17/04/06 02:41:19 INFO mapreduce.Job:  map 100% reduce 0%
17/04/06 02:41:33 INFO mapreduce.Job:  map 100% reduce 100%
17/04/06 02:41:34 INFO mapreduce.Job: Job job_1491416651117_0001 completed successfully
17/04/06 02:41:34 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=585
		FILE: Number of bytes written=215501
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=644
		HDFS: Number of bytes written=419
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=29830
		Total time spent by all reduces in occupied slots (ms)=10691
		Total time spent by all map tasks (ms)=29830
		Total time spent by all reduce tasks (ms)=10691
		Total vcore-milliseconds taken by all map tasks=29830
		Total vcore-milliseconds taken by all reduce tasks=10691
		Total megabyte-milliseconds taken by all map tasks=30545920
		Total megabyte-milliseconds taken by all reduce tasks=10947584
	Map-Reduce Framework
		Map input records=12
		Map output records=60
		Map output bytes=648
		Map output materialized bytes=585
		Input split bytes=103
		Combine input records=60
		Combine output records=40
		Reduce input groups=40
		Reduce shuffle bytes=585
		Reduce input records=40
		Reduce output records=40
		Spilled Records=80
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=281
		CPU time spent (ms)=8640
		Physical memory (bytes) snapshot=291602432
		Virtual memory (bytes) snapshot=4209983488
		Total committed heap usage (bytes)=149688320
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=541
	File Output Format Counters 
		Bytes Written=419

$ hdfs dfs -ls /test/fstab.out
Found 2 items
-rw-r--r--   2 hadoop supergroup          0 2017-04-06 02:41 /test/fstab.out/_SUCCESS
-rw-r--r--   2 hadoop supergroup        419 2017-04-06 02:41 /test/fstab.out/part-r-00000

$ hdfs dfs -cat /test/fstab.out/part-r-00000
#	7
'/dev/disk'	1
/	1
/boot	1
/dev/mapper/cl-home	1
/dev/mapper/cl-root	1
/dev/mapper/cl-swap	1
/etc/fstab	1
/home	1
0	8
01:15:45	1
11	1
2017	1
Accessible	1
Created	1
Mar	1
Sat	1
See	1
UUID=b76be3cf-613c-478a-ab8b-d1eaa67a061a	1
anaconda	1
and/or	1
are	1
blkid(8)	1
by	2
defaults	4
filesystems,	1
findfs(8),	1
for	1
fstab(5),	1
info	1
maintained	1
man	1
more	1
mount(8)	1
on	1
pages	1
reference,	1
swap	2
under	1
xfs	3
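
To sanity-check the job output, the same count can be reproduced locally with a plain pipeline, which also illustrates the Mapper + Reducer split from the architecture section: tokenizing is the map step, and grouping plus counting is the reduce step. This is a sketch rather than the example job's actual implementation; the function name `wordcount_local` is an assumption, and it splits on any whitespace, which should match the job's tokenization for ordinary ASCII text.

```shell
#!/usr/bin/env bash
# wordcount_local: read text on stdin, print "token<TAB>count" sorted by token.
wordcount_local() {
    (
        export LC_ALL=C                    # byte-order sort, independent of locale
        tr -s '[:space:]' '\n' |           # "map": one whitespace-separated token per line
            sed '/^$/d' |                  # drop the empty line left by leading whitespace
            sort | uniq -c |               # "shuffle" and "reduce": group and count tokens
            awk '{ printf "%s\t%d\n", $2, $1 }'
    )
}
```

On the cluster, `hdfs dfs -cat /test/fstab | wordcount_local` can be compared against `hdfs dfs -cat /test/fstab.out/part-r-00000`; the two listings should agree line for line.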
