This guide is shows step by step how to set up a multi nod cluster with Hadoop and HDFS 2.4.1 on Ubuntu 14.04. It is an update, and takes many parts from previous guides about installing Hadoop&HDFS versions 2.2 and 2.3 on Ubuntu.
The text here is quite lengthy, I will soon provide a script to auomate some parts.
Assume we have a 3 nodes cluster, my test case is the following (with IP addresses and shortnames) :
10.10.10.104 mynode1 10.10.10.105 mynode2 10.10.10.106 mynode3
Make sure you have Oracle JDK 7 or 8 installed. The following are the commands for java 8, to install java 7 you just need to change the version number
sudo add-apt-repository ppa:webupd8team/java -y sudo apt-get update sudo apt-get install oracle-java8-installer sudo apt-get install oracle-java8-set-default
Note: I know some of you are trying to run this guide with debian. I am not sure how much of these guide will apply to that OS, but for this specific case, for debian, the instructions to install Java 8 are here.
While we are installing software, you can find useful to install alsoscreento start sessions of work on remote servers, andnmapto check server ports in case something is not working in the cluster networking
sudo apt-get install screen nmap
Repeat this installation procedure, up to this point, on every node you have in the cluster
The following will be necessary only on the first node: Then we start a screen to work remotely without fear of losing work if disconnected.
screen -S installing
After the-Syou can put whatever name for your sessions
Now we are going to actually install the software needed withmavenwith libraries to compile hdfs&hadoop.
sudo apt-get install maven build-essential zlib1g-dev cmake pkg-config libssl-dev protobuf-compiler
Among these files,protocor also calledprotobuf-compilermay cause some problems with the version depending on your operating system version. In that case, you can compile and install the correct version (2.5.0) from the source.
Next, let‘s createhadoopgroup and the userhduser, which will be also in the sudoers, the following commands have to be run one at at time. In the second step theadduserwill also ask the login password forhduser:
sudo addgroup hadoop sudo adduser --ingroup hadoop hduser sudo adduser hduser sudo
Repeat this procedure, up to this point, on every node you have in the cluster
We now log in as the newhduseron one node and we will create SSH keys to access the other servers:
sudo su - hduser
From now on, in the rest of this guide, all commands will be run as thehduser.
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now let‘s copy these files on the other nodes, e.g, frommynode1tomynode2andmynode3
scp -r ~/.ssh hduser@10.10.10.106:~/
The following steps will be needed only once. Download hadoop2.Xstable, to do so you navigate in the List of Mirrors select one and decide what version to download. Withwgetyou can run something like the following for hadoop2.4.1- from europe:
wget http://www.eu.apache.org/dist/hadoop/core/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz
From the U.S. instead
wget http://apache.mirror.anlx.net/hadoop/core/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz
Once it has been downloaded, unpack it
tar -xvf hadoop-2.4.1-src.tar.gz
Then enter the directory and compile
cd hadoop-2.4.1-src/ mvn package -Pdist,native -Dmaven.javadoc.skip=true -DskipTests -Dtar
Notice that, if you are behind a proxy, maven needs asettings.xml file in the configuration directory in~/.m2that contains the basic information of your proxy configuration.
Compiled files will be found inhadoop-dist/target/hadoop-2.4.1.tar.gzjust put them in the home
mv hadoop-dist/target/hadoop-2.4.1.tar.gz ~/
Now let‘s copy these files on the other nodes, e.g, frommynode1tomynode2andmynode3
scp ~/hadoop-2.4.1.tar.gz hduser@10.10.10.105:~/ scp ~/hadoop-2.4.1.tar.gz hduser@10.10.10.106:~/
The following steps will be needed on all the machines We unpack the compiled version and put it n/usr/localand we create a shortcut called/usr/local/hadoop
sudo tar -xvf ~/hadoop-2.4.1.tar.gz -C /usr/local/ sudo ln -s /usr/local/hadoop-2.4.1 /usr/local/hadoop sudo chown -R hduser:hadoop /usr/local/hadoop-2.4.1
The following steps will be needed on all the machines We update the profile of the shell, i.e., we edit the.profilefile to put some enviroment variables, in order to upset equallyvimandemacsuser we will use a text editor callednano
nano ~/.profile
And we add, at the end, the following
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::") export HADOOP_INSTALL=/usr/local/hadoop export HADOOP_HOME=$HADOOP_INSTALL export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop" export YARN_HOME=$HADOOP_INSTALL alias hfs="hdfs dfs"
(To saveCTRL+oENTERandCTRL+x)
Note: If you installed somewhere else hadoop, chec the proper directory path for$HADOOP_INSTALL, but do not change$HADOOP_CONF_DIR.
Now we made the edit operative by reloading the.profilefile with
source ~/.profile
We also have to edithadoop-env.shfiles with for the same$JAVA_HOMEvariable, that they seem not able to set up properly, so we open the file in
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
and around line 27 we can replace
export JAVA_HOME=${JAVA_HOME}
with
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
If you want to be sure it worked, you can paste some values, like
echo $JAVA_HOME echo $HADOOP_HOME
We create the directory wherehdfsdata files and logs will be stored, you can create them wherever you like
The first directory is actually needed only on the NameNode (main) machine
mkdir -pv /usr/local/hadoop/data/namenode
These steps will be needed on all the machines
mkdir -pv /usr/local/hadoop/data/datanode mkdir -pv $HADOOP_INSTALL/logs
These steps will be needed only on the main machine, then we will copy the entire conf directory on the other machines
Then we put this information in thehdfs-site.xmlfile with
nano $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml
And paste the following between<configuration>tag:
<property> <name>dfs.datanode.data.dir</name> <value>file:///usr/local/hadoop/data/datanode</value> <description>DataNode directory</description> </property> <property> <name>dfs.namenode.name.dir</name> <value>file:///usr/local/hadoop/data/namenode</value> <description>NameNode directory for namespace and transaction logs storage.</description> </property>
The following are additional configuration parameters to put alongside the previous ones, among them the replication parameter to match the number redundant copy we want - it does not necessarily match the number of nodes in the cluster.
<property> <name>dfs.replication</name> <value>2</value> </property> <property> <name>dfs.permissions</name> <value>false</value> </property> <property> <name>dfs.datanode.use.datanode.hostname</name> <value>false</value> </property> <property> <name>dfs.namenode.datanode.registration.ip-hostname-check</name> <value>false</value> </property>
Notice: when you will start your HDFS distributed filesystem, you will have a mainNameNodeand aSecondary NameNode. TheSecondary NameNodeis *not what you think it is*.
The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event it can replace the primary name-node in case of its failure. – From Hadoop FAQ
In any case you may want to put the secondary name node on a different machine that is not the master, but maybe one of the workers. Assume you decide your cluster main node is
10.10.10.104 mynode1
and assume you decide your cluster to have the Secondary NameNode on
10.10.10.105 mynode2
then we add the following to thehdfs-site.xmlfile :
<property> <name>dfs.namenode.http-address</name> <value>10.10.10.104:50070</value> <description>Your NameNode hostname for http access.</description> </property> <property> <name>dfs.namenode.secondary.http-address</name> <value>10.10.10.105:50090</value> <description>Your Secondary NameNode hostname for http access.</description> </property>
I thank my colleague Sabeur for helping me with this bit on the Secondary NameNode
Then we also point tomynode1IP to for the Hadoop cluster to tell where we host the hadoopNameNodeby editing:
nano $HADOOP_INSTALL/etc/hadoop/core-site.xml
and we add inside the<configuration>tag the following
<property> <name>fs.defaultFS</name> <value>hdfs://10.10.10.104/</value> <description>NameNode URI</description> </property>
We put the IP addresses of the nodes to be used asDataNodesin theslavesfile, we open it with
nano $HADOOP_INSTALL/etc/hadoop/slaves
And we put the list of server addresses one per line, note that in this case also the master is used, so we put there the following list:
10.10.10.104 10.10.10.105 10.10.10.106
Up to here was mainly aboutHDFS, now we configure theyarncluster, i.e., the execution engine, we then edit theyarn-site.xml.
nano $HADOOP_INSTALL/etc/hadoop/yarn-site.xml
Again we add the following inside the<configuration>tag
<property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name> <value>org.apache.hadoop.mapred.ShuffleHandler</value> </property> <property> <name>yarn.resourcemanager.resource-tracker.address</name> <value>10.10.10.104:8025</value> </property> <property> <name>yarn.resourcemanager.scheduler.address</name> <value>10.10.10.104:8030</value> </property> <property> <name>yarn.resourcemanager.address</name> <value>10.10.10.104:8050</value> </property>
Now is time to update all the nodes with this news configuration, thus we copy frommynode1tomynode2andmynode3the directory with the following command (note the destination directory)
scp -r $HADOOP_INSTALL/etc/hadoop hduser@10.10.10.105:$HADOOP_INSTALL/etc/ scp -r $HADOOP_INSTALL/etc/hadoop hduser@10.10.10.106:$HADOOP_INSTALL/etc/
These commands will be used only on the main node
If all went well we should be able to run the following command
hadoop version
and obtain something like
Hadoop 2.4.1 Subversion Unknown -r Unknown Compiled by hduser on 2014-08-23T15:29Z Compiled with protoc 2.5.0 From source with checksum bb7ac0a3c73dc131f4844b873c74b630 This command was run using /usr/local/hadoop-2.4.1/share/hadoop/common/hadoop-common-2.4.1.jar
Now the first step is toformatthe NameNode, this will basically initialize thehdfsfile system. So on the main node you run:
hdfs namenode -format
Hadoop NameNode is the centralized place of an HDFS file system which keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. In short, it keeps the metadata related to datanodes. When we format namenode it formats the meta-data related to data-nodes. – From StackOverflow
These commands will be used only on the main node Now we can start thehdfscluster with the command
start-dfs.sh
And if the preivious command didn‘t complain about anythign, we can create a random directory in ourHDFSfilesystem with
hadoop fs -mkdir -p /datastore
Note that we used the fullhadoop fscommand, but in our profile we added an alias withhfs
now check the size of the files inside thedatanodedirectory
du -sh /usr/local/hadoop/data/datanode
and we can put inside a new directory and, as a test, the.tar.gzfile of hadoop
hfs -mkdir -p /datastore/test hfs -copyFromLocal ~/hadoop-2.4.1.tar.gz /datastore/
now check again the size of the files inside thedatanodedirectory, you can run the same command on all nodes, and see that the file is also on those other servers (all of it or part, it depends on the replication level and the number of nodes you have)
du -sh /usr/local/hadoop/data/datanode
You can check the content of thehdfsdirectory with
hfs -ls /datastore
and remove the all the files with
hfs -rm /datastore/test/*
In case you want to delete an entire directory you can instead use
hfs -rm -r /datastore/test
This is the distributed file system running, and you can check the processes with
jps
Which will give you, on the main node, something like
18755 DataNode 18630 NameNode 18969 SecondaryNameNode 19387 Jps
Up to here we set up the distributed filesystem, this will be come handy not only forhadoop, but also for other distributed computation engines, like Spark or Flink - which was Stratosphere.
Finally to start the actualhadoopyarnexecution engine you just go with
start-yarn.sh
As a side note, in this guide, we used IP addresses in configuration files, if you want to use instead the shortnames you shall first update the/etc/hostsso that all of them are listed with their shortname.
10.10.10.104 mynode1 10.10.10.105 mynode2 10.10.10.106 mynode3
In this case, make sure that there, the only appeareance of the ip127.0.0.1is withlocalhost. This is very important, so if in youhostsfile there is a line like
127.0.0.1 mynode1
delete it!
原文:http://my.oschina.net/letiantian/blog/406338