There are many tutorials on the Web about installing Hadoop (and Nutch) in a cluster environment, but for me not one of them was 100% sufficient. So I’m going to give you another tutorial. I hope it helps.
The tutorial was tested on a CentOS-powered cluster.
The tutorial assumes the cluster has only one master node. However, real medium-sized and large applications have more than one master machine.
- Check whether all needed packages are installed: Sun Java, sshd
- Give all machines hostnames (for example devcluster.master0, devcluster.slave0, devcluster.slave1, etc.). Set the hostname in the /etc/hostname file with
sudo nano /etc/hostname (then reboot, log in again, or restart the network daemon)
- Append hostname resolution entries to the /etc/hosts file on every(!) machine. Delete any other hostnames associated with a machine's IP from /etc/hosts (for example, a line such as xxx.xxx.xxx.xxx CentOS on CentOS)
- If a firewall is installed, open ports 54310, 54311, 50030, 50060, 50070
- Test whether all is OK with ports and hostnames. From one machine, try to ping another machine by its hostname (for example, ping devcluster.slave0)
- Create a hadoop user (on every machine, of course):
useradd hadoop (don’t set a password)
su hadoop (further commands are run as the “hadoop” user unless otherwise specified)
- Generate an SSH key to make login possible without a password:
ssh-keygen (on one machine, the master, for example. Enter no passphrase!)
- Now spread the generated public key over the cluster:
ssh-copy-id hadoop@devcluster.slave0 (etc… to all machines)
- Check that password-less key-based SSH login works correctly:
ssh devcluster.slave0 (and so on for all machines)
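Once the cluster has more than a few nodes, the password-less login check above can be scripted. The following is only a minimal sketch: the function name check_ssh_nodes is hypothetical, and the hostnames are the example ones used in this tutorial. It relies on the standard OpenSSH options BatchMode and ConnectTimeout, so a node that would still ask for a password is reported as a failure instead of prompting.

```shell
#!/usr/bin/env bash
# Sketch: verify password-less SSH to every node given as an argument.
# check_ssh_nodes is a hypothetical helper name, not part of Hadoop or OpenSSH.
check_ssh_nodes() {
  local failed=0 host
  for host in "$@"; do
    # BatchMode=yes makes ssh fail instead of asking for a password.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  # Non-zero exit status if at least one node could not be reached.
  return "$failed"
}
```

Run as the hadoop user on the master, a call could look like: check_ssh_nodes devcluster.master0 devcluster.slave0 devcluster.slave1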
Add to .bashrc:
HOME_HADOOP=/etc/hadoop
PATH=$PATH:$JAVA_HOME/bin:$HOME_HADOOP/bin
Download Hadoop 1.0.3 and change the configuration files according to the official documentation. If you are installing Nutch, download it, build it, and copy the .job file (and bin/nutch) to, say, a nutch folder under the Hadoop folder. Then pack the Hadoop folder again and move it to a place visible from all machines.
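The repacking step could look roughly like the sketch below. It is only an illustration: the function name pack_hadoop_with_nutch and all paths passed to it are hypothetical, and it assumes the Nutch build left its .job file and its bin/nutch script under one directory (for a Nutch 1.x build with ant this is typically runtime/deploy, but check your own build output).

```shell
#!/usr/bin/env bash
# Sketch: copy the Nutch job file and launcher into a "nutch" folder inside
# the Hadoop folder, then repack everything as one archive.
# pack_hadoop_with_nutch and its three arguments are hypothetical names.
pack_hadoop_with_nutch() {
  local hadoop_dir="$1" nutch_deploy_dir="$2" out_tar="$3"

  mkdir -p "$hadoop_dir/nutch/bin" || return 1
  cp "$nutch_deploy_dir"/*.job "$hadoop_dir/nutch/" || return 1
  cp "$nutch_deploy_dir/bin/nutch" "$hadoop_dir/nutch/bin/" || return 1

  # Archive the contents of the Hadoop folder (not the folder itself), so that
  # unpacking inside /etc/hadoop puts bin/, conf/ and nutch/ directly there.
  # Keep out_tar outside hadoop_dir so the archive does not include itself.
  tar -czf "$out_tar" -C "$hadoop_dir" .
}
```

For example: pack_hadoop_with_nutch ~/hadoop-1.0.3 ~/nutch/runtime/deploy ~/hadoop.tar.gz (all three paths are placeholders).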
- Unpack the package to /etc/hadoop (on all machines):
sudo mkdir /etc/hadoop
sudo chown hadoop:hadoop /etc/hadoop
cd /etc/hadoop
wget http://[repository-url]/hadoop.tar.gz
tar zxvf hadoop.tar.gz
sudo chown -R hadoop:hadoop *
- Ensure there are no problems by simply launching
bin/hadoop (if problems occur, probably because JAVA_HOME is not set, now is the time to fix them).
- Format HDFS (run this command on the master node only):
bin/hadoop namenode -format
- Start the Hadoop cluster:
bin/start-all.sh (check whether it started OK).
- Run the WordCount example v2.0 from the official tutorial
- If you installed Nutch, try to run a Nutch task, e.g. a crawl with seed URLs given in a urllist.txt file:
mkdir urls
cp [pathtourllist file]/urllist.txt urls/
bin/hadoop dfs -put urls urls
bin/hadoop dfs -cat urls/urllist.txt
nutch/bin/nutch crawl urls -dir crawled -depth 2
bin/hadoop dfs -copyToLocal crawled ../results
- Stop the cluster:
bin/stop-all.sh
- Check the /etc/hadoop/logs folder for the keywords “error” and “exception”
- Check the JobTracker status by opening http://[masterNode]:50030/jobtracker.jsp in a browser
- Check the HDFS status by opening http://[masterNode]:50070/dfshealth.jsp in a browser
- Check a slave node’s TaskTracker status at http://[slaveNode]:50060/tasktracker.jsp
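The log check from the list above is easy to script. Here is a minimal sketch, assuming the logs live in /etc/hadoop/logs as in this tutorial. scan_hadoop_logs is a hypothetical helper name, and it uses only standard grep options.

```shell
#!/usr/bin/env bash
# Sketch: print log lines that mention "error" or "exception" (any letter case).
# scan_hadoop_logs is a hypothetical helper name.
scan_hadoop_logs() {
  local log_dir="${1:-/etc/hadoop/logs}"
  # -r: recurse into the folder, -n: show line numbers,
  # -i: ignore case, -E: extended regex (for the | alternative)
  grep -rniE 'error|exception' "$log_dir"
}
```

Called without arguments it scans /etc/hadoop/logs; its exit status is that of grep, so it is non-zero when nothing suspicious was found.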
In case of errors, check the logs for exceptions, then google them. Or click ‘hire me’ for a consultation.