Alexander Chepurnoy

The Web of Mind

How to Install Hadoop 1.0.3 on Cluster (+Nutch)


There are many tutorials on the Web on how to install Hadoop (and Nutch) in a cluster environment, but none of them was 100% sufficient for me. So I'm going to give you another tutorial. I hope it helps.

This tutorial was tested on a CentOS-powered cluster.

It assumes the cluster has only one master node. Real medium-sized and large installations, however, usually have more than one master machine.

Installation

  • Check that all required packages are installed: Sun Java and sshd
  • Give every machine a hostname (for example devcluster.master0, devcluster.slave0, devcluster.slave1, etc.). Set the hostname in the /etc/hostname file with sudo nano /etc/hostname (then reboot, log in again, or restart the network daemon)
  • Add hostname resolution entries for all machines to the /etc/hosts file on every(!) machine. Delete any other hostnames associated with the machine's IP from /etc/hosts (for example, the default xxx.xxx.xxx.xxx entry named CentOS on CentOS)
  • If a firewall is installed, open ports 54310, 54311, 50030, 50060 and 50070
  • Test that hostnames resolve and the machines can reach each other. From one machine, try to ping another: ping devcluster.slave1
  • Create a Hadoop user (on every machine, of course): useradd hadoop (don’t set a password)
  • su hadoop (all further commands are run as the “hadoop” user unless stated otherwise)
  • Generate an SSH key pair to make login possible without a password: ssh-keygen (on one machine, the master, for example. Enter no passphrase!)
  • Now distribute the generated public key across the cluster: ssh-copy-id hadoop@devcluster.slave0 (and so on, for all machines)
  • Check that password-less, key-based SSH login works correctly: ssh localhost, then ssh devcluster.slave0, and so on for all machines
  • Add to .bashrc:
      export HOME_HADOOP=/etc/hadoop
      export PATH=$PATH:$JAVA_HOME/bin:$HOME_HADOOP/bin

  • Download Hadoop 1.0.3 and change the configuration files according to the official documentation. If you are installing Nutch, download it, build it, and copy the .job file (and bin/nutch) to, say, a nutch folder under the Hadoop folder. Then pack the Hadoop folder again and move it to a place reachable from all machines

  • Unpack the package to /etc/hadoop (on all machines):
      sudo mkdir /etc/hadoop
      sudo chown hadoop:hadoop /etc/hadoop
      cd /etc/hadoop
      wget http://[repository-url]/hadoop.tar.gz
      tar zxvf hadoop.tar.gz
      sudo chown -R hadoop:hadoop *
  • Make sure there are no problems by simply launching bin/hadoop. If something goes wrong (most likely JAVA_HOME is not set), now is the time to fix it.
  • Format HDFS (run this command on the master node only): bin/hadoop namenode -format
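
The "change the configuration files" step above is the one most tutorials skip over, so here is a rough sketch of what a minimal configuration could look like. It is not a replacement for the official documentation. The function name write_hadoop_conf is made up for this post, the replication factor of 2 is an arbitrary choice, and I am assuming that ports 54310 and 54311 from the firewall step are the NameNode and JobTracker ports.

```shell
#!/bin/sh
# Sketch: write a minimal Hadoop 1.x configuration for a one-master cluster.
# Assumptions: NameNode on port 54310, JobTracker on port 54311,
# replication factor 2. Adjust all of them to your cluster.

write_hadoop_conf() (
    conf_dir=$1      # e.g. /etc/hadoop/conf
    master=$2        # e.g. devcluster.master0
    shift 2          # remaining arguments are slave hostnames

    mkdir -p "$conf_dir" || exit 1

    # core-site.xml: where clients find the NameNode (HDFS)
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$master:54310</value>
  </property>
</configuration>
EOF

    # mapred-site.xml: where TaskTrackers find the JobTracker
    cat > "$conf_dir/mapred-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>$master:54311</value>
  </property>
</configuration>
EOF

    # hdfs-site.xml: number of copies kept of each block (assumed value)
    cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

    # masters lists the SecondaryNameNode host, slaves lists the worker hosts
    printf '%s\n' "$master" > "$conf_dir/masters"
    printf '%s\n' "$@" > "$conf_dir/slaves"
)
```

A call such as write_hadoop_conf /etc/hadoop/conf devcluster.master0 devcluster.slave0 devcluster.slave1 would produce the five files, assuming the conf folder sits directly under /etc/hadoop after unpacking. JAVA_HOME still has to be set separately in conf/hadoop-env.sh.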

Installation done!
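
Before starting the cluster, it may be worth repeating the ping and password-less SSH checks from the installation steps for every node in one go. Below is a small sketch of how that could look. The function name check_nodes is invented for this post, and the 5-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: verify that every node answers ping and accepts key-based SSH.
# Meant to be run as the "hadoop" user on the master.

check_nodes() (
    failed=0
    for host in "$@"; do
        # One packet is enough to prove name resolution and reachability
        if ! ping -c 1 "$host" >/dev/null 2>&1; then
            echo "FAIL ping $host"
            failed=1
            continue
        fi
        # BatchMode makes ssh fail instead of prompting for a password
        if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "hadoop@$host" true >/dev/null 2>&1; then
            echo "FAIL ssh $host"
            failed=1
            continue
        fi
        echo "OK $host"
    done
    # Exit status is 0 only if every node passed both checks
    exit "$failed"
)
```

For example: check_nodes localhost devcluster.slave0 devcluster.slave1. Any line starting with FAIL points at the host to look into.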

Usage

  • Start the Hadoop cluster: bin/start-all.sh (check that it started OK).
  • Run the WordCount v2.0 example from the official tutorial
  • If you installed Nutch, try to run a Nutch task, e.g. a crawl with the seed URLs given in a urllist.txt file:
      mkdir urls
      cp [pathtourllist file]/urllist.txt urls/
      bin/hadoop dfs -put urls urls
      bin/hadoop dfs -cat urls/urllist.txt
      nutch/bin/nutch crawl urls -dir crawled -depth 2
      bin/hadoop dfs -copyToLocal crawled ../results
  • Stop the cluster: bin/stop-all.sh
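
One way to check that the cluster started OK is to look at the Java processes with jps (it ships with the JDK) on each machine. The sketch below assumes the usual one-master layout, where the master runs NameNode, SecondaryNameNode and JobTracker and every slave runs DataNode and TaskTracker. Your layout may differ, and the function name check_daemons is made up for this post.

```shell
#!/bin/sh
# Sketch: compare `jps` output against the daemons expected for a node role.
# The expected daemon lists assume a single-master Hadoop 1.x layout.

check_daemons() (
    role=$1          # "master" or "slave"
    jps_output=$2    # text printed by `jps` on that machine

    case $role in
        master) expected="NameNode SecondaryNameNode JobTracker" ;;
        slave)  expected="DataNode TaskTracker" ;;
        *) echo "unknown role: $role" >&2; exit 2 ;;
    esac

    missing=0
    for daemon in $expected; do
        # jps prints "<pid> <name>" per line; compare the name exactly so
        # that "NameNode" does not match "SecondaryNameNode"
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'
        then
            echo "missing: $daemon"
            missing=1
        fi
    done
    exit "$missing"
)
```

On the master this could be run as check_daemons master "$(jps)", and for a slave as check_daemons slave "$(ssh devcluster.slave0 jps)", provided jps is on the PATH of the remote non-interactive shell.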

Monitoring

  • Search the /etc/hadoop/logs folder for the keywords “error” and “exception”
  • Check the JobTracker status by opening http://[masterNode]:50030/jobtracker.jsp in a browser
  • Check the HDFS status by opening http://[masterNode]:50070/dfshealth.jsp in a browser
  • Check a slave node’s TaskTracker status at http://[slaveNode]:50060/tasktracker.jsp
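
The log check from the first item can be done with a single grep. Here is a small sketch. The function name scan_logs is made up for this post, and the default path is simply the logs folder mentioned above.

```shell
#!/bin/sh
# Sketch: list every log line mentioning "error" or "exception".

scan_logs() (
    log_dir=${1:-/etc/hadoop/logs}
    # -r recurse into subfolders, -i ignore case,
    # -n print line numbers, -E extended regex for the alternation
    grep -rinE 'error|exception' "$log_dir"
)
```

Running scan_logs without arguments scans /etc/hadoop/logs. grep returns a non-zero exit status when nothing matches, which makes the function easy to use in a cron job or a simple health check.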

Troubleshooting

In case of errors, check the logs for exceptions, then google them. Or click ‘hire me’ for a consultation.
