Running Hadoop on Amazon EC2

Version 0.17 of Hadoop includes a few changes that provide support for multiple simultaneous clusters, quicker startup times for large clusters, and a pre-configured Ganglia installation. These differences are noted below.

Note: Cloudera also provides its distribution for Hadoop as an EC2 AMI, with single-command deployment and support for Hive/Pig out of the box.

Security

Clusters of Hadoop instances are created in a security group. Instances within the group have unfettered access to one another. Machines outside the group (such as your workstation) can only access the instances via port 22 (for SSH), port 50030 (for the JobTracker), and port 50060 (for the TaskTracker).

Setting up

  • Unpack the latest Hadoop distribution on your system
  • Edit all relevant variables in src/contrib/ec2/bin/hadoop-ec2-env.sh (a sample configuration is sketched after this list).
    • Amazon Web Services variables (AWS_ACCOUNT_ID, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    • Security variables (EC2_KEYDIR, KEY_NAME, PRIVATE_KEY_PATH, SSH_OPTS)
    • AMI selection (HADOOP_VERSION, S3_BUCKET)
      • These two variables control which AMI is used.
      • To see which versions are publicly available type:
        % ec2-describe-images -x all | grep hadoop
      • The default value for S3_BUCKET (hadoop-ec2-images) is for public images. Images for Hadoop version 0.17.1 and later are in the hadoop-images bucket, so you should change this variable if you want to use one of these images. You also need to change this if you want to use a private image you have built yourself.
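
For reference, a filled-in hadoop-ec2-env.sh might look something like the sketch below. The account ID, key name, and paths are placeholders to substitute with your own values, and the real file contains more variables and comments than are shown here.

# Your Amazon account and credentials (placeholder values).
AWS_ACCOUNT_ID=123456789012
AWS_ACCESS_KEY_ID=<your access key ID>
AWS_SECRET_ACCESS_KEY=<your secret access key>
# Where your EC2 keys live and how the scripts SSH to the instances.
EC2_KEYDIR=~/.ec2
KEY_NAME=gsg-keypair
PRIVATE_KEY_PATH=~/.ec2/id_rsa-gsg-keypair
SSH_OPTS="-i $PRIVATE_KEY_PATH -o StrictHostKeyChecking=no"
# Which AMI to use (0.17.1 and later images live in the hadoop-images bucket).
HADOOP_VERSION=0.17.1
S3_BUCKET=hadoop-images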

Running a job on a cluster

  • Open a command prompt in src/contrib/ec2.
  • Launch an EC2 cluster and start Hadoop with the following command. You must supply a cluster name (test-cluster) and the number of slaves (2). After the cluster boots, the public DNS name of the master will be printed to the console.
    % bin/hadoop-ec2 launch-cluster test-cluster 2
  • You can login to the master node from your workstation by typing:
    % bin/hadoop-ec2 login test-cluster
  • You will then be logged into the master node where you can start your job.
    • For example, to test your cluster, try
      # cd /usr/local/hadoop-*
      # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
  • You can check the progress of your job at http://<MASTER_HOST>:50030/, where MASTER_HOST is the host name printed to the console when the cluster started, above.
  • When you have finished, shutdown the cluster with the following:
    % bin/hadoop-ec2 terminate-cluster test-cluster
  • Keep in mind that the master node is started and configured first; all slave nodes are then booted simultaneously with boot parameters pointing to the master node. Even though the launch-cluster command has returned, the whole cluster may not yet have booted. You should monitor the cluster via port 50030 to make sure all nodes are up (a minimal polling sketch follows this list).
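
As a minimal sketch of this kind of monitoring, the following loop (run on your workstation, assuming curl is installed and MASTER_HOST is the host name printed above) simply waits until the JobTracker web UI answers; you can then check the node count on that page by hand:

% until curl -s -o /dev/null http://<MASTER_HOST>:50030/; do sleep 10; done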

Running a job on a cluster from a remote machine

In some cases it's desirable to be able to submit a job to a Hadoop cluster running in EC2 from a machine outside EC2 (for example, a personal workstation). Similarly, it's convenient to be able to browse or cat files in HDFS from a remote machine. One of the advantages of this setup is that it obviates the need to create custom AMIs that bundle stock Hadoop AMIs and user libraries/code. All the non-Hadoop code can be kept on the remote machine and made available to Hadoop at job submission time (in the form of jar files and other files that are copied into Hadoop's distributed cache). The only downside is the cost of copying these data sets into EC2 and the latency involved in doing so.

The recipe for doing this is well documented in this Cloudera blog post and involves configuring Hadoop to use an SSH tunnel through the master node. Note that this recipe only works with EC2 scripts from versions of Hadoop that incorporate the fix for HADOOP-5839. (Alternatively, users can apply the patches from that JIRA to older versions of Hadoop that do not have the fix.)
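
As a rough sketch of the tunnelling half of that recipe (the user, key path, and port shown here are illustrative; consult the blog post and your Hadoop version's documentation for the exact steps):

% ssh -i <PRIVATE_KEY_PATH> -D 6666 root@<MASTER_HOST>

With the tunnel open, the Hadoop configuration on your workstation points fs.default.name and mapred.job.tracker at the master and routes client connections through the SOCKS proxy on localhost:6666 (via the hadoop.socks.server and hadoop.rpc.socket.factory.class.default properties), so jobs submitted locally reach the cluster inside EC2.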

Troubleshooting

Running Hadoop on EC2 involves a fair amount of configuration, so it can take a few attempts to get the system working for your particular setup.

If you are having problems with the Hadoop EC2 launch-cluster command, you can run the following commands in turn; together they have the same effect, but they may help you to see where the problem is occurring:

% bin/hadoop-ec2 launch-master <cluster-name>
% bin/hadoop-ec2 launch-slaves <cluster-name> <num slaves>

Note that you can call the launch-slaves command as many times as necessary to grow your cluster. Shrinking a cluster is trickier and should be done by hand (after balancing file replication, etc.).
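
For example, the two-slave test-cluster from the section above could be started in two steps and later grown by two more slaves simply by repeating the launch-slaves command:

% bin/hadoop-ec2 launch-master test-cluster
% bin/hadoop-ec2 launch-slaves test-cluster 2
% bin/hadoop-ec2 launch-slaves test-cluster 2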

To browse all your nodes via a web browser, starting at the 50030 status page, run the following command in a new shell window:

% bin/hadoop-ec2 proxy <cluster-name>

This command will start a SOCKS tunnel through your master node and print out all the URLs you can reach from your web browser. For this to work, you must configure your browser to send requests over SOCKS to the local proxy on port 6666. The Firefox plugin FoxyProxy is great for this.
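
If you want to check that the tunnel is working before configuring your browser, a quick test (assuming a curl build with SOCKS support) is to fetch the JobTracker status page through the proxy:

% curl --socks5-hostname localhost:6666 http://<MASTER_HOST>:50030/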

Currently, the scripts don't have much in the way of error detection or handling. If a script produces an error, you may need to use the Amazon EC2 tools to interact with instances directly - for example, to shut down an instance that is misconfigured.
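
For example, with the Amazon EC2 command line tools installed, something like the following can be used to find and shut down a stray instance by hand (since clusters are created in their own security group, grepping the output for the cluster name is usually a quick way to pick out the right instances):

% ec2-describe-instances | grep <cluster-name>
% ec2-terminate-instances <instance-id>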

Another technique for debugging is to run the scripts manually, line by line, until the error occurs. If you have feedback or suggestions, or need help, please use the Hadoop mailing lists.

If not all of your nodes are showing up, you can point your browser at the Ganglia status page for your cluster, http://<MASTER_HOST>/ganglia/, after starting the proxy command.