Running Hadoop on Amazon EC2
Version 0.17 of Hadoop includes a few changes that provide support for multiple simultaneous clusters, quicker startup times for large clusters, and a pre-configured Ganglia installation. These differences are noted below.
Note: Cloudera also provides their distribution for Hadoop as an EC2 AMI with single-command deployment and support for Hive/Pig out of the box.
Security
Clusters
of Hadoop instances are created in a security group. Instances within the
group have unfettered access to one another. Machines outside the group
(such as your workstation), can only access the instance via ports 22 (for SSH),
port 50030 (for the JobTracker), and port 50060 (for the TaskTracker).
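If you later need to open an additional port from outside EC2 (for example the NameNode web interface, which normally listens on port 50070), one possible approach is to use the Amazon EC2 command line tools. This assumes you have those tools installed and that the security group carries your cluster's name; check with ec2-describe-group first:
% ec2-describe-group <cluster-name>
% ec2-authorize <cluster-name> -p 50070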
Setting up
- Unpack the latest Hadoop distribution on your system.
- Edit all relevant variables in src/contrib/ec2/bin/hadoop-ec2-env.sh (a sample is sketched after this list):
  - Amazon Web Services variables (AWS_ACCOUNT_ID, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  - Security variables (EC2_KEYDIR, KEY_NAME, PRIVATE_KEY_PATH, SSH_OPTS)
  - AMI selection (HADOOP_VERSION, S3_BUCKET)
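As a rough illustration, the edited variables in hadoop-ec2-env.sh end up looking something like the following. Every value here is a placeholder or an assumed default, so substitute your own credentials, key pair, bucket, and Hadoop version.
AWS_ACCOUNT_ID=<your account id>
AWS_ACCESS_KEY_ID=<your access key id>
AWS_SECRET_ACCESS_KEY=<your secret access key>
EC2_KEYDIR=~/.ec2
KEY_NAME=<your EC2 key pair name>
PRIVATE_KEY_PATH=~/.ec2/<your private key file>
SSH_OPTS="-i $PRIVATE_KEY_PATH -o StrictHostKeyChecking=no"
HADOOP_VERSION=0.17.0
S3_BUCKET=hadoop-ec2-images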
Running a job on a cluster
- Open a command prompt in src/contrib/ec2.
- Launch an EC2 cluster and start Hadoop with the following command. You must supply a cluster name (test-cluster) and the number of slaves (2). After the cluster boots, the public DNS name will be printed to the console.
% bin/hadoop-ec2 launch-cluster test-cluster 2
- You can log in to the master node from your workstation by typing:
% bin/hadoop-ec2 login test-cluster
- You will then be logged into the master node, where you can start your job (a sample job invocation is sketched after this list).
- You can check the progress of your job at http://<MASTER_HOST>:50030/, where MASTER_HOST is the host name returned after the cluster started, above.
- When you have finished, shut down the cluster with the following:
% bin/hadoop-ec2 terminate-cluster test-cluster
- Keep in mind that the master node is started and configured first, then all slave nodes are booted simultaneously with boot parameters pointing to the master node. Even though the launch-cluster command has returned, the whole cluster may not have booted yet. You should monitor the cluster via port 50030 to make sure all nodes are up.
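Once logged in, a quick way to verify the cluster is to run one of the example jobs bundled with Hadoop from the Hadoop installation directory on the master. This is only a sketch: the examples jar name depends on the Hadoop version baked into the AMI, so adjust it to match what is installed there.
% bin/hadoop jar hadoop-*-examples.jar pi 10 10000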
Running a job on a cluster from a remote machine
In some cases it's desirable to be able to submit a job to a Hadoop cluster running in EC2 from a machine that's outside EC2 (for example, a personal workstation). Similarly, it's convenient to be able to browse/cat files in HDFS from a remote machine. One of the advantages of this setup is that it obviates the need to create custom AMIs that bundle stock Hadoop AMIs and user libraries/code. All the non-Hadoop code can be kept on the remote machine and made available to Hadoop at job submission time (in the form of jar files and other files that are copied into Hadoop's distributed cache). The only downsides are the cost of copying these data sets into EC2 and the latency involved in doing so.
The recipe for doing this is well documented in this Cloudera blog post and involves configuring Hadoop to use an SSH tunnel through the master Hadoop node. Note that this recipe only works when using EC2 scripts from versions of Hadoop that have the fix for HADOOP-5839 incorporated. (Alternatively, users can apply the patches from that JIRA to older versions of Hadoop that do not have this fix.)
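As a rough sketch of that recipe (the port, user, and key path here are illustrative assumptions; follow the blog post for the authoritative steps), you first open a SOCKS tunnel through the master from the remote machine:
% ssh -i <your private key file> -D 6666 root@<MASTER_HOST>
You then point the client-side Hadoop configuration at the master (fs.default.name and mapred.job.tracker) and set hadoop.socks.server to localhost:6666 and hadoop.rpc.socket.factory.class.default to org.apache.hadoop.net.SocksSocketFactory, so that job submission RPCs are routed through the tunnel.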
Troubleshooting
Running Hadoop on EC2 involves a high level of configuration, so it can take a few attempts to get the system working for your particular setup.
If you are having problems with the Hadoop EC2 launch-cluster command, you can run the following commands in turn, which have the same effect but may help you to see where the problem is occurring:
% bin/hadoop-ec2 launch-master <cluster-name>
% bin/hadoop-ec2 launch-slaves <cluster-name> <num slaves>
Note that you can call the launch-slaves command as many times as necessary to grow your cluster. Shrinking a cluster is trickier and should be done by hand (after balancing file replication, etc.).
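For example, to add two more slave nodes to the test-cluster cluster launched earlier (the name and count are purely illustrative):
% bin/hadoop-ec2 launch-slaves test-cluster 2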
To browse all your nodes via a web browser, starting at the 50030 status page, run the following command in a new shell window:
% bin/hadoop-ec2 proxy <cluster-name>
This command will start a SOCKS tunnel through your master node and print out all the URLs you can reach from your web browser. For this to work, you must configure your browser to send requests over SOCKS to the local proxy on port 6666. The Firefox plugin FoxyProxy is great for this.
Currently, the scripts don't have much in the way of error detection or handling. If a script produces an error, you may need to use the Amazon EC2 tools to interact with instances directly, for example to shut down an instance that is misconfigured.
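For example, with the Amazon EC2 command line tools installed, you can list your instances and then terminate a misbehaving one by hand, using the instance ID that ec2-describe-instances reports:
% ec2-describe-instances
% ec2-terminate-instances <instance-id>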
Another technique for debugging is to manually run the scripts line by line until the error occurs. If you have feedback or suggestions, or need help, please use the Hadoop mailing lists.
If not all of your nodes are showing up, you can point your browser to the Ganglia status page for your cluster at http://<MASTER_HOST>/ganglia/ after starting the proxy command.