Setting up a single-node Apache Hadoop instance on OS X is pretty simple and much the same as on any other Linux/Unix machine, with a small bit of custom configuration. See here for official instructions. This tutorial provides a quick way of getting your OS X Hadoop instance up and running.
Java is installed by default on OS X. As Hadoop 1.0.3 still recommends Java 6, Java 6 can be installed from the command line if desired. You can check the Java version with the following command:
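A quick check from Terminal might look like the sketch below. `java -version` is the standard JDK flag; the guard and the `JAVA_VERSION_INFO` variable name are only illustrative.

```shell
# java prints its version banner to stderr, so fold it into stdout.
if command -v java >/dev/null 2>&1; then
    JAVA_VERSION_INFO=$(java -version 2>&1) || true
else
    JAVA_VERSION_INFO="java not found on PATH"
fi
echo "$JAVA_VERSION_INFO"
```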
Configure SSH
Hadoop uses SSH to manage its nodes. For our single-node setup, we need to configure SSH access to the local machine for our Hadoop user. In OS X, this requires enabling the Remote Login option in the Sharing preference pane.
First, we have to generate an SSH key for our user. From the command line, run the following command:
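One way to generate the key is sketched below, assuming an RSA key with an empty passphrase is acceptable for a local test machine. Passing `-f` pins the OpenSSH default path, so the location prompt is skipped; the `KEY_FILE` name is just for illustration.

```shell
# Create an RSA key pair with an empty passphrase (-P "") so Hadoop's
# start/stop scripts can log in without prompting.
# -f pins the OpenSSH default path; drop it to be asked for a location.
KEY_FILE="$HOME/.ssh/id_rsa"
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
if [ -f "$KEY_FILE" ]; then
    echo "Key already exists at $KEY_FILE, leaving it alone"
else
    ssh-keygen -t rsa -P "" -f "$KEY_FILE"
fi
```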
Save the public and private keys to the default location. When the keys are generated, run the following command:
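The usual follow-up step is to authorize the new public key for logins to your own account. A minimal sketch, assuming the key pair lives at the default `~/.ssh/id_rsa` location (the variable names are illustrative):

```shell
PUB_KEY="$HOME/.ssh/id_rsa.pub"
AUTH_KEYS="$HOME/.ssh/authorized_keys"
if [ -f "$PUB_KEY" ]; then
    # Append rather than overwrite so existing authorized keys are kept.
    cat "$PUB_KEY" >> "$AUTH_KEYS"
    # Keep the file private to your user; sshd can reject looser permissions.
    chmod 600 "$AUTH_KEYS"
else
    echo "No public key at $PUB_KEY, generate one first"
fi
```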
The last step is to verify that the SSH is working correctly. To do this, run the following:
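The interactive form of this check is simply `ssh localhost`. The sketch below is a scripted variant that reports the result instead of opening a shell; the `SSH_STATUS` variable is illustrative. On a first connection the host key has not been accepted yet, so run plain `ssh localhost` once and answer yes before relying on it.

```shell
# Run a no-op command over SSH. BatchMode makes ssh fail instead of asking
# for a password, so a failure here means key-based login is not working yet.
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true; then
    SSH_STATUS="ok"
else
    SSH_STATUS="failed"
fi
echo "ssh to localhost: $SSH_STATUS"
```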
This should result in a successful login. Repeat with your actual host name.
Downloading and Installing Hadoop
Hadoop may be downloaded from http://www.apache.org/dyn/closer.cgi/hadoop/common/. Select the 1.0.3 release; it will be named hadoop-1.0.3.tar.gz. I like to install Hadoop under /opt/hadoop, but this is a matter of preference; adjust the following to fit your preferences. If /opt doesn’t exist, create it:
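A small sketch for this step, assuming you have administrator rights for `sudo`. The `INSTALL_ROOT` variable is illustrative; change it if you install somewhere other than /opt.

```shell
# /opt may not exist on OS X; create it only if it is missing.
INSTALL_ROOT="/opt"
if [ ! -d "$INSTALL_ROOT" ]; then
    sudo mkdir -p "$INSTALL_ROOT"
fi
```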
Now untar the Hadoop tar file:
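For example, assuming the downloaded tarball sits in the current directory and /opt is the install root:

```shell
# Unpack the release into /opt; this creates /opt/hadoop-1.0.3.
TARBALL="hadoop-1.0.3.tar.gz"
if [ -f "$TARBALL" ]; then
    sudo tar -xzf "$TARBALL" -C /opt
else
    echo "$TARBALL not found in $(pwd)"
fi
```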
Now create a symbolic link to hadoop:
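A sketch of the link step, assuming the release was unpacked to /opt/hadoop-1.0.3. The existence checks are an addition so that re-running the command does not create a nested link.

```shell
# Point /opt/hadoop at the versioned directory, so a later upgrade only
# needs the link to be replaced.
HADOOP_DIR="/opt/hadoop-1.0.3"
if [ -d "$HADOOP_DIR" ] && [ ! -e /opt/hadoop ]; then
    sudo ln -s "$HADOOP_DIR" /opt/hadoop
fi
```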
Finally change the ownership to your user:
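One way to do this is shown below, assuming the current login is the Hadoop user. The real directory is changed directly because a recursive `chown` on OS X does not follow the symbolic link by default.

```shell
HADOOP_OWNER="$(id -un)"
if [ -d /opt/hadoop-1.0.3 ]; then
    # Own the real directory tree recursively...
    sudo chown -R "$HADOOP_OWNER" /opt/hadoop-1.0.3
    # ...and the symlink itself (-h acts on the link, not its target).
    sudo chown -h "$HADOOP_OWNER" /opt/hadoop
else
    echo "/opt/hadoop-1.0.3 not found, nothing to change"
fi
```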
Configuring and Testing Hadoop
Once installed, there are four configuration files that need to be updated. See the documentation for what these files can do.
Configuring: hadoop-env.sh
With your favorite editor, open the file /opt/hadoop/conf/hadoop-env.sh to make some environment updates. Uncomment JAVA_HOME and set it to the output of /usr/libexec/java_home, so that the path to your Java installation is resolved dynamically:
# The java implementation to use. Required.
export JAVA_HOME=$(/usr/libexec/java_home)
Next, uncomment HADOOP_HEAPSIZE and set it to 2000. Technically this is optional, but recommended.
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
Starting with OS X Lion, a bug was introduced that caused issues when working with the name node. The error typically shows up as:
To fix this issue, add the following to your hadoop-env.sh file:
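The widely reported symptom on Lion is a log line reading "Unable to load realm info from SCDynamicStore", and the commonly cited workaround is to pass empty Kerberos realm and KDC settings to the JVM. Treat the line below as an assumption about which bug is meant here, not as the confirmed original fix.

```shell
# Assumed workaround: give the JVM empty Kerberos realm and KDC values so it
# does not try to look them up through the OS X system configuration.
export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="
```

If you use it, the line sits in hadoop-env.sh next to the JAVA_HOME and HADOOP_HEAPSIZE exports.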
To recap, the hadoop-env.sh should contain the following.
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HEAPSIZE=2000
Configuring: core-site.xml
This file controls the default file system and where the temporary files are stored. Remember, the directories for the temp files must be writeable by the Hadoop user. Note, replace vader.local with your system’s host name or with localhost.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://vader.local:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
Configuring: hdfs-site.xml
Next, we need to configure HDFS itself, which is what hdfs-site.xml is for. In this case, I tell HDFS to keep only one copy of each block (a replication factor of 1) and where to store its name node and data node data.
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/opt/HDFS/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/opt/HDFS/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Note, the dfs related directories must be writable by the Hadoop user.
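A quick way to confirm this before formatting HDFS is sketched below. The paths are the ones from the hdfs-site.xml above; the suggested fix in the comment assumes you have `sudo` rights.

```shell
# Report whether each HDFS directory exists and is writable by the current user.
for dir in /opt/HDFS/name /opt/HDFS/data; do
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "$dir: ok"
    else
        # Typical fix: sudo mkdir -p "$dir" && sudo chown -R "$(id -un)" /opt/HDFS
        echo "$dir: missing or not writable"
    fi
done
```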
Configuring: mapred-site.xml
Specify the job tracker location and set the maximum number of map and reduce tasks that a task tracker runs at the same time. In this example, each limit is set to 2, but this can be changed depending on your system.
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>vader.local:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
Initializing HDFS
We need to initialize HDFS before we can use it. This also verifies that the Hadoop user can access the directories. From /opt/hadoop, run the following command:
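The command is the one visible in the sample output, `bin/hadoop namenode -format`. The wrapper below is only a sketch that checks the install location first; `HADOOP_INSTALL` is an illustrative variable name.

```shell
HADOOP_INSTALL="/opt/hadoop"
if [ -x "$HADOOP_INSTALL/bin/hadoop" ]; then
    # Formatting wipes any existing NameNode metadata, so run it only once
    # on a fresh install.
    (cd "$HADOOP_INSTALL" && bin/hadoop namenode -format)
else
    echo "Hadoop not found under $HADOOP_INSTALL"
fi
```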
You should see output like the following:
arun:hadoop prasath$ bin/hadoop namenode -format
12/10/03 21:03:59 INFO namenode.NameNode: STARTUP_MSG:
…
21:03:59 INFO common.Storage: Storage directory /opt/HDFS/name has been successfully formatted.
…
Shutting down NameNode at arun.local/192.168.1.13
************************************************************/
This completes the setup. Now it is time to start up Hadoop and verify that it all works.
Starting Hadoop
From /opt/hadoop, run the following command to start all the Hadoop services.
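In Hadoop 1.x the script for this is `bin/start-all.sh`. The sketch below runs it and then lists the running Java processes with `jps`, which ships with the JDK; the `HADOOP_INSTALL` variable and the existence check are illustrative additions.

```shell
HADOOP_INSTALL="/opt/hadoop"
if [ -x "$HADOOP_INSTALL/bin/start-all.sh" ]; then
    # Starts NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker.
    (cd "$HADOOP_INSTALL" && bin/start-all.sh)
    # List running Java processes by class name to confirm the daemons are up.
    jps
else
    echo "start-all.sh not found under $HADOOP_INSTALL/bin"
fi
```

The matching `bin/stop-all.sh` shuts the services down again.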
You will see each service start; if there are no errors, continue on to run an example test job.
Find Pi to verify Hadoop
To test the installation, run the example Pi calculation job. Again, from /opt/hadoop, run the following command:
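A sketch of the invocation, assuming the examples jar shipped at the top level of the 1.0.3 release. The arguments 10 and 100 are chosen to match the sample output (10 maps, 100 samples per map).

```shell
HADOOP_INSTALL="/opt/hadoop"
EXAMPLES_JAR="hadoop-examples-1.0.3.jar"
if [ -f "$HADOOP_INSTALL/$EXAMPLES_JAR" ]; then
    # Arguments to the pi example: number of map tasks, samples per map.
    (cd "$HADOOP_INSTALL" && bin/hadoop jar "$EXAMPLES_JAR" pi 10 100)
else
    echo "$EXAMPLES_JAR not found under $HADOOP_INSTALL"
fi
```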
You should see output similar to:
Number of Maps = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
…
Estimated value of Pi is 3.14800000000000000000