Setting up Hadoop File System with Ansible
What is Hadoop?
Hadoop is an open-source tool from Apache that lets you build a distributed storage and processing system over the network using its own protocol, HDFS. It is a solution for storing big data. By combining multiple machines into one distributed system and running MapReduce programs on them, we can store and process large amounts of data and also scale easily, both vertically and horizontally.
The architecture is a master-node one, but a bit different from others. Here the client does most of the work, such as passing the data directly to the data nodes, while the master node (the name node) only provides the locations of the data nodes.
What is Ansible?
Ansible is an automation tool, but nothing like the old scripts we used to write for automation in Python, shell, or Perl. Newer tools like Ansible, Puppet, and Chef are more advanced, more efficient, and mostly idempotent in nature.
Main setup
First, we need to define our inventory for this task:
[hadoop_master]
192.168.226.132 ansible_ssh_password=Santhi@1
[hadoop_slave]
192.168.226.131 ansible_ssh_password=Santhi@1
[hadoop_instances:children]
hadoop_master
hadoop_slave
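With the inventory in place, a quick ad-hoc ping confirms that Ansible can reach both nodes before we write the playbook. This is just a sanity check; the inventory file name inventory.txt is my own assumption here, and password-based SSH needs sshpass installed on the controller node.
ansible -i inventory.txt hadoop_instances -m ping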
After that, we start defining the main playbook and set up the common packages and configuration on all instances.
All Hadoop Instances
- name: create hadoop dir
  file:
    path: "{{ software_path }}"
    state: directory

- name: copy jdk.rpm file
  copy:
    src: "./{{ jdkpackage }}"
    dest: "{{ software_path }}/{{ jdkpackage }}"

- name: copy hadoop.rpm file
  copy:
    src: "./{{ hadooppackage }}"
    dest: "{{ software_path }}/{{ hadooppackage }}"
In the above tasks, we create a directory on the managed nodes and then copy over the RPM packages of the JDK and Hadoop v1.
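These tasks assume that software_path, jdkpackage and hadooppackage are defined somewhere, for example in the vars: section of the play. The values below are only placeholders of my own, so substitute the actual RPM file names you have:
vars:
  software_path: /root/hadoop_sw              # assumed location for the copied packages
  jdkpackage: jdk-8u171-linux-x64.rpm         # placeholder file name, use your JDK rpm
  hadooppackage: hadoop-1.2.1-1.x86_64.rpm    # placeholder file name, use your Hadoop v1 rpm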
After this, we need to install the packages using the rpm command available in RHEL. Although we could use the yum module, there is a problem with the Hadoop v1 package, so we force-install it with the rpm command. Since the command module is not idempotent, we make these tasks idempotent ourselves by first checking whether the command already exists.
- name: check for hadoop software
  shell: "hadoop-daemon.sh"
  register: hadoopexists
  ignore_errors: true

- name: check for jdk software
  command: "jps"
  register: javaexists
  ignore_errors: true

- name: install jdk
  shell: "rpm -iv {{ software_path }}/{{ jdkpackage }}"
  when: javaexists.rc != 0

- name: install hadoop
  shell: "rpm -iv {{ software_path }}/{{ hadooppackage }}"
  when: hadoopexists.rc == 127
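As a side note, another common way to keep such install commands idempotent is the creates argument of the command/shell modules, which skips the task when a given file already exists. A rough equivalent for the JDK install might look like this; the /usr/bin/java path is an assumption about what the JDK rpm creates:
- name: install jdk (skipped if java is already installed)
  shell: "rpm -iv {{ software_path }}/{{ jdkpackage }}"
  args:
    creates: /usr/bin/java   # assumed path created by the JDK rpm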
After installing the packages, we need to configure the name node and the data node independently.
Name Node configuration
We first add a variable for this host called nndir that specifies where the name node should store its HDFS metadata.
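For example, nndir can be set as a group variable in the inventory; the /nn path below is just an assumed example directory:
# /nn is an example path for the name node metadata
[hadoop_master:vars]
nndir=/nn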
- name: copy core-site.xml
  template:
    src: "./core-site_master.xml"
    dest: "/etc/hadoop/core-site.xml"

# template (not copy) so the {{ nndir }} variable gets rendered
- name: copy hdfs-site.xml
  template:
    src: "./hdfs-site_master.xml"
    dest: "/etc/hadoop/hdfs-site.xml"
Next, let's look at the core-site and hdfs-site template files that get pushed to the name node.
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ nndir }}</value>
  </property>
</configuration>
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
Now these files are going to be exported to the namenode.
After this, we need to format the namenode and then start the name node.
- name: format the dir
  shell: "echo Y | hadoop namenode -format"

- name: start namenode
  shell: hadoop-daemon.sh start namenode
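If everything went well, running jps manually on the master (outside the playbook) should now list a NameNode process:
jps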
Now we configure the data node.
Data Node configuration
We also add a variable for this host called dndir that specifies where the data node should store its HDFS data blocks.
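As with nndir, dndir can be set as a group variable in the inventory; the /dn path below is an assumed example:
# /dn is an example path for the data node blocks
[hadoop_slave:vars]
dndir=/dn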
- name: copy core-site.xml
  template:
    src: "./core-site_slave.xml"
    dest: "/etc/hadoop/core-site.xml"

- name: copy hdfs-site.xml
  template:
    src: "./hdfs-site_slave.xml"
    dest: "/etc/hadoop/hdfs-site.xml"
In the above tasks, we push the corresponding configuration files to the data node as well.
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>{{ dndir }}</value>
  </property>
</configuration>
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ groups['hadoop_master'][0] }}:9001</value>
  </property>
</configuration>
In the above file, we use the groups variable that Ansible builds from the inventory to pick up the IP of the name node and place it in the config file.
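Finally, the data node daemon also needs to be started, just like the name node; a minimal task sketch for that at the end of this play would be:
- name: start datanode
  shell: hadoop-daemon.sh start datanode   # same pattern as the namenode start task above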
Now run the playbook using the command:
ansible-playbook hadoop.yml
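Once it completes, you can confirm from the name node that the data node has joined the cluster; the standard Hadoop v1 report command should list one live node:
hadoop dfsadmin -report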
Hope you have enjoyed this article and found it useful.
Originally published at https://smc181002.hashnode.dev.