Setting up Hadoop File System with Ansible
What is Hadoop?
Hadoop is an open-source tool from Apache that lets you build a distributed storage and processing system over the network using its own protocol, HDFS. It is a solution for storing big data. By combining multiple machines into one distributed system and running MapReduce programs on them, we can store and process large amounts of data and also scale easily, both vertically and horizontally.
The architecture is a master-node one, but a bit different from others. Here the client does most of the work, such as passing the data directly to the data nodes, while the master node (the name node) only provides the locations of the data nodes.
What is Ansible?
Ansible is an automation tool, but nothing like the old scripts we used to write for automation in Python, shell, or Perl. Newer tools like Ansible, Puppet, and Chef are more advanced, more efficient, and mostly idempotent in nature.
Main setup
First, we need to define our inventory for this task:
[hadoop_master]
192.168.226.132 ansible_ssh_password=Santhi@1
[hadoop_slave]
192.168.226.131 ansible_ssh_password=Santhi@1
[hadoop_instances:children]
hadoop_master
hadoop_slave
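With the inventory in place, a quick ad-hoc ping confirms that Ansible can reach both nodes before we write the playbook. This is just a sanity check; the inventory file name inventory.txt is my own assumption here, and password-based SSH needs sshpass installed on the controller node.
ansible -i inventory.txt hadoop_instances -m ping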
After that, we start defining the main playbook and set up the common packages and configuration on all instances.
All Hadoop Instances
- name: create hadoop dir
  file:
    path: "{{ software_path }}"
    state: directory

- name: copy jdk.rpm file
  copy:
    src: "./{{ jdkpackage }}"
    dest: "{{ software_path }}/{{ jdkpackage }}"

- name: copy hadoop.rpm file
  copy:
    src: "./{{ hadooppackage }}"
    dest: "{{ software_path }}/{{ hadooppackage }}"
In the above tasks, we create a directory on the managed nodes and then copy over the RPM packages of the JDK and Hadoop v1.
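These tasks assume that software_path, jdkpackage and hadooppackage are defined somewhere, for example in the vars: section of the play. The values below are only placeholders of my own, so substitute the actual RPM file names you have:
vars:
  software_path: /root/hadoop_sw              # assumed location for the copied packages
  jdkpackage: jdk-8u171-linux-x64.rpm         # placeholder file name, use your JDK rpm
  hadooppackage: hadoop-1.2.1-1.x86_64.rpm    # placeholder file name, use your Hadoop v1 rpm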
After this, we need to install the packages using the rpm command available in RHEL. Although we could use the yum module, there is a problem with the Hadoop v1 package, so we force-install it with the rpm command. Since the command module is not idempotent, we make these tasks idempotent ourselves by first checking whether the command already exists.
- name: check for hadoop software
  shell: "hadoop-daemon.sh"
  register: hadoopexists
  ignore_errors: true

- name: check for jdk software
  command: "jps"
  register: javaexists
  ignore_errors: true

- name: install jdk
  shell: "rpm -iv {{ software_path }}/{{ jdkpackage }}"
  when: javaexists.rc != 0

- name: install hadoop
  shell: "rpm -iv {{ software_path }}/{{ hadooppackage }}"
  when: hadoopexists.rc == 127
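As a side note, another common way to keep such install commands idempotent is the creates argument of the command/shell modules, which skips the task when a given file already exists. A rough equivalent for the JDK install might look like this; the /usr/bin/java path is an assumption about what the JDK rpm creates:
- name: install jdk (skipped if java is already installed)
  shell: "rpm -iv {{ software_path }}/{{ jdkpackage }}"
  args:
    creates: /usr/bin/java   # assumed path created by the JDK rpm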
After installing the packages, we need to configure the name node and the data node independently.
Name Node configuration
We first add a variable for this host called nndir that specifies where the name node should store its HDFS metadata.
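For example, nndir can be set as a group variable in the inventory; the /nn path below is just an assumed example directory:
# /nn is an example path for the name node metadata
[hadoop_master:vars]
nndir=/nn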
- name: copy core-site.xml
  template:
    src: "./core-site_master.xml"
    dest: "/etc/hadoop/core-site.xml"

# template (not copy) so the {{ nndir }} variable gets rendered
- name: copy hdfs-site.xml
  template:
    src: "./hdfs-site_master.xml"
    dest: "/etc/hadoop/hdfs-site.xml"
Next, let's look at the core-site and hdfs-site template files that get pushed to the name node.
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ nndir }}</value>
  </property>
</configuration>
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
Now these files are going to be exported to the namenode.
After this, we need to format the namenode and then start the name node.
- name: format the dir
  shell: "echo Y | hadoop namenode -format"

- name: start namenode
  shell: hadoop-daemon.sh start namenode
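If everything went well, running jps manually on the master (outside the playbook) should now list a NameNode process:
jps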
Now we configure the data node.
Data Node configuration
We also add a variable for this host called dndir that specifies where the data node should store its HDFS data blocks.
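As with nndir, dndir can be set as a group variable in the inventory; the /dn path below is an assumed example:
# /dn is an example path for the data node blocks
[hadoop_slave:vars]
dndir=/dn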
- name: copy core-site.xml
  template:
    src: "./core-site_slave.xml"
    dest: "/etc/hadoop/core-site.xml"

- name: copy hdfs-site.xml
  template:
    src: "./hdfs-site_slave.xml"
    dest: "/etc/hadoop/hdfs-site.xml"
In the above tasks, we push the corresponding configuration files to the data node as well.
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>{{ dndir }}</value>
  </property>
</configuration>
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ groups['hadoop_master'][0] }}:9001</value>
  </property>
</configuration>
In the above file, we use the groups variable that Ansible builds from the inventory to pick up the IP of the name node and place it in the config file.
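Finally, the data node daemon also needs to be started, just like the name node; a minimal task sketch for that at the end of this play would be:
- name: start datanode
  shell: hadoop-daemon.sh start datanode   # same pattern as the namenode start task above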
Now run the playbook using the command:
ansible-playbook hadoop.yml
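Once it completes, you can confirm from the name node that the data node has joined the cluster; the standard Hadoop v1 report command should list one live node:
hadoop dfsadmin -report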
Hope you have enjoyed this article and found it useful.
Originally published at https://smc181002.hashnode.dev.