Installing a Hadoop 2.x cluster with multiple nodes

1) Follow the single-node setup steps
We are going to set up a 3-node Hadoop cluster. To start, follow the steps below on every node, as written in http://techdevins.blogspot.com/2015/12/installing-single-node-hadoop-220-on.html
1) Prerequisite
2) Add Hadoop Group and User
3) Setup SSH Certificate
4) Disabling IPv6
5) Install/ Setup Hadoop
6) Setup environment variable for hadoop
7) Login using hduser and verify hadoop version

** Please make sure to complete only the steps up to step 7)

2) Networking
Update /etc/hosts on each of the 3 boxes and add the lines below:
172.26.34.91    slave2
192.168.64.96   slave1
172.26.34.126   master
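
To quickly verify that the hostnames resolve on every box, you can optionally ping each entry added above:

$ ping -c 1 master
$ ping -c 1 slave1
$ ping -c 1 slave2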

3) SSH access
Set up SSH on every node so that the nodes can communicate with one another without being prompted for a password. Since you have already followed step 1) on every node, the SSH keys have been generated. What we need to do now is enable access to slave1 and slave2 from master. So we just have to add hduser@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hduser@slave1 and hduser@slave2 (in that user's $HOME/.ssh/authorized_keys):

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@slave1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@slave2
$ chmod 0600 ~/.ssh/authorized_keys
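
To confirm that password-less SSH is working from the master, a quick check is to log in to each node once (this also records the host keys in known_hosts):

$ ssh hduser@master exit
$ ssh hduser@slave1 exit
$ ssh hduser@slave2 exit

If any of these still prompts for a password, re-check the authorized_keys file on that node.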

4) Configuration for master node

$ cd /usr/local/hadoop/etc/hadoop/
$ vi slaves
Add the entries below:
master
slave1
slave2

$ vi hdfs-site.xml

<property>
 <name>dfs.replication</name>
 <value>2</value>
 <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
</property>

<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:/home/hduser/hadoopdata/hdfs/namenode</value>
 <description>Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
</property>

<property>
 <name>dfs.datanode.address</name>
 <value>0.0.0.0:60010</value>
 <description>The datanode server address and port for data transfer.</description>
</property>

<property>
 <name>dfs.namenode.secondary.http-address</name>
 <value>0.0.0.0:60090</value>
 <description>The secondary namenode http server address and port.</description>
</property>

<property>
 <name>dfs.namenode.secondary.https-address</name>
 <value>0.0.0.0:60091</value>
 <description>The secondary namenode https server address and port.</description>
</property>

<property>
 <name>dfs.datanode.http.address</name>
 <value>0.0.0.0:60075</value>
 <description>The datanode http server address and port.</description>
</property>


<property>
 <name>dfs.datanode.ipc.address</name>
 <value>0.0.0.0:60020</value>
 <description>The datanode ipc server address and port.</description>
</property>

<property>
 <name>dfs.namenode.http-address</name>
 <value>0.0.0.0:60070</value>
 <description>The address and the base port where the dfs namenode web ui will listen on.</description>
</property>


<property>
 <name>dfs.datanode.data.dir</name>
 <value>file:/home/hduser/hadoopdata/hdfs/datanode</value>
 <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
</property>
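
The dfs.namenode.name.dir and dfs.datanode.data.dir values above point to directories under /home/hduser/hadoopdata. Assuming they do not exist yet, create them as hduser before starting HDFS, for example:

$ mkdir -p /home/hduser/hadoopdata/hdfs/namenode
$ mkdir -p /home/hduser/hadoopdata/hdfs/datanode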

$ vi core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>Temporary Directory.</description>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:54310</value>
  <description>Use HDFS as file storage engine</description>
</property>

<property>
  <name>hadoop.proxyuser.hduser.hosts</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.hduser.groups</name>
  <value>*</value>
</property>
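
Likewise, hadoop.tmp.dir points to /home/hduser/tmp; if that directory does not exist yet, create it as hduser:

$ mkdir -p /home/hduser/tmp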

$ vi yarn-site.xml

<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>

<property>
 <name>yarn.log-aggregation-enable</name>
 <value>true</value>
</property>

<property>
 <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
 <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
 <name>yarn.nodemanager.localizer.address</name>
 <value>${yarn.nodemanager.hostname}:9040</value>
</property>

<property>
 <name>yarn.nodemanager.webapp.address</name>
 <value>${yarn.nodemanager.hostname}:9042</value>
</property>

<property>
 <name>yarn.resourcemanager.scheduler.address</name>
 <value>master:9030</value>
</property>

<property>
 <name>yarn.resourcemanager.address</name>
 <value>master:9032</value>
</property>

<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master:9088</value>
</property>

<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:9031</value>
</property>

<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>master:9033</value>
</property>

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
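
Note: a fresh Hadoop 2.x install usually ships only mapred-site.xml.template in this directory. If mapred-site.xml does not exist yet, it can be created from the template first:

$ cp mapred-site.xml.template mapred-site.xml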


$ vi mapred-site.xml

<property>
 <name>mapreduce.jobtracker.address</name>
 <value>master:54311</value>
 <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

<property>
 <name>mapreduce.shuffle.port</name>
 <value>13564</value>
 <description>Default port that the ShuffleHandler will run on. ShuffleHandler is a service run at the NodeManager to facilitate transfers of intermediate Map outputs to requesting Reducers.</description>
</property>

<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 <description>The framework for running mapreduce jobs</description>
</property>

<property>
 <name>mapreduce.jobhistory.address</name>
 <value>0.0.0.0:10030</value>
 <description>MapReduce JobHistory Server IPC host:port</description>
</property>

<property>
 <name>mapreduce.jobhistory.webapp.address</name>
 <value>0.0.0.0:18888</value>
 <description>MapReduce JobHistory Server Web UI host:port</description>
</property>

<!--property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
</property>

<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>8192</value>
</property-->

<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx3072m</value>
</property>

<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx6144m</value>
</property>


<!--property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx3072m</value>
</property>

<property>
    <name>io.sort.mb</name>
    <value>512</value>
</property-->

$ vi hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/jdk
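
The /usr/lib/jvm/jdk path above is the one used in the single-node guide; if your JDK is installed elsewhere, point JAVA_HOME there instead. A quick way to confirm the path is valid:

$ /usr/lib/jvm/jdk/bin/java -version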

5) Configuration for slave machines (slave1 and slave2)

$ vi mapred-site.xml
<property>
 <name>mapreduce.jobtracker.address</name>
 <value>master:54311</value>
 <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>


<property>
 <name>mapreduce.shuffle.port</name>
 <value>13564</value>
 <description>Default port that the ShuffleHandler will run on. ShuffleHandler is a service run at the NodeManager to facilitate transfers of intermediate Map outputs to requesting Reducers.</description>
</property>


<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 <description>The framework for running mapreduce jobs</description>
</property>

<property>
 <name>mapreduce.jobhistory.address</name>
 <value>0.0.0.0:10030</value>
 <description>MapReduce JobHistory Server IPC host:port</description>
</property>

<property>
 <name>mapreduce.jobhistory.webapp.address</name>
 <value>0.0.0.0:18888</value>
 <description>MapReduce JobHistory Server Web UI host:port</description>
</property>

<!--property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
</property>

<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>8192</value>
</property-->

<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx3072m</value>
</property>

<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx6144m</value>
</property>

<!--property>
    <name>mapred.child.java.opts</name>
    <value> -Xmx1073741824</value>
</property>

<property>
    <name>io.sort.mb</name>
    <value>512</value>
</property-->

$ vi core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>Temporary Directory.</description>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:54310</value>
  <description>Use HDFS as file storage engine</description>
</property>


<property>
  <name>hadoop.proxyuser.hduser.hosts</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.hduser.groups</name>
  <value>*</value>
</property>

$ vi yarn-site.xml

<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>

<property>
 <name>yarn.log-aggregation-enable</name>
 <value>true</value>
</property>

<property>
 <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
 <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
 <name>yarn.nodemanager.localizer.address</name>
 <value>${yarn.nodemanager.hostname}:9040</value>
</property>

<property>
 <name>yarn.nodemanager.webapp.address</name>
 <value>${yarn.nodemanager.hostname}:9042</value>
</property>

<property>
 <name>yarn.resourcemanager.scheduler.address</name>
 <value>master:9030</value>
</property>

<property>
 <name>yarn.resourcemanager.address</name>
 <value>master:9032</value>
</property>

<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master:9088</value>
</property>

<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:9031</value>
</property>

<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>master:9033</value>
</property>

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>

$ vi hdfs-site.xml

<property>
 <name>dfs.replication</name>
 <value>2</value>
 <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
</property>

<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:/home/hduser/hadoopdata/hdfs/namenode</value>
 <description>Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
</property>

<property>
 <name>dfs.datanode.address</name>
 <value>0.0.0.0:60010</value>
 <description>The datanode server address and port for data transfer.</description>
</property>

<property>
 <name>dfs.namenode.secondary.http-address</name>
 <value>0.0.0.0:60090</value>
 <description>The secondary namenode http server address and port.</description>
</property>

<property>
 <name>dfs.namenode.secondary.https-address</name>
 <value>0.0.0.0:60091</value>
 <description>The secondary namenode https server address and port.</description>
</property>

<property>
 <name>dfs.datanode.http.address</name>
 <value>0.0.0.0:60075</value>
 <description>The datanode http server address and port.</description>
</property>


<property>
 <name>dfs.datanode.ipc.address</name>
 <value>0.0.0.0:60020</value>
 <description>The datanode ipc server address and port.</description>
</property>

<property>
 <name>dfs.namenode.http-address</name>
 <value>0.0.0.0:60070</value>
 <description>The address and the base port where the dfs namenode web ui will listen on.</description>
</property>


<property>
 <name>dfs.datanode.data.dir</name>
 <value>file:/home/hduser/hadoopdata/hdfs/datanode</value>
 <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
</property>
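
As on the master, make sure the local directories referenced above exist on each slave (only the datanode directory is actively used on the slaves, but creating both is harmless), for example:

$ mkdir -p /home/hduser/hadoopdata/hdfs/namenode
$ mkdir -p /home/hduser/hadoopdata/hdfs/datanode
$ mkdir -p /home/hduser/tmp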

$ vi hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/jdk

6) Formatting the HDFS filesystem via the NameNode
Before we start our new multi-node cluster, we must format Hadoop's distributed filesystem (HDFS) via the NameNode. You only need to do this the first time you set up a Hadoop cluster. To format it, run:

hduser@master:/usr/local/hadoop$ bin/hadoop namenode -format
...
...
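
Note: in Hadoop 2.x the bin/hadoop namenode -format form is deprecated (it still works, but prints a deprecation warning). The equivalent current command is:

hduser@master:/usr/local/hadoop$ bin/hdfs namenode -format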

7) Start Hadoop services
hduser@master:~$ start-dfs.sh
hduser@master:~$ start-yarn.sh
hduser@master:~$ mr-jobhistory-daemon.sh start historyserver

8) Check on master, slave1 and slave2 whether all the Java processes are running. Also open http://master:9088/cluster/nodes to see if it shows 3 nodes.
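
A simple way to check the Java processes is the jps command on each node. Assuming the configuration above (master is also listed in the slaves file), you should typically see the following daemons:

$ jps

On master: NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager, JobHistoryServer
On slave1 and slave2: DataNode, NodeManager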
