– Software Engineers
– Application Developers
– IT Architects
– System Administrators
– Data Analysts and Scientists

HADOOP ADMINISTRATION

ABOUT HADOOP ADMINISTRATION

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of nodes and thousands of terabytes of data. Its distributed file system provides rapid data transfer rates among nodes and allows the system to continue operating uninterrupted if a node fails. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.

 

WHY IS HADOOP IMPORTANT?

  1. Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
  2. Computing power. Hadoop’s distributed computing model processes big data fast. The more computing nodes you use the more processing power you have.
  3. Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
  4. Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it; you can store as much data as you want and decide how to use it later. That includes unstructured data such as text, images, and videos.
  5. Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
  6. Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
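The fault-tolerance and scalability points above are visible from the command line. A minimal sketch, assuming a running HDFS cluster with the Hadoop binaries on the PATH; the path and replication factor are illustrative:

```shell
# Fault tolerance: report live/dead DataNodes and overall capacity.
hdfs dfsadmin -report

# Multiple copies of data: raise the replication factor of a file to 3
# and wait (-w) until the extra replicas are in place.
hdfs dfs -setrep -w 3 /data/events.log

# Scalability: after commissioning a new DataNode, spread existing
# blocks onto it with the balancer (threshold is percent disk-usage skew).
hdfs balancer -threshold 10
```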

 

CERTIFICATE COURSE ON HADOOP ADMINISTRATION

The Hadoop Cluster Administration training course is designed to provide the knowledge and skills needed to become a successful Hadoop Architect. It starts with the fundamental concepts of Apache Hadoop and the Hadoop cluster, then covers how to deploy, configure, manage, monitor, and secure a Hadoop cluster. There are many challenging, practical, and focused hands-on exercises for learners. By the end of this training, you will be prepared to understand and solve real-world administration problems that you may come across while working on a Hadoop cluster.

 

COURSE OUTLINE

Introduction

 

The Case for Apache Hadoop

  • Why Hadoop?
  • Fundamental Concepts
  • Core Hadoop Components

 

Hadoop Cluster Installation

  • Cloudera Manager Features
  • Cloudera Manager Installation
  • Hadoop (CDH) Installation

 

The Hadoop Distributed File System (HDFS)

  • HDFS Features
  • Writing and Reading Files
  • NameNode Memory Considerations
  • Overview of HDFS Security
  • Web UIs for HDFS
  • Using the Hadoop File Shell
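The file shell topic above boils down to a handful of `hdfs dfs` subcommands. A brief sketch; the `/user/alice` paths and local file name are hypothetical:

```shell
hdfs dfs -mkdir -p /user/alice/input            # create a directory in HDFS
hdfs dfs -put localfile.txt /user/alice/input   # copy a local file into HDFS
hdfs dfs -ls /user/alice/input                  # list directory contents
hdfs dfs -cat /user/alice/input/localfile.txt   # print a file's contents
hdfs fsck / -files -blocks                      # check file system health (run as the HDFS superuser)
```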

 

MapReduce and YARN

  • YARN: The Cluster Resource Manager
  • MapReduce Concepts
  • Running Computational Frameworks on YARN
  • Exploring YARN Applications Through the Web UIs and the Shell
  • YARN Application Logs

Hadoop Configuration and Daemon Logs

  • Cloudera Manager Constructs for Managing Configurations
  • Locating Configurations and Applying Configuration Changes
  • Managing Role Instances and Adding Services
  • Configuring the HDFS Service
  • Configuring Hadoop Daemon Logs
  • Configuring the YARN Service

 

Getting Data Into HDFS

  • Ingesting Data From External Sources With Flume
  • Ingesting Data From Relational Databases With Sqoop
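A typical Sqoop ingest from the second bullet can be sketched as follows; the JDBC URL, credentials, table name, and target directory are placeholders for your environment:

```shell
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/sales/orders \
  -m 4    # run the import with 4 parallel map tasks
```

The `-P` flag prompts for the password interactively rather than exposing it in the shell history.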

 

Planning Your Hadoop Cluster

  • General Planning Considerations
  • Choosing the Right Hardware
  • Virtualization Options
  • Network Considerations
  • Configuring Nodes

 

Installing and Configuring Hive, Impala, and Pig

  • Hive
  • Impala
  • Pig

 

Advanced Cluster Configuration

  • Advanced Configuration Parameters
  • Configuring Hadoop Ports
  • Configuring HDFS for Rack Awareness
  • Configuring HDFS High Availability
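Once HDFS High Availability is configured, the `haadmin` tool is used to inspect and control the NameNode pair. A sketch, assuming NameNode IDs `nn1` and `nn2` (the IDs are hypothetical; they come from `dfs.ha.namenodes.<nameservice>` in hdfs-site.xml):

```shell
hdfs haadmin -getServiceState nn1   # prints "active" or "standby"
hdfs haadmin -getServiceState nn2
# Manually fail over from nn1 to nn2 (fencing must be configured):
hdfs haadmin -failover nn1 nn2
```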

 

Hadoop Security

  • Why Hadoop Security Is Important
  • Hadoop’s Security System Concepts
  • What Kerberos Is and How It Works
  • Securing a Hadoop Cluster With Kerberos
  • Other Security Concepts

 

Managing Resources

  • The FIFO Scheduler
  • The Fair Scheduler
  • The Capacity Scheduler
  • YARN Memory and CPU Settings
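The scheduler in use and the per-node resource limits above are set in yarn-site.xml. A minimal fragment; the property names are stock Hadoop, while the memory value is illustrative:

```xml
<!-- yarn-site.xml: select the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<!-- Memory available to YARN containers on each NodeManager (illustrative) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>
```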

 

 

Cluster Maintenance

  • Checking HDFS Status
  • Copying Data Between Clusters
  • Adding and Removing Cluster Nodes
  • Rebalancing the Cluster
  • Directory Snapshots
  • Cluster Upgrading
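Each maintenance task above maps to a standard command. Illustrative sketches; the cluster host names, ports, and paths are placeholders:

```shell
hdfs dfsadmin -report                  # HDFS status: capacity, live/dead DataNodes
hadoop distcp \
  hdfs://clusterA:8020/data \
  hdfs://clusterB:8020/backup/data     # copy data between clusters
hdfs dfsadmin -allowSnapshot /data     # enable snapshots on a directory (superuser)
hdfs dfs -createSnapshot /data before-upgrade   # take a named snapshot
hdfs balancer                          # rebalance blocks across DataNodes
```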

 

Cluster Monitoring and Troubleshooting

  • Cloudera Manager Monitoring Features
  • Monitoring Hadoop Clusters
  • Troubleshooting Hadoop Clusters
  • Common Misconfigurations