VMware vSphere Big Data Extensions - 2.0 Administrator’s Guide

VMware vSphere Big Data Extensions
Administrator's and User's Guide
vSphere Big Data Extensions 2.0
This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions of this document, see http://www.vmware.com/support/pubs.
EN-001514-02
You can find the most up-to-date technical documentation on the VMware Web site at:
http://www.vmware.com/support/
The VMware Web site also provides the latest product updates.
If you have comments about this documentation, submit your feedback to:
docfeedback@vmware.com
Copyright © 2013, 2014 VMware, Inc. All rights reserved. Copyright and trademark information. This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 United States License
(http://creativecommons.org/licenses/by-nd/3.0/us/legalcode).
VMware, Inc.
3401 Hillview Ave. Palo Alto, CA 94304 www.vmware.com

Contents

About This Book 7
Updated Information 9
1 About VMware vSphere Big Data Extensions 11
Getting Started with Big Data Extensions 11
Big Data Extensions and Project Serengeti 13
About Big Data Extensions Architecture 14
Big Data Extensions Support for Hadoop Features By Distribution 14
Hadoop Feature Support By Distribution 16
2 Installing Big Data Extensions 19
System Requirements for Big Data Extensions 20
Internationalization and Localization 22
Deploy the Big Data Extensions vApp in the vSphere Web Client 23
Install RPMs in the Serengeti Management Server Yum Repository 26
Install the Big Data Extensions Plug-In 27
Connect to a Serengeti Management Server 28
Install the Serengeti Remote Command-Line Interface Client 29
Access the Serengeti Command-Line Interface Using the Remote Command-Line Interface Client 30
3 Upgrading Big Data Extensions 33
Prepare to Upgrade Big Data Extensions 34
Upgrade Big Data Extensions Virtual Appliance 35
Upgrade the Big Data Extensions Plug-in 38
Upgrade the Serengeti Command-Line Interface 38
Upgrade the CentOS 6.x Template 39
Upgrade Big Data Extensions Virtual Machine Components 40
4 Managing Hadoop Distributions 43
Hadoop Distribution Deployment Types 43
Configure a Tarball-Deployed Hadoop Distribution 44
Configuring Yum and Yum Repositories 46
Create a Hadoop Template Virtual Machine using RHEL Server 6.x and VMware Tools 57
Maintain a Customized Hadoop Template Virtual Machine 60
5 Managing the Big Data Extensions Environment 63
Add Specific User Names to Connect to the Serengeti Management Server 64
Change the Password for the Serengeti Management Server 64
Configure vCenter Single Sign-On Settings for the Serengeti Management Server 65
Create a User Name and Password for the Serengeti Command-Line Interface 66
Stop and Start Serengeti Services 66
6 Managing vSphere Resources for Hadoop and HBase Clusters 69
Add a Resource Pool with the Serengeti Command-Line Interface 70
Remove a Resource Pool with the Serengeti Command-Line Interface 70
Add a Datastore in the vSphere Web Client 70
Remove a Datastore in the vSphere Web Client 71
Add a Network in the vSphere Web Client 72
Reconfigure a Static IP Network in the vSphere Web Client 72
Remove a Network in the vSphere Web Client 73
7 Creating Hadoop and HBase Clusters 75
About Hadoop and HBase Cluster Deployment Types 76
About Cluster Topology 77
About HBase Database Access 77
Create a Hadoop or HBase Cluster in the vSphere Web Client 78
Create a Cluster with Topology Awareness with the Serengeti Command-Line Interface 80
8 Managing Hadoop and HBase Clusters 83
Stop and Start a Hadoop Cluster in the vSphere Web Client 84
Scale Out a Hadoop Cluster in the vSphere Web Client 84
Scale CPU and RAM in the vSphere Web Client 85
Reconfigure a Hadoop or HBase Cluster with the Serengeti Command-Line Interface 86
Delete a Hadoop Cluster in the vSphere Web Client 88
About Resource Usage and Elastic Scaling 88
Use Disk I/O Shares to Prioritize Cluster Virtual Machines in the vSphere Web Client 92
About vSphere High Availability and vSphere Fault Tolerance 93
Recover from Disk Failure with the Serengeti Command-Line Interface Client 93
Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client 94
Change the User Password on All of a Cluster's Nodes 94
9 Monitoring the Big Data Extensions Environment 97
View Serengeti Management Server Initialization Status 97
View Provisioned Clusters in the vSphere Web Client 98
View Cluster Information in the vSphere Web Client 99
Monitor the Hadoop Distributed File System Status in the vSphere Web Client 100
Monitor MapReduce Status in the vSphere Web Client 101
Monitor HBase Status in the vSphere Web Client 101
10 Using Hadoop Clusters from the Serengeti Command-Line Interface 103
Run HDFS Commands with the Serengeti Command-Line Interface 103
Run MapReduce Jobs with the Serengeti Command-Line Interface 104
Run Pig and PigLatin Scripts with the Serengeti Command-Line Interface 104
Run Hive and Hive Query Language Scripts with the Serengeti Command-Line Interface 105
11 Accessing Hive Data with JDBC or ODBC 107
Configure Hive to Work with JDBC 107
Configure Hive to Work with ODBC 109
12 Troubleshooting 111
Log Files for Troubleshooting 112
Configure Serengeti Logging Levels 113
Collect Log Files for Troubleshooting 113
Big Data Extensions Virtual Appliance Upgrade Fails 114
Troubleshooting Cluster Creation Failures 114
Cannot Restart or Reconfigure a Cluster After Changing Its Distribution 121
Cannot Restart or Reconfigure a Cluster Whose Time Is Not Synchronized 121
Virtual Machine Cannot Get IP Address 122
vCenter Server Connections Fail to Log In 122
SSL Certificate Error When Connecting to Non-Serengeti Server with the vSphere Console 123
Serengeti Operations Fail After You Rename a Resource in vSphere 123
A New Plug-In Instance with the Same or Earlier Version Number as a Previous Plug-In Instance Does Not Load 123
MapReduce Job Fails to Run and Does Not Appear In the Job History 124
Cannot Submit MapReduce Jobs for Compute-Only Clusters with External Isilon HDFS 125
MapReduce Job Stops Responding on a PHD or CDH4 YARN Cluster 125
Unable to Connect the Big Data Extensions Plug-In to the Serengeti Server 125
Cannot Perform Serengeti Operations after Deploying Big Data Extensions 126
Host Name and FQDN Do Not Match for Serengeti Management Server 127
Upgrade Cluster Error When Using Cluster Created in Earlier Version of Big Data Extensions 128
Index 129

About This Book

VMware vSphere Big Data Extensions Administrator's and User's Guide describes how to install Big Data Extensions within your vSphere environment, and how to manage and monitor Hadoop and HBase clusters using the Big Data Extensions plug-in for vSphere Web Client.
VMware vSphere Big Data Extensions Administrator's and User's Guide also describes how to perform Hadoop and HBase operations using the Serengeti Command-Line Interface Client, which provides a greater degree of control for certain system management and Big Data cluster creation tasks.
Intended Audience
This guide is for system administrators and developers who want to use Big Data Extensions to deploy and manage Hadoop clusters. To successfully work with Big Data Extensions, you should be familiar with VMware® vSphere® and Hadoop and HBase deployment and operation.
VMware Technical Publications Glossary
VMware Technical Publications provides a glossary of terms that might be unfamiliar to you. For definitions of terms as they are used in VMware technical documentation, go to
http://www.vmware.com/support/pubs.

Updated Information

This VMware vSphere Big Data Extensions Administrator's and User's Guide is updated with each release of the product or when necessary.
This table provides the update history of the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Revision     | Description
EN-001514-02 | Enhanced the description of how to access the upgrade source in the topic “Download the Upgrade Source and Accept the License Agreement,” on page 36.
EN-001514-01 | Expanded the information on preparing to upgrade and upgrading the Big Data Extensions vApp in Chapter 3, “Upgrading Big Data Extensions,” on page 33.
EN-001514-00 | Initial release.
Chapter 1  About VMware vSphere Big Data Extensions
VMware® vSphere™ Big Data Extensions lets you deploy and centrally operate Hadoop and HBase clusters running on VMware vSphere. Big Data Extensions simplifies the Hadoop and HBase deployment and provisioning process, and gives you a real time view of the running services and the status of their virtual hosts. It provides a central place from which to manage and monitor your Hadoop and HBase cluster, and incorporates a full range of tools to help you optimize cluster performance and utilization.
- Getting Started with Big Data Extensions on page 11
  VMware vSphere Big Data Extensions lets you deploy Hadoop and HBase clusters. The tasks in this section describe how to set up vSphere for use with Big Data Extensions, deploy the Big Data Extensions vApp, access the vCenter Server and command-line interface (CLI) administrative consoles, and configure a Hadoop distribution for use with Big Data Extensions.
- Big Data Extensions and Project Serengeti on page 13
  Big Data Extensions runs on top of Project Serengeti, the open source project initiated by VMware to automate the deployment and management of Hadoop and HBase clusters on virtual environments such as vSphere.
- About Big Data Extensions Architecture on page 14
  The Serengeti Management Server and Hadoop Template virtual machine work together to configure and provision Hadoop and HBase clusters.
- Big Data Extensions Support for Hadoop Features By Distribution on page 14
  Big Data Extensions provides different levels of feature support depending on the Hadoop distribution and version that you use.
- Hadoop Feature Support By Distribution on page 16
  Each Hadoop distribution and version provides differing feature support. Learn which Hadoop distributions support which features.

Getting Started with Big Data Extensions

VMware vSphere Big Data Extensions lets you deploy Hadoop and HBase clusters. The tasks in this section describe how to set up vSphere for use with Big Data Extensions, deploy the Big Data Extensions vApp, access the vCenter Server and command-line interface (CLI) administrative consoles, and configure a Hadoop distribution for use with Big Data Extensions.
Prerequisites
- Understand what Project Serengeti and Big Data Extensions are so that you know how they fit into your Big Data workflow and vSphere environment. See “Big Data Extensions and Project Serengeti,” on page 13.
- Verify that the Big Data Extensions features that you want to use, such as data-compute separated clusters and elastic scaling, are supported by Big Data Extensions for the Hadoop distribution that you want to use. See “Big Data Extensions Support for Hadoop Features By Distribution,” on page 14.
- Understand which features are supported by your Hadoop distribution. See “Hadoop Feature Support By Distribution,” on page 16.
Procedure
1 Do one of the following.
  - Install Big Data Extensions for the first time. Review the system requirements, install vSphere, and install the Big Data Extensions components: Big Data Extensions vApp, Big Data Extensions plug-in for vCenter Server, and Serengeti Remote Command-Line Interface Client. See Chapter 2, “Installing Big Data Extensions,” on page 19.
  - Upgrade Big Data Extensions from a previous version. Perform the upgrade steps. See Chapter 3, “Upgrading Big Data Extensions,” on page 33.
2 (Optional) Install and configure a distribution other than Apache Hadoop for use with Big Data Extensions.
Apache Hadoop is included in the Serengeti Management Server, but you can use any Hadoop distribution that Big Data Extensions supports. See Chapter 4, “Managing Hadoop Distributions,” on page 43.
What to do next
After you have successfully installed and configured your Big Data Extensions environment, you can perform the following additional tasks, in any order.
- Stop and start the Serengeti services, create user accounts, manage passwords, and log in to cluster nodes to perform troubleshooting. See Chapter 5, “Managing the Big Data Extensions Environment,” on page 63.
- Manage the vSphere resource pools, datastores, and networks that you use to create Hadoop and HBase clusters. See Chapter 6, “Managing vSphere Resources for Hadoop and HBase Clusters,” on page 69.
- Create, provision, and manage Hadoop and HBase clusters. See Chapter 7, “Creating Hadoop and HBase Clusters,” on page 75 and Chapter 8, “Managing Hadoop and HBase Clusters,” on page 83.
- Monitor the status of the clusters that you create, including their datastores, networks, and resource pools, through the vSphere Web Client and the Serengeti Command-Line Interface. See Chapter 9, “Monitoring the Big Data Extensions Environment,” on page 97.
- On your Big Data clusters, run HDFS commands, Hive and Pig scripts, and MapReduce jobs, and access Hive data. See Chapter 10, “Using Hadoop Clusters from the Serengeti Command-Line Interface,” on page 103.
- If you encounter any problems when using Big Data Extensions, see Chapter 12, “Troubleshooting,” on page 111.

Big Data Extensions and Project Serengeti

Big Data Extensions runs on top of Project Serengeti, the open source project initiated by VMware to automate the deployment and management of Hadoop and HBase clusters on virtual environments such as vSphere.
Big Data Extensions and Project Serengeti provide the following components.
Project Serengeti: An open source project initiated by VMware, Project Serengeti lets users deploy and manage Hadoop and Big Data clusters in a vCenter Server managed environment. The major components are the Serengeti Management Server, which provides cluster provisioning, software configuration, and management services; an elastic scaling framework; and a command-line interface. Project Serengeti is made available under the Apache 2.0 license, under which anyone can modify and redistribute Project Serengeti according to the terms of the license.

Serengeti Management Server: Provides the framework and services to run Big Data clusters on vSphere. The Serengeti Management Server performs resource management, policy-based virtual machine placement, cluster provisioning, software configuration management, and environment monitoring.

Serengeti Command-Line Interface Client: The command-line interface (CLI) client provides a comprehensive set of tools and utilities with which to monitor and manage your Big Data deployment. If you are using the open source version of Serengeti without Big Data Extensions, the CLI is the only interface through which you can perform administrative tasks.

Big Data Extensions: The commercial version of the open source Project Serengeti from VMware, Big Data Extensions, is delivered as a vCenter Server Appliance. Big Data Extensions includes all the Project Serengeti functions and the following additional features and components.
- Enterprise level support from VMware.
- Hadoop distribution from the Apache community.
  NOTE VMware provides the Hadoop distribution as a convenience but does not provide enterprise-level support. The Apache Hadoop distribution is supported by the open source community.
- The Big Data Extensions plug-in, a graphical user interface integrated with vSphere Web Client. This plug-in lets you perform common Hadoop infrastructure and cluster management administrative tasks.
- Elastic scaling lets you optimize cluster performance and utilization of physical compute resources in a vSphere environment. Elasticity-enabled clusters start and stop virtual machines, adjusting the number of active compute nodes based on configuration settings that you specify, to optimize resource consumption. Elasticity is ideal in a mixed workload environment to ensure that workloads can efficiently share the underlying physical resources while high-priority jobs are assigned sufficient resources.
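For reference, a cluster's scaling mode is set from the Serengeti Command-Line Interface with the cluster setParam command, which is covered later in this guide. The cluster name and node counts below are placeholders, and the auto-mode option shown is illustrative and can differ by release.

cluster setParam --name myHadoopCluster --elasticityMode manual --targetComputeNodeNum 10
cluster setParam --name myHadoopCluster --elasticityMode auto --minComputeNodeNum 2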

About Big Data Extensions Architecture

The Serengeti Management Server and Hadoop Template virtual machine work together to configure and provision Hadoop and HBase clusters.
Big Data Extensions performs the following steps to deploy a Hadoop or HBase cluster.
1 The Serengeti Management Server searches for ESXi hosts with sufficient resources to operate the
cluster based on the configuration settings that you specify, and then selects the ESXi hosts on which to place Hadoop virtual machines.
2 The Serengeti Management Server sends a request to the vCenter Server to clone and configure virtual
machines to use with the Hadoop or HBase cluster.
3 The Serengeti Management Server configures the operating system and network parameters for the
new virtual machines.
4 Each virtual machine downloads the Hadoop software packages and installs them by applying the
distribution and installation information from the Serengeti Management Server.
5 The Serengeti Management Server configures the Hadoop parameters for the new virtual machines
based on the cluster configuration settings that you specify.
6 The Hadoop services are started on the new Hadoop virtual machines, at which point you have a
running cluster based on your configuration settings.

Big Data Extensions Support for Hadoop Features By Distribution

Big Data Extensions provides different levels of feature support depending on the Hadoop distribution and version that you use.
Support for Hadoop MapReduce v1 Distribution Features
Big Data Extensions provides differing levels of feature support depending on the Hadoop distribution and version that you configure for use. Table 1-1 lists the supported Hadoop MapReduce v1 distributions and indicates which features are supported when using the distribution with Big Data Extensions.
Table 1-1. Big Data Extensions Feature Support for Hadoop MapReduce v1 Distributions

Feature                                 | Apache Hadoop | Cloudera   | Hortonworks | Intel        | MapR
Version                                 | 1.2           | CDH4, CDH5 | HDP 1.3     | 2.5.1, 3.0.2 | 3.0.2-3.1.0
Automatic Deployment                    | Yes | Yes | Yes | Yes | Yes
Scale Out                               | Yes | Yes | Yes | Yes | Yes
Create Cluster with Multiple Networks   | Yes | Yes | Yes | Yes | No
Data-Compute Separation                 | Yes | Yes | Yes | Yes | Yes
Compute-only                            | Yes | Yes | Yes | Yes | No
Elastic Scaling of Compute Nodes        | Yes | Yes when using MapReduce v1 | Yes | Yes | No
Hadoop Configuration                    | Yes | Yes | Yes | Yes | No
Hadoop Topology Configuration           | Yes | Yes | Yes | Yes | No
Run Hadoop Commands from the CLI        | Yes | No | No | No | No
Hadoop Virtualization Extensions (HVE)  | Yes | No | Yes | No | No
vSphere HA                              | Yes | Yes | Yes | Yes | Yes
Service Level vSphere HA                | Yes | See “About Service Level vSphere HA for Cloudera,” on page 16 | Yes | Yes | No
vSphere FT                              | Yes | Yes | Yes | Yes | Yes
Support for Hadoop MapReduce v2 (YARN) Distribution Features
Big Data Extensions provides differing levels of feature support depending on the Hadoop distribution and version that you configure for use. Table 1-2 lists the supported Hadoop MapReduce v2 distributions and indicates which features are supported when using the distribution with Big Data Extensions.
Table 1-2. Big Data Extensions Feature Support for Hadoop MapReduce v2 (YARN) Distributions

Feature                                 | Apache Bigtop | Apache Hadoop | Cloudera | Cloudera | Hortonworks  | Pivotal
Version                                 | 0.7.0         | 2.0           | CDH4     | CDH5     | HDP 2.0, 2.1 | PHD 1.1, 2.0
Automatic Deployment                    | Yes | Yes | Yes | Yes | Yes | Yes
Scale Out                               | Yes | Yes | Yes | Yes | Yes | Yes
Create Cluster with Multiple Networks   | Yes | Yes | Yes | Yes | Yes | Yes
Data-Compute Separation                 | Yes | Yes | Yes | Yes | Yes | Yes
Compute-only                            | Yes | Yes | Yes | Yes | Yes | Yes
Elastic Scaling of Compute Nodes        | Yes | Yes | No when using MapReduce 2 | No when using MapReduce 2 | Yes | No
Hadoop Configuration                    | Yes | Yes | Yes | Yes | Yes | Yes
Hadoop Topology Configuration           | Yes | Yes | Yes | Yes | Yes | Yes
Run Hadoop Commands from the CLI        | No | No | No | No | No | No
Hadoop Virtualization Extensions (HVE)  | Support only for HDFS | Support only for HDFS | No | Support only for HDFS | Support only for HDFS. HDP 1.3 provides full support. | Yes
vSphere HA                              | No | No | No | No | No | No
Service Level vSphere HA                | No | No | See “About Service Level vSphere HA for Cloudera,” on page 16 | See “About Service Level vSphere HA for Cloudera,” on page 16 | No | No
vSphere FT                              | No | No | No | No | No | No
About Service Level vSphere HA for Cloudera
The Cloudera distributions offer the following support for Service Level vSphere HA.
- Cloudera using MapReduce v1 provides service level vSphere HA support for JobTracker.
- Cloudera provides its own service level HA support for NameNode through HDFS2.

Hadoop Feature Support By Distribution

Each Hadoop distribution and version provides differing feature support. Learn which Hadoop distributions support which features.
Hadoop Features
The table illustrates which Hadoop distributions support which features.
Table 1-3. Hadoop Feature Support

Feature             | Apache Hadoop | Apache Bigtop | Cloudera   | Hortonworks | Hortonworks | Intel        | MapR        | Pivotal
Version             | 1.2           | 2.0           | CDH4, CDH5 | HDP 1.3     | HDP 2.0-2.1 | 2.5.1, 3.0.2 | 3.0.2-3.1.0 | PHD 1.1, 2.0
HDFS1               | Yes | Yes | No  | Yes | Yes | Yes | No  | No
HDFS2               | No  | Yes | Yes | No  | No  | No  | No  | Yes
MapReduce v1        | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No
MapReduce v2 (YARN) | No  | Yes | Yes | No  | Yes | No  | No  | Yes
Pig                 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Hive                | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Hive Server         | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
HBase               | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
ZooKeeper           | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes

Chapter 2  Installing Big Data Extensions

To install Big Data Extensions so that you can create and provision Hadoop and HBase clusters, you must install the Big Data Extensions components in the order described.
Procedure
1 System Requirements for Big Data Extensions on page 20
Before you begin the Big Data Extensions deployment tasks, your system must meet all of the prerequisites for vSphere, clusters, networks, storage, hardware, and licensing.
2 Internationalization and Localization on page 22
Big Data Extensions supports internationalization (I18N) level 1. However, there are resources you specify that do not provide UTF-8 support. You can use only ASCII attribute names consisting of alphanumeric characters and underscores (_) for these resources.
3 Deploy the Big Data Extensions vApp in the vSphere Web Client on page 23
Deploying the Big Data Extensions vApp is the first step in getting your Hadoop cluster up and running with vSphere Big Data Extensions.
4 Install RPMs in the Serengeti Management Server Yum Repository on page 26
Install the wsdl4j and mailx RPM packages within the Serengeti Management Server's internal Yum repository.
5 Install the Big Data Extensions Plug-In on page 27
To enable the Big Data Extensions user interface for use with a vCenter Server Web Client, register the plug-in with the vSphere Web Client. The Big Data Extensions graphical user interface is supported only when using vSphere Web Client 5.1 and later. If you install Big Data Extensions on vSphere 5.0, perform all administrative tasks using the Serengeti Command-Line Interface Client.
6 Connect to a Serengeti Management Server on page 28
To use the Big Data Extensions plug-in to manage and monitor Big Data clusters and Hadoop distributions, you must connect the Big Data Extensions plug-in to the Serengeti Management Server in your Big Data Extensions deployment.
7 Install the Serengeti Remote Command-Line Interface Client on page 29
Although the Big Data Extensions Plug-in for vSphere Web Client supports basic resource and cluster management tasks, you can perform a greater number of the management tasks using the Serengeti Command-line Interface Client.
8 Access the Serengeti Command-Line Interface Using the Remote Command-Line Interface Client on page 30
You can access the Serengeti Command-Line Interface using the Serengeti Remote Command-Line Interface Client. The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management Server to deploy, manage, and use Hadoop.
What to do next
If you want to create clusters on any Hadoop distribution other than Apache Hadoop, which is included in the Serengeti Management Server, install and configure the distribution for use with Big Data Extensions. See Chapter 4, “Managing Hadoop Distributions,” on page 43.

System Requirements for Big Data Extensions

Before you begin the Big Data Extensions deployment tasks, your system must meet all of the prerequisites for vSphere, clusters, networks, storage, hardware, and licensing.
Big Data Extensions requires that you install and configure vSphere and that your environment meets minimum resource requirements. Make sure that you have licenses for the VMware components of your deployment.
vSphere Requirements
Before you install Big Data Extensions, set up the following VMware products.
- Install vSphere 5.0 (or later) Enterprise or Enterprise Plus.
  NOTE The Big Data Extensions graphical user interface is supported only when using vSphere Web Client 5.1 and later. If you install Big Data Extensions on vSphere 5.0, perform all administrative tasks using the Serengeti Command-Line Interface.
- When installing Big Data Extensions on vSphere 5.1 or later, use VMware® vCenter™ Single Sign-On to provide user authentication. When logging in to vSphere 5.1 or later you pass authentication to the vCenter Single Sign-On server, which you can configure with multiple identity sources such as Active Directory and OpenLDAP. On successful authentication, your user name and password is exchanged for a security token that is used to access vSphere components such as Big Data Extensions.
- Configure all ESXi hosts to use the same Network Time Protocol (NTP) server.
- On each ESXi host, add the NTP server to the host configuration, and from the host configuration's Startup Policy list, select Start and stop with host. The NTP daemon ensures that time-dependent processes occur in sync across hosts.
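If SSH access to the host shell is enabled, a quick way to spot-check that every ESXi host points at the same time source is to read its NTP configuration directly; the file path and service name below are the ESXi defaults.

grep '^server' /etc/ntp.conf      # should list the same NTP server on every host
/etc/init.d/ntpd status           # the NTP daemon should be running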
Cluster Settings
Configure your cluster with the following settings.
- Enable vSphere HA and VMware vSphere® Distributed Resource Scheduler™.
- Enable Host Monitoring.
- Enable Admission Control and set the policy you want. The default policy is to tolerate one host failure.
- Set the virtual machine restart priority to High.
- Set the virtual machine monitoring to virtual machine and Application Monitoring.
- Set the Monitoring sensitivity to High.
- Enable vMotion and Fault Tolerance Logging.
- All hosts in the cluster have Hardware VT enabled in the BIOS.
- The Management Network VMkernel Port has vMotion and Fault Tolerance Logging enabled.
Network Settings
Big Data Extensions can deploy clusters on a single network or use multiple networks. The environment determines how Port Groups that are attached to NICs are configured and which network backs each Port Group.
You can use either a vSwitch or vSphere Distributed Switch (vDS) to provide the Port Group backing a Serengeti cluster. vDS acts as a single virtual switch across all attached hosts while a vSwitch is per-host and requires the Port Group to be configured manually.
When configuring your networks to use with Big Data Extensions, verify that the following ports are open as listening ports.
- Ports 8080 and 8443 are used by the Big Data Extensions plug-in user interface and the Serengeti Command-Line Interface Client.
- Port 5480 is used by vCenter Single Sign-On for monitoring and management.
- Port 22 is used by SSH clients.
- To prevent having to open a network firewall port to access Hadoop services, log into the Hadoop client node, and from that node you can access your cluster.
- To connect to the Internet (for example, to create an internal Yum repository from which to install Hadoop distributions), you may use a proxy.
- To enable communications, be sure that firewalls and Web filters do not block the Serengeti Management Server or other Serengeti nodes.
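Before you deploy, you can probe the listening ports from a machine on the management network. The example below uses a placeholder management-server address and assumes the nc utility on your workstation supports the -z scan option.

nc -z -w 3 192.168.1.100 8080 && echo "port 8080 reachable"
nc -z -w 3 192.168.1.100 8443 && echo "port 8443 reachable"
nc -z -w 3 192.168.1.100 5480 && echo "port 5480 reachable"
nc -z -w 3 192.168.1.100 22 && echo "port 22 reachable"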
Direct Attached Storage
Attach and configure Direct Attached Storage on the physical controller to present each disk separately to the operating system. This configuration is commonly described as Just A Bunch Of Disks (JBOD). Create VMFS Datastores on Direct Attached Storage using the following disk drive recommendations.
- 8-12 disk drives per host. The more disk drives per host, the better the performance.
- 1-1.5 disk drives per processor core.
- 7,200 RPM Serial ATA disk drives.

Resource Requirements for the vSphere Management Server and Templates
- Resource pool with at least 27.5GB RAM.
- 40GB or more (recommended) disk space for the management server and Hadoop template virtual disks.

Resource Requirements for the Hadoop Cluster
- Datastore free space is not less than the total size needed by the Hadoop cluster, plus swap disks for each Hadoop node that is equal to the memory size requested.
- Network configured across all relevant ESXi hosts, and has connectivity with the network in use by the management server.
- vSphere HA is enabled for the master node if vSphere HA protection is needed. To use vSphere HA or vSphere FT to protect the Hadoop master node, you must use shared storage.
Hardware Requirements for the vSphere and Big Data Extensions Environment
Host hardware is listed in the VMware Compatibility Guide. To run at optimal performance, install your vSphere and Big Data Extensions environment on the following hardware.
- Dual Quad-core CPUs or greater that have Hyper-Threading enabled. If you can estimate your computing workload, consider using a more powerful CPU.
- Use High Availability (HA) and dual power supplies for the master node's host machine.
- 4-8 GB of memory for each processor core, with 6% overhead for virtualization.
- Use a 1GB Ethernet interface or greater to provide adequate network bandwidth.
Tested Host and Virtual Machine Support
The maximum host and virtual machine support that has been confirmed to successfully run with Big Data Extensions is 128 physical hosts running a total of 512 virtual machines.
vSphere Licensing
You must use a vSphere Enterprise license or above to use VMware vSphere HA and vSphere DRS.

Internationalization and Localization

Big Data Extensions supports internationalization (I18N) level 1. However, there are resources you specify that do not provide UTF-8 support. You can use only ASCII attribute names consisting of alphanumeric characters and underscores (_) for these resources.
Big Data Extensions Supports Unicode UTF-8
vCenter Server resources you specify using both the CLI and vSphere Web Client can be expressed with underscore (_), hyphen (-), blank spaces, and all letters and numbers from any language. For example, you can specify resources such as datastores labeled using non-English characters.
When using a Linux operating system, VMware recommends configuring the system for use with UTF-8 encoding specific to your locale. For example, to use U.S. English, specify the following locale encoding: en_US.UTF-8. See your vendor's documentation for information on configuring UTF-8 encoding for your Linux environment.
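For example, on a CentOS or RHEL 6 system you can check the active locale and switch the current shell to a UTF-8 locale as shown below; the configuration file path is distribution specific and is given only as an illustration.

locale                       # show the current locale settings
export LANG=en_US.UTF-8      # use a UTF-8 locale for the current shell
# To persist the setting on CentOS/RHEL 6, set LANG="en_US.UTF-8" in /etc/sysconfig/i18n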
Special Character Support
The following vCenter Server resources can have a period (.) in their name, letting you select them using both the CLI and vSphere Web Client.
- portgroup name
- cluster name
- resource pool name
- datastore name
The use of a period is not allowed in the Serengeti resource name.
Resources Excluded From Unicode UTF-8 Support
The Serengeti cluster specification file, manifest file, and topology racks-hosts mapping file do not provide UTF-8 support. When creating these files to define the nodes and resources for use by the cluster, use only ASCII attribute names consisting of alphanumeric characters and underscores (_).
The following resource names:
- cluster name
- nodeGroup name
- node name
- virtual machine name

The following attributes in the Serengeti cluster specification file:
- distro name
- role
- cluster configuration
- storage type
- haFlag
- instanceType
- groupAssociationsType

The rack name in the topology racks-hosts mapping file, and the placementPolicies field of the Serengeti cluster specification file.

Deploy the Big Data Extensions vApp in the vSphere Web Client

Deploying the Big Data Extensions vApp is the first step in getting your Hadoop cluster up and running with vSphere Big Data Extensions.
Prerequisites
- Install and configure vSphere. See “System Requirements for Big Data Extensions,” on page 20.
- Configure all ESXi hosts to use the same NTP server.
- On each ESXi host, add the NTP server to the host configuration, and from the host configuration's Startup Policy list, select Start and stop with host. The NTP daemon ensures that time-dependent processes occur in sync across hosts.
- When installing Big Data Extensions on vSphere 5.1 or later, use vCenter Single Sign-On to provide user authentication.
- Verify that you have one vSphere Enterprise license for each host on which you deploy virtual Hadoop nodes. You manage your vSphere licenses in the vSphere Web Client or in vCenter Server.
- Install the Client Integration plug-in for the vSphere Web Client. This plug-in enables OVF deployment on your local file system.
  NOTE Depending on the security settings of your browser, you might have to approve the plug-in when you use it the first time.
- Download the Big Data Extensions OVA from the VMware download site.
- Verify that you have at least 40GB disk space available for the OVA. You need additional resources for the Hadoop cluster.
- Ensure that you know the vCenter Single Sign-On Look-up Service URL for your vCenter Single Sign-On service.
If you are installing Big Data Extensions on vSphere 5.1 or later, ensure that your environment includes vCenter Single Sign-On. Use vCenter Single Sign-On to provide user authentication on vSphere 5.1 or later.
See “System Requirements for Big Data Extensions,” on page 20 for a complete list.
Procedure
1 In the vSphere Web Client vCenter Hosts and Clusters view, select Actions > All vCenter Actions >
Deploy OVF Template.
2 Choose the location where the Big Data Extensions OVA resides and click Next.
Option           | Description
Deploy from File | Browse your file system for an OVF or OVA template.
Deploy from URL  | Type a URL to an OVF or OVA template located on the internet. For example: http://vmware.com/VMTN/appliance.ovf.
3 View the OVF Template Details page and click Next.
4 Accept the license agreement and click Next.
5 Specify a name for the vApp, select a target datacenter for the OVA, and click Next.
The only valid characters for Big Data Extensions vApp names are alphanumeric and underscores. The vApp name must be < 60 characters. When you choose the vApp name, also consider how you will name your clusters. Together the vApp and cluster names must be < 80 characters.
6 Select a vSphere resource pool for the OVA and click Next.
Select a top-level resource pool. Child resource pools are not supported by Big Data Extensions even though you can select a child resource pool. If you select a child resource pool, you will not be able to create clusters from Big Data Extensions.
7 Select shared storage for the OVA and click Next.
If shared storage is not available, local storage is acceptable.
8 For each network specified in the OVF template, select a network in the Destination Networks column
in your infrastructure to set up the network mapping.
The first network lets the Management Server communicate with your Hadoop cluster. The second network lets the Management Server communicate with vCenter Server. If your vCenter Server deployment does not use IPv6, you can specify the same IPv4 destination network for use by both source networks.
9 Configure the network settings for your environment, and click Next.
a Enter the network settings that let the Management Server communicate with your Hadoop
cluster.
Use a static IPv4 (IP) network. An IPv4 address is four numbers separated by dots as in aaa.bbb.ccc.ddd, where each number ranges from 0 to 255. You must enter a netmask, such as
255.255.255.0, and a gateway address, such as 192.168.1.253.
If the vCenter Server or any ESXi host or Hadoop distribution repository is resolved using a fully qualified domain name (FQDN), you must enter a DNS address. Enter the DNS server IP address as DNS Server 1. If there is a secondary DNS server, enter its IP address as DNS Server 2.
NOTE You cannot use a shared IP pool with Big Data Extensions.
b (Optional) If you are using IPv6 between the Management Server and vCenter Server, select the
Enable Ipv6 Connection? checkbox.
Enter the IPv6 address, or FQDN, of the vCenter Server. The IPv6 address size is 128 bits. The preferred IPv6 address representation is: xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx where each x is a hexadecimal digit representing 4 bits. IPv6 addresses range from 0000:0000:0000:0000:0000:0000:0000:0000 to ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff. For convenience, an IPv6 address may be abbreviated to shorter notations by application of the following rules.
- Remove one or more leading zeroes from any groups of hexadecimal digits. This is usually done to either all or none of the leading zeroes. For example, the group 0042 is converted to 42.
- Replace consecutive sections of zeroes with a double colon (::). You may only use the double colon once in an address, as multiple uses would render the address indeterminate. RFC 5952 recommends that a double colon not be used to denote an omitted single section of zeroes.
The following example demonstrates applying these rules to the address 2001:0db8:0000:0000:0000:ff00:0042:8329.
- Removing all leading zeroes results in the address 2001:db8:0:0:0:ff00:42:8329.
- Omitting consecutive sections of zeroes results in the address 2001:db8::ff00:42:8329.
See RFC 4291 for more information on IPv6 address notation.
10 Verify that the Initialize Resources check box is selected and click Next.
If the check box is unselected, the resource pool, data store, and network connection assigned to the vApp will not be added to Big Data Extensions.
If you do not add the resource pool, datastore, and network when you deploy the vApp, use the vSphere Web Client or the Serengeti Command-Line Interface Client to specify the resource pool, datastore, and network information before you create a Hadoop cluster.
11 Verify the vService bindings and click Next.
12 Verify the installation information and click Finish.
vCenter Server deploys the Big Data Extensions vApp. When deployment finishes, two virtual machines are available in the vApp.
- The Management Server virtual machine, management-server (also called the Serengeti Management Server), which is started as part of the OVA deployment.
- The Hadoop Template virtual machine, hadoop-template, which is not started. Big Data Extensions clones Hadoop nodes from this template when provisioning a cluster. Do not start or stop this virtual machine without good reason. The template does not include a Hadoop distribution.
IMPORTANT Do not delete any files under the /opt/serengeti/.chef directory. If you delete any of these files, such as the serengeti.pem file, subsequent upgrades to Big Data Extensions might fail without displaying error notifications.
What to do next
Install the Big Data Extensions plug-in within the vSphere Web Client. See “Install the Big Data Extensions
Plug-In,” on page 27.
If the Initialize Resources check box is not selected, add resources to the Big Data Extensions server before you create a Hadoop cluster.
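If you skipped resource initialization, you can add the resources later from the Serengeti Command-Line Interface with the resourcepool add, datastore add, and network add commands. The names and flag values below are placeholders and the exact options vary by release; see the VMware vSphere Big Data Extensions Command-Line Interface Guide for the authoritative syntax.

resourcepool add --name myRP --vccluster Cluster01 --vcrp rp1
datastore add --name myLocalDS --spec local-datastore-* --type LOCAL
network add --name myNetwork --portGroup pg1 --dhcp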

Install RPMs in the Serengeti Management Server Yum Repository

Install the wsdl4j and mailx RPM packages within the Serengeti Management Server's internal Yum repository.
Prerequisites
Deploy the Big Data Extensions vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web
Client,” on page 23.
Procedure
1 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as the
user serengeti.
2 Download and install the wsdl4j and mailx RPM packages.
- If the Serengeti Management Server can connect to the Internet, run the commands as shown in the example below to download the RPMs, copy the files to the required directory, and create a repository.
  cd /opt/serengeti/www/yum/repos/centos/6/base/RPMS/
  wget http://mirror.centos.org/centos/6/os/x86_64/Packages/mailx-12.4-7.el6.x86_64.rpm
  wget http://mirror.centos.org/centos/6/os/x86_64/Packages/wsdl4j-1.5.2-7.8.el6.noarch.rpm
  createrepo ..
- If the Serengeti Management Server cannot connect to the Internet, you must manually download the RPMs, copy the files to the required directory, and create a repository.
  a Download the RPM files as shown in the example below.
    http://mirror.centos.org/centos/6/os/x86_64/Packages/mailx-12.4-7.el6.x86_64.rpm
    http://mirror.centos.org/centos/6/os/x86_64/Packages/wsdl4j-1.5.2-7.8.el6.noarch.rpm
b Copy the RPM files to /opt/serengeti/www/yum/repos/centos/6/base/RPMS/.
c Run the createrepo command to create a repository from the RPMs you downloaded.
createrepo /opt/serengeti/www/yum/repos/centos/6/base/
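Whichever method you used, you can confirm that the repository metadata was generated by listing the repodata directory; this check assumes the default repository path shown above.

ls /opt/serengeti/www/yum/repos/centos/6/base/repodata/    # repomd.xml and related metadata files should be listed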
What to do next
Install the Serengeti Remote Command-Line Interface client. See “Install the Serengeti Remote Command-
Line Interface Client,” on page 29.

Install the Big Data Extensions Plug-In

To enable the Big Data Extensions user interface for use with a vCenter Server Web Client, register the plug-in with the vSphere Web Client. The Big Data Extensions graphical user interface is supported only when using vSphere Web Client 5.1 and later. If you install Big Data Extensions on vSphere 5.0, perform all administrative tasks using the Serengeti Command-Line Interface Client.
The Big Data Extensions plug-in provides a graphical user interface that integrates with the vSphere Web Client. Using the Big Data Extensions plug-in interface you can perform common Hadoop infrastructure and cluster management tasks.
NOTE Use only the Big Data Extensions plug-in interface in the vSphere Web Client or the Serengeti Command-Line Interface Client to monitor and manage your Big Data Extensions environment. Performing management operations in vCenter Server might cause the Big Data Extensions management tools to become unsynchronized and unable to accurately report the operational status of your Big Data Extensions environment.
Prerequisites
- Deploy the Big Data Extensions vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on page 23.
- Ensure that you have login credentials with administrator privileges for the vCenter Server system with which you are registering Big Data Extensions.
  NOTE The user name and password you use to log in cannot contain characters whose UTF-8 encoding is greater than 0x8000.
- If you want to use the vCenter Server IP address to access the vSphere Web Client, and your browser uses a proxy, add the vCenter Server IP address to the list of proxy exceptions.
Procedure
1 Open a Web browser and go to the URL of the vSphere Web Client 5.1 or later.
https://hostname-or-ip-address:port/vsphere-client
The hostname can be either the DNS hostname or IP address of vCenter Server. By default the port is 9443, but this might have been changed during installation of the vSphere Web Client.
2 Type a user name and password that have administrative privileges on vCenter Server, and click Login.
3 Using the vSphere Web Client Navigator panel, locate the Serengeti Management Server that you want
to register with the plug-in.
You can find the Serengeti Management Server under the datacenter and resource pool into which you deployed it in the previous task.
4 From the inventory tree, select management-server to display information about the Serengeti
Management Server in the center pane.
Click the Summary tab in the center pane to access additional information.
5 Note the IP address of the Serengeti Management Server virtual machine.
6 Open a Web browser and go to the URL of the management-server virtual machine.
https://management-server-ip-address:8443/register-plugin
The management-server-ip-address is the IP address you noted in Step 5.
7 Enter the information to register the plug-in.
Option                                 | Description
Register or Unregister Radio Button    | Select the Install radio button to install the plug-in. Select Uninstall to uninstall the plug-in.
vCenter Server host name or IP address | Type the server host name or IP address of vCenter Server. NOTE Do not include http:// or https:// when you type the host name or IP address.
User Name and Password                 | Type the user name and password with administrative privileges that you use to connect to vCenter Server. The user name and password cannot contain characters whose UTF-8 encoding is greater than 0x8000.
Big Data Extensions Package URL        | The URL with the IP address of the management-server virtual machine where the Big Data Extensions plug-in package is located: https://management-server-ip-address/vcplugin/serengeti-plugin.zip
8 Click Submit.
The Big Data Extensions plug-in registers with vCenter Server and with the vSphere Web Client.
9 Log out of the vSphere Web Client, and log back in using your vCenter Server user name and
password.
The Big Data Extensions icon appears in the list of objects in the inventory.
10 From the Inventory pane, click Big Data Extensions.
What to do next
Connect the Big Data Extensions plug-in to the Big Data Extensions instance that you want to manage by connecting to its Serengeti Management Server.

Connect to a Serengeti Management Server

To use the Big Data Extensions plug-in to manage and monitor Big Data clusters and Hadoop distributions, you must connect the Big Data Extensions plug-in to the Serengeti Management Server in your Big Data Extensions deployment.
You can deploy multiple instances of the Serengeti Management Server in your environment. However, you can connect the Big Data Extensions plug-in with only one Serengeti Management Server instance at a time. You can change which Serengeti Management Server instance the plug-in connects to using this procedure, and use the Big Data Extensions plug-in interface to manage and monitor multiple Hadoop and HBase distributions deployed in your environment.
IMPORTANT The Serengeti Management Server that you connect to is shared by all users of the Big Data Extensions plug-in interface within the vSphere Web Client. If a user connects to a different Serengeti Management Server, all other users are affected by this change.
Prerequisites
- Verify that the Big Data Extensions vApp deployment was successful and that the Serengeti Management Server virtual machine is running. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on page 23.
  The version of the Serengeti Management Server and the Big Data Extensions plug-in must be the same.
- Ensure that vCenter Single Sign-On is enabled and configured for use by Big Data Extensions for vSphere 5.1 and later. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on page 23.
- Install the Big Data Extensions plug-in. See “Install the Big Data Extensions Plug-In,” on page 27.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 Click the Summary tab.
4 In the Connected Server dialog, click the Connect Server link.
5 Navigate to the Serengeti Management Server virtual machine within the Big Data Extensions vApp to
which you want to connect, select it, and click OK.
The Big Data Extensions plug-in communicates using SSL with the Serengeti Management Server. When you connect to a Serengeti server instance, the plug-in verifies that the SSL certificate in use by the server is installed, valid, and trusted.
The Serengeti server instance appears as the connected server in the Summary tab of the Big Data Extensions Home.
What to do next
You can add additional resource pool, datastore, and network resources to your Big Data Extensions deployment, and create Hadoop and HBase clusters that you can provision for use.

Install the Serengeti Remote Command-Line Interface Client

Although the Big Data Extensions Plug-in for vSphere Web Client supports basic resource and cluster management tasks, you can perform a greater number of the management tasks using the Serengeti Command-line Interface Client.
IMPORTANT You can only run Hadoop commands from the Serengeti CLI on a cluster running the Apache Hadoop 1.2.1 distribution. To use the command-line to run Hadoop administrative commands for clusters running other Hadoop distributions, such as cfg, fs, mr, pig, and hive, use a Hadoop client node to run these commands.
Prerequisites
- Verify that the Big Data Extensions vApp deployment was successful and that the Management Server is running.
- Verify that you have the correct user name and password to log into the Serengeti Command-line Interface Client.
  - If you are deploying on vSphere 5.1 or later, the Serengeti Command-line Interface Client uses your vCenter Single Sign-On credentials.
  - If you are deploying on vSphere 5.0, the Serengeti Command-line Interface Client uses the default vCenter Server administrator credentials.
- Verify that the Java Runtime Environment (JRE) is installed in your environment, and that its location is in your PATH environment variable.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 Click the Getting Started tab, and click the Download Serengeti CLI Console link.
A ZIP file containing the Serengeti Command-line Interface Client downloads to your computer.
4 Unzip and examine the download, which includes the following components in the cli directory.
- The serengeti-cli-version JAR file, which includes the Serengeti Command-line Interface Client.
- The samples directory, which includes sample cluster configurations.
- Libraries in the lib directory.
5 Open a command shell, and navigate to the directory where you unzipped the Serengeti Command-line
Interface Client download package.
6 Change to the cli directory, and run the following command to open the Serengeti Command-line
Interface Client:
java -jar serengeti-cli-version.jar
What to do next
To learn more about using the Serengeti Command-line Interface Client, see the VMware vSphere Big Data Extensions Command-line Interface Guide.

Access the Serengeti Command-Line Interface Using the Remote Command-Line Interface Client

You can access the Serengeti Command-Line Interface using the Serengeti Remote Command-Line Interface Client. The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management Server to deploy, manage, and use Hadoop.
IMPORTANT You can only run Hadoop commands from the Serengeti CLI on a cluster running the Apache Hadoop 1.2.1 distribution. To use the command-line to run Hadoop administrative commands for clusters running other Hadoop distributions, such as cfg, fs, mr, pig, and hive, use a Hadoop client node to run these commands.
Prerequisites
- Use the vSphere Web Client to log in to the vCenter Server instance on which you deployed the Serengeti vApp.
- Verify that the Big Data Extensions vApp deployment was successful and that the Serengeti Management Server is running.
- Verify that you have the correct password to log in to the Serengeti Command-Line Interface Client. See “Create a User Name and Password for the Serengeti Command-Line Interface,” on page 66.
  The Serengeti Command-Line Interface Client uses its vCenter Server credentials.
- Verify that the Java Runtime Environment (JRE) is installed in your environment and that its location is in your PATH environment variable.
Procedure
1 Open a Web browser to connect to the Serengeti Management Server cli directory.
http://ip_address/cli
2 Download the ZIP file for your version and build.
The filename is in the format VMware-Serengeti-cli-version_number-build_number.ZIP.
3 Unzip the download.
The download includes the following components.
- The serengeti-cli-version_number JAR file, which includes the Serengeti Remote Command-Line Interface Client.
- The samples directory, which includes sample cluster configurations.
- Libraries in the lib directory.
4 Open a command shell, and change to the directory where you unzipped the package.
5 Change to the cli directory, and run the following command to enter the Serengeti CLI.
java -jar serengeti-cli-version_number.jar
6 Connect to the Serengeti service.
You must run the connect host command every time you begin a command-line session, and after the 30 minute session timeout. You must run this command, or you cannot run any other commands.
a Run the connect command.
connect --host xx.xx.xx.xx:8443
b At the prompt, type your user name, which might be different from your login credentials for the
Serengeti Management Server.
c Type your password.
A command shell opens, and the Serengeti Command-Line Interface prompt appears.
- To display a list of available commands, type help.
- To get help for a specific command, append the name of the command to the help command.
  help cluster create
- Press the Tab key to complete a command.
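Putting these steps together, a first session typically looks like the following; the address is a placeholder for your Serengeti Management Server, and cluster list simply shows the clusters that the server currently manages.

connect --host 192.168.1.100:8443
cluster list
help cluster create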

Chapter 3  Upgrading Big Data Extensions

You can use VMware vSphere® Update Manager™ to upgrade Big Data Extensions from earlier versions.
Procedure
1 Prepare to Upgrade Big Data Extensions on page 34
As a prerequisite to upgrading Big Data Extensions, you must prepare your system to ensure that you have all necessary software installed and configured properly, and that all components are in the correct state.
2 Upgrade Big Data Extensions Virtual Appliance on page 35
You must perform several tasks to complete the upgrade of the Big Data Extensions virtual appliance. Because the versions of the virtual appliance, the Serengeti Command-Line Interface, and the Big Data Extensions plug-in must all be the same, it is important that you upgrade all components to the new version.
3 Upgrade the Big Data Extensions Plug-in on page 38
You must use the same version of the Serengeti Management Server and the Big Data Extensions plug-in.
4 Upgrade the Serengeti Command-Line Interface on page 38
You must upgrade the Serengeti Command-Line Interface because the Serengeti Command-Line Interface must be the same version as your Big Data Extensions deployment.
5 Upgrade the CentOS 6.x Template on page 39
If your clusters are deployed with a Hadoop Template virtual machine that has a customized version of the CentOS 6.x operating system that includes VMware Tools, you can upgrade the previous CentOS 6.x template so that it is compatible with the Big Data Extensions upgrade.
6 Upgrade Big Data Extensions Virtual Machine Components on page 40
To enable the Serengeti Management Server to manage clusters created in a previous version of Big Data Extensions, you must upgrade the components in each cluster's virtual machines. The Serengeti Management Server uses these components to control the cluster nodes.

Prepare to Upgrade Big Data Extensions

As a prerequisite to upgrading Big Data Extensions, you must prepare your system to ensure that you have all necessary software installed and configured properly, and that all components are in the correct state.
Data from nonworking Big Data Extensions deployments is not migrated during the upgrade process. If Big Data Extensions is not working and you cannot recover according to the troubleshooting procedures, do not try to perform the upgrade. Instead, uninstall the previous Big Data Extensions components and install the new version. See Chapter 2, “Installing Big Data Extensions,” on page 19.
IMPORTANT Do not delete any files in the /opt/serengeti/.chef directory. If you delete any of these files, such as the serengeti.pem file, subsequent upgrades to Big Data Extensions might fail without displaying error notifications.
Prerequisites
- Install vSphere Update Manager. For more information, see the vSphere Update Manager documentation.
- Verify that your previous Big Data Extensions deployment is working normally.
- Verify that you can create a default Hadoop cluster.
Procedure
1 Install vSphere Update Manager on a Windows Server.
   -  Use the same version of vSphere Update Manager as vCenter Server. For example, if you are using vCenter Server 5.5, use vSphere Update Manager 5.5.
   -  vSphere Update Manager requires network connectivity with vCenter Server. Each vSphere Update Manager instance must be registered with a single vCenter Server instance.
2 Log in to vCenter Server with the vSphere Web Client.
3 Power on the Hadoop Template virtual machine.
4 If the Serengeti Management Server is configured to use a static IP network, make sure that the Hadoop
Template virtual machine receives a valid IP address.
You must have a valid IP address and be connected to the network for the Hadoop Template virtual machine to connect to vSphere Update Manager.
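For example, you can confirm that an address was assigned from the Hadoop Template virtual machine console before you continue (a quick check, not one of the original steps; the interface name eth0 is assumed).
ifconfig eth0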
5 For each cluster that is in AUTO scaling mode, change the scaling mode to MANUAL.
a Open a command shell and log in to the Serengeti Management Server as user serengeti.
b Set the scaling mode of the cluster to MANUAL and the --targetComputeNodeNum parameter value
to the number of provisioned compute nodes in the cluster.
cluster setParam --name cluster-name --elasticityMode manual --targetComputeNodeNum num-provisioned-compute-nodes
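For example, for a hypothetical cluster named myCluster with eight provisioned compute nodes (both values are illustrative), the command would be:
cluster setParam --name myCluster --elasticityMode manual --targetComputeNodeNum 8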
6 Verify that all Hadoop clusters are in one of the following states:
   -  RUNNING
   -  STOPPED
   -  CONFIGURE_ERROR
If the status of the cluster is PROVISIONING, wait for the process to finish and for the state of the cluster to change to RUNNING.
7 Make sure that the host name of the Serengeti Management Server matches its fully qualified domain
name (FQDN).
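One quick way to compare the two (a sketch using standard Linux commands, not an original step) is to run the following on the Serengeti Management Server and confirm that both commands report the same fully qualified name.
hostname
hostname --fqdn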
8 Access the Admin view of vSphere Update Manager.
a Start vSphere Update Manager.
b On the Home page of the vSphere Web Client, select Hosts and Clusters.
c Click Update Manager.
d Open the Admin view.
Perform the upgrade tasks in the Admin view.

Upgrade Big Data Extensions Virtual Appliance

You must perform several tasks to complete the upgrade of the Big Data Extensions virtual appliance. Because the versions of the virtual appliance, the Serengeti Command-Line Interface, and the Big Data Extensions plug-in must all be the same, it is important that you upgrade all components to the new version.
Prerequisites
Complete the preparation steps for upgrading Big Data Extensions. See “Prepare to Upgrade Big Data
Extensions,” on page 34.
Procedure
1 Configure Proxy Settings on page 35
You must have access to the Internet to upgrade your Big Data Extensions virtual appliance. If your site uses a proxy server to access the Internet, you must configure vSphere Update Manager to use the proxy server.
2 Download the Upgrade Source and Accept the License Agreement on page 36
To start the Big Data Extensions upgrade process, you download the upgrade source from the URL that was supplied to you, either in an email notification or by your VMware representative, and accept the license agreement (EULA).
3 Create an Upgrade Baseline on page 36
When you upgrade Big Data Extensions virtual appliances, you must create a custom virtual appliance upgrade baseline.
4 Specify Upgrade Compliance Settings on page 37
Upgrade compliance settings ensure that the upgrade baseline does not conflict with the current state of your Big Data Extensions virtual appliance.
5 Configure the Upgrade Remediation Task and Run the Upgrade Process on page 37
The upgrade remediation task is the process by which vSphere Update Manager applies patches, extensions, and upgrades to the Big Data Extensions virtual appliance. You configure and run the remediation task to finish the Big Data Extensions virtual appliance upgrade process.

Configure Proxy Settings

You must have access to the Internet to upgrade your Big Data Extensions virtual appliance. If your site uses a proxy server to access the Internet, you must configure vSphere Update Manager to use the proxy server.
If you do not use a proxy server, continue to “Download the Upgrade Source and Accept the License
Agreement,” on page 36.
Prerequisites
Verify that you have obtained the values for the proxy server URL and port from your network administrator.
Procedure
1 In the Admin view of vSphere Update Manager, click Configuration and then select Download
Settings.
2 In the Proxy Settings section, click Use proxy.
3 Enter the values for the proxy URL and port.
4 Click Test Connection to ensure that the settings are correct.
5 If the settings are correct, click Apply.
vSphere Update Manager can now access the Web using the proxy server for your site.

Download the Upgrade Source and Accept the License Agreement

To start the Big Data Extensions upgrade process, you download the upgrade source from the URL that was supplied to you, either in an email notification or by your VMware representative, and accept the license agreement (EULA).
Prerequisites
Verify that you have the URL from which to download the upgrade source.
For information about the upgrade source and the URL where you can download the upgrade source, see the VMware knowledge base article at http://kb.vmware.com/ and search on article number 1004543.
Procedure
1 In the Admin view of vSphere Update Manager, click Configuration and then select Download
Settings.
2 On the Download Settings page, click Add Download Source.
3 Enter the upgrade source URL in the Source URL text box.
4 Click Validate URL to verify connectivity to the upgrade URL.
5 Click OK to add the download source to vSphere Update Manager.
6 Click Apply.
7 Click Download Now.
8 On the VA Upgrades tab, select the upgrade.
9 Click EULA to accept the end user license agreement.
The upgrade source is downloaded.

Create an Upgrade Baseline

When you upgrade Big Data Extensions virtual appliances, you must create a custom virtual appliance upgrade baseline.
Prerequisites
Verify that you are logged in to a vSphere Web Client as an administrator and that the vSphere Web Client is connected to a vCenter Server system with which vSphere Update Manager is registered.
Procedure
1 On the Baselines and Groups tab, click VMs/VAs to review the existing baselines and groups.
2 Click Create.
3 Enter a meaningful name, such as Big Data Extensions VA Upgrade 1.5, and click Next.
4 Click Add Multiple Rules to create a set of rules that determine the target upgrade version for virtual
appliances.
5 Review the baseline settings and click Finish.

Specify Upgrade Compliance Settings

Upgrade compliance settings ensure that the upgrade baseline does not conflict with the current state of your Big Data Extensions virtual appliance.
Prerequisites
Verify that you are logged in to a vSphere Web Client as an administrator and that the vSphere Web Client is connected to a vCenter Server system with which vSphere Update Manager is registered.
Procedure
1 In the vSphere Web Client, navigate to VMs and Templates and click Update Manager.
2 Open Compliance View and select the virtual appliance to upgrade.
3 Click Attach.
4 Select the upgrade baseline.
5 Click Attach again.
6 Verify that the virtual appliance needs to be updated.
a In the inventory list, right-click the baseline.
b Select Scan for Updates.
vSphere Update Manager scans the baseline against the virtual appliance and determines whether the virtual appliance is up-to-date with the latest Big Data Extensions version. A vSphere Update Manager scan result of 100 percent indicates that your Big Data Extensions version is up-to-date.
What to do next
If the Big Data Extensions virtual appliance is up-to-date, discontinue the upgrade process. If the Big Data Extensions virtual appliance is not up-to-date, continue the upgrade process. See “Configure the
Upgrade Remediation Task and Run the Upgrade Process,” on page 37.

Configure the Upgrade Remediation Task and Run the Upgrade Process

The upgrade remediation task is the process by which vSphere Update Manager applies patches, extensions, and upgrades to the Big Data Extensions virtual appliance. You configure and run the remediation task to finish the Big Data Extensions virtual appliance upgrade process.
Prerequisites
Verify that you are logged in to a vSphere Web Client as an administrator and that the vSphere Web Client is connected to a vCenter Server system with which vSphere Update Manager is registered.
NOTE The upgrade can take a few hours to finish.
Procedure
1 In the left pane of the VMs and Templates view, right-click the virtual appliance to upgrade and select
Remediate.
All virtual machines and appliances in the container are also remediated.
2 On the Remediation Selection page of the Remediate wizard, select the baseline group and upgrade
baselines to apply.
3 Select the virtual machines and appliances that you want to remediate and click Next.
4 On the Schedule page, enter a unique name and an optional description for the task.
5 Select Immediately to begin the upgrade process immediately after the configuration is finished and
click Next.
6 Configure the rollback options.
7 Specify the snapshot backup to roll back to and click Next.
8 Review the task definition and click Finish.
Big Data Extensions restarts when the upgrade remediation task finishes.
9 Verify that the Big Data Extensions virtual appliance upgrade was successful.
You can view the version of the Big Data Extensions virtual appliance in the vCenter client.
Some upgrade process errors are written to the Serengeti virtual appliance deployment logs in vCenter Server rather than appearing as error messages.

Upgrade the Big Data Extensions Plug-in

You must use the same version of the Serengeti Management Server and the Big Data Extensions plug-in.
Procedure
1 Open a Web browser and go to the URL of the Serengeti Management Server plug-in manager service.
https://management-server-ip-address:8443/register-plugin
2 Select Uninstall and click Submit.
3 Select Install.
4 Enter the information to register the new plug-in, and click Submit.

Upgrade the Serengeti Command-Line Interface

You must upgrade the Serengeti Command-Line Interface because the Serengeti Command-Line Interface must be the same version as your Big Data Extensions deployment.
Procedure
1 Log in to the vSphere Web Client.
2 Select Big Data Extensions from the navigation panel.
3 Click the Summary tab.
4 In the Connected Server panel, click Connect Server.
5 Select the Serengeti Management Server virtual machine in the Big Data Extensions vApp to which you
want to connect and click OK.
6 Click the Getting Started tab, and click Download Serengeti CLI Console.
A ZIP file containing the Serengeti Command-Line Interface Client downloads to your computer.
7 Unzip and examine the ZIP file, which includes the following components in the CLI directory:
-  The serengeti-cli-version JAR file, which includes the Serengeti Command-Line Interface Client.
-  The samples directory, which includes sample cluster configurations.
-  Libraries in the lib directory.
8 Open a command shell and navigate to the directory where you unzipped the
Serengeti Command-Line Interface Client download package.
9 Change to the CLI directory, and run the following command to open the
Serengeti Command-Line Interface Client:
java -jar serengeti-cli-version.jar
What to do next
1 If your clusters are deployed with a Hadoop Template virtual machine that has a customized version of
the CentOS 6.x operating system that includes VMware Tools, you must customize a new CentOS 6.x template to use after you upgrade Big Data Extensions. See “Upgrade the CentOS 6.x Template,” on page 39.
2 To enable the Serengeti Management Server to manage clusters that you created in a previous version
of Big Data Extensions, you must upgrade each cluster. See “Upgrade Big Data Extensions Virtual
Machine Components,” on page 40.

Upgrade the CentOS 6.x Template

If your clusters are deployed with a Hadoop Template virtual machine that has a customized version of the CentOS 6.x operating system that includes VMware Tools, you can upgrade the previous CentOS 6.x template so that it is compatible with the Big Data Extensions upgrade.
You can upgrade a CentOS 6.x template from a previous version of Big Data Extensions so that the CentOS
6.x template is compatible with the Big Data Extensions upgrade. Alternatively, you can create a new CentOS 6.x template virtual machine that has CentOS 6.x and VMware Tools and that is compatible with the Big Data Extensions upgrade. See “Create a Hadoop Template Virtual Machine using RHEL Server 6.x and
VMware Tools,” on page 57 .
Prerequisites
There must be a previous version of the CentOS 6.x template on the CentOS Template virtual machine.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Power on the CentOS Template virtual machine.
3 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as user
serengeti.
4 From the command shell on the Serengeti Management Server, copy the template upgrade scripts to the
CentOS Template virtual machine.
scp /opt/serengeti/www/nodeupgrade/serengeti-node-scripts.tar.gz serengeti@template_ip:/tmp
5 Open another command shell, such as Bash or PuTTY, and log in to the CentOS Template virtual
machine.
6 From the command shell on the CentOS Template virtual machine, run the template upgrade scripts.
cd /tmp
sudo tar xvzf ./serengeti-node-scripts.tar.gz -C /
sudo ln -sf /opt/serengeti/sbin/run_script_series /opt/serengeti/sbin/postupgrade
sudo bash /opt/serengeti/sbin/postupgrade
rm -rf /tmp/serengeti-node-scripts.tar.gz
cp -f /opt/serengeti/sbin/setup-ip.py /opt/vmware/sbin/setup-ip.py
7 Remove the /etc/udev/rules.d/70-persistent-net.rules file to prevent increasing the eth number
during the clone operation.
If you do not remove this file, virtual machines cloned from the template cannot get IP addresses. If you power on the Hadoop Template virtual machine to make changes, remove this file before shutting down this virtual machine.
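For example, from the command shell on the CentOS Template virtual machine, one way to remove the file is:
rm -f /etc/udev/rules.d/70-persistent-net.rules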
8 From the vSphere Web Client, power off the CentOS Template virtual machine.
9 From the vSphere Web Client, delete any snapshots named serengeti-snapshot from the CentOS
Template virtual machine.
10 Synchronize the Hadoop Template virtual machine's time with vCenter Server.
a In the vSphere Web Client, right-click the Hadoop Template virtual machine and select Edit
Settings.
b On the VM Options tab, click VMware Tools and select Synchronize guest time with host.
11 From the command shell on the Serengeti Management Server, restart the tomcat service.
sudo /sbin/service tomcat restart
What to do next
Upgrade the tools in your Hadoop and HBase clusters' virtual machines. See “Upgrade Big Data Extensions
Virtual Machine Components,” on page 40.

Upgrade Big Data Extensions Virtual Machine Components

To enable the Serengeti Management Server to manage clusters created in a previous version of Big Data Extensions, you must upgrade the components in each cluster's virtual machines. The Serengeti Management Server uses these components to control the cluster nodes.
When you upgrade from an earlier version of Big Data Extensions, clusters that you need to upgrade are shown with an alert icon next to the cluster name. When you click the alert icon, the message "Upgrade the cluster to the latest version" appears as a tooltip. See “View Provisioned Clusters in the vSphere Web Client,” on page 98.
You can also identify clusters that you need to upgrade by using the cluster list command. When you run the cluster list command, the word "Earlier" appears where the cluster version normally appears.
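For example, from the Serengeti Command-Line Interface (only the command is shown here; the output layout is not reproduced):
cluster list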
Prerequisites
You must be upgrading a cluster created using a previous version of Big Data Extensions.
Procedure
1 For each cluster that you created in a previous version of Big Data Extensions, make sure that all of the
cluster's nodes are in one of the following states: RUNNING, STOPPED, ERROR, CONFIGURE_ERROR, or UPGRADE_ERROR.
If a node does not have a valid IP address, it cannot be upgraded to the new version of Big Data Extensions virtual machine tools.
a Log in to the vSphere Web Client connected to vCenter Server and navigate to Hosts and Clusters.
b Select the cluster's resource pool, select the Virtual Machines tab, and power on the cluster's
virtual machines.
IMPORTANT It might take up to five minutes for vCenter Server to assign valid IP addresses to the Big Data cluster nodes. Do not perform the remaining upgrade steps until the nodes have received their IP addresses.
2 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as user
serengeti.
3 Run the cluster upgrade command for each cluster created in a previous version of Big Data
Extensions.
cluster upgrade --name cluster-name
4 If the upgrade fails for a node, make sure that the failed node has a valid IP address, and then rerun the
cluster upgrade command.
You can rerun the command as many times as you need to upgrade all the nodes.
What to do next
Stop and restart your Hadoop and HBase clusters.

Managing Hadoop Distributions 4

The Serengeti Management Server includes the Apache Hadoop distribution, but you can add any supported Hadoop distribution to your Big Data Extensions environment.
-  Hadoop Distribution Deployment Types on page 43
   You can choose which Hadoop distribution to use when you deploy a cluster. The type of distribution you choose determines how you configure it for use with Big Data Extensions. When you deploy the Big Data Extensions vApp, the Apache Hadoop 1.2.1 and Bigtop 0.7.0 distributions are included in the OVA that you download and deploy.
-  Configure a Tarball-Deployed Hadoop Distribution on page 44
   You can add and configure Hadoop distributions other than those included with the Big Data Extensions vApp using the command line. You can configure multiple Hadoop distributions from different vendors.
-  Configuring Yum and Yum Repositories on page 46
   You can deploy Cloudera CDH4, Intel, MapR, and Pivotal PHD Hadoop distributions using Yellowdog Updater, Modified (Yum). Yum enables automatic updates and package management of RPM-based software distributions. To deploy a Hadoop distribution using Yum, you must create and configure a Yum repository.
-  Create a Hadoop Template Virtual Machine using RHEL Server 6.x and VMware Tools on page 57
   You can create a Hadoop Template virtual machine that has a customized version of the RHEL Server 6.x operating system that includes VMware Tools. Although only a few Hadoop distributions require a custom version of RHEL Server 6.x, you can customize RHEL Server 6.x for any Hadoop distribution.
-  Maintain a Customized Hadoop Template Virtual Machine on page 60
   You can modify or update the Hadoop Template virtual machine operating system. When you make updates, you must remove the snapshot that is created by the virtual machine.

Hadoop Distribution Deployment Types

You can choose which Hadoop distribution to use when you deploy a cluster. The type of distribution you choose determines how you configure it for use with Big Data Extensions. When you deploy the Big Data Extensions vApp, the Apache Hadoop 1.2.1 and Bigtop 0.7.0 distributions are included in the OVA that you download and deploy.
Depending on which Hadoop distribution that you want to configure to use with Big Data Extensions, use either a tarball or Yum repository to install your distribution. The table lists the supported Hadoop distributions, and the distribution name, vendor abbreviation, and version number to use as input parameters when configuring the distribution for use with Big Data Extensions.
Table 4-1. Hadoop Deployment Types
Hadoop Distribution         Version Number     Vendor Abbreviation   Deployment Type   HVE Support?
Apache                      1.2.1              Apache                Tarball           Yes
Bigtop                      0.7.0 and later    BIGTOP                Yum               No
Pivotal HD                  1.1-2.0            PHD                   Yum               Yes
Hortonworks Data Platform   1.3.1              HDP                   Tarball           Yes
Hortonworks Data Platform   2.0-2.1.1          HDP                   Yum               No
Cloudera CDH4 and CDH5      4.3.0-5.0          CDH                   Yum               No
MapR                        3.0-3.1.0          MAPR                  Yum               No
Intel                       2.5-3.0.2          INTEL                 Yum               No
About Hadoop Virtualization Extensions
Hadoop Virtualization Extensions (HVE), developed by VMware, improves Hadoop performance in virtual environments by enhancing Hadoop’s topology awareness mechanism to account for the virtualization layer.
Configure Hadoop 2.x and Later Distributions with DNS Name Resolution
When creating clusters using Hadoop distributions based on Hadoop 2.0 and later, the DNS server in your network must provide forward and reverse FQDN/IP resolution. Without valid DNS and FQDN settings, the cluster creation process might fail, or the cluster is created but does not function. Hadoop distributions based on Hadoop 2.0 and later include Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, Intel 3.x, and Pivotal PHD 1.1 and later releases.
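A quick way to spot-check both directions of resolution before you create a cluster (the host name and IP address below are placeholders for a node on your own network) is to run nslookup for the name and for the address:
nslookup worker-node.example.com
nslookup 192.0.2.21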

Configure a Tarball-Deployed Hadoop Distribution

You can add and configure Hadoop distributions other than those included with the Big Data Extensions vApp using the command line. You can configure multiple Hadoop distributions from different vendors.
Refer to your Hadoop distribution vendor's Web site to obtain the download URLs to use for the components that you want to install. If you are behind a firewall, you might need to modify your proxy settings to allow the download. Before you install and configure tarball-based deployments, ensure that you have the vendor's URLs from which to download the different Hadoop components. Use these URLs as input parameters to the config-distro.rb configuration utility.
If you have a local Hadoop distribution and your server does not have access to the Internet, you can manually upload the distribution.
Prerequisites
-  Deploy the Big Data Extensions vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on page 23.
-  Review the different Hadoop distributions so you know which distribution name abbreviation, vendor name abbreviation, and version number to use as an input parameter, and whether the distribution supports Hadoop Virtualization Extension (HVE). See “Hadoop Distribution Deployment Types,” on page 43.
-  (Optional) Set the password for the Serengeti Management Server. See “Change the Password for the Serengeti Management Server,” on page 64.
Procedure
1 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as user
serengeti.
2 Run the /opt/serengeti/sbin/config-distro.rb Ruby script.
config-distro.rb --name distro_name --vendor vendor_name --version version_number
--hadoop hadoop_package_url --pig pig_package_url --hive hive_package_url
--hbase hbase_package_url --zookeeper zookeeper_package_URL --hve {true | false} --yes
Option                Description
--name                Name to identify the Hadoop distribution that you are downloading. For example, hdp for Hortonworks. This name can include alphanumeric characters ([a-z], [A-Z], [0-9]) and underscores ("_").
--vendor              Vendor name whose Hadoop distribution you want to use. For example, HDP for Hortonworks.
--version             Version of the Hadoop distribution that you want to use. For example, 1.3.
--hadoop              URL from which to download the Hadoop distribution tarball package from the Hadoop vendor's Web site.
--pig                 URL from which to download the Pig distribution tarball package from the Hadoop vendor's Web site.
--hive                URL from which to download the Hive distribution tarball package from the Hadoop vendor's Web site.
--hbase               (Optional) URL from which to download the HBase distribution tarball package from the Hadoop vendor's Web site.
--zookeeper           (Optional) URL from which to download the ZooKeeper distribution tarball package from the Hadoop vendor's Web site.
--hve {true | false}  (Optional) Specifies whether the Hadoop distribution supports HVE.
--yes                 (Optional) Specifies that all confirmation prompts from the config-distro.rb script are answered with a "yes" response.
The example downloads the tarball version of Hortonworks Data Platform (HDP), which consists of Hortonworks Hadoop, Hive, HBase, Pig, and ZooKeeper distributions. Note that you must provide the download URL for each of the software components you wish to configure for use with Big Data Extensions.
config-distro.rb --name hdp --vendor HDP --version 1.3.2
--hadoop http://public-repo-1.hortonworks.com/HDP/centos6/1.x/updates/1.3.2.0/tars/hadoop-1.2.0.1.3.2.0-111.tar.gz
--pig http://public-repo-1.hortonworks.com/HDP/centos6/1.x/updates/1.3.2.0/tars/pig-0.11.1.1.3.2.0-111.tar.gz
--hive http://public-repo-1.hortonworks.com/HDP/centos6/1.x/updates/1.3.2.0/tars/hive-0.11.0.1.3.2.0-111.tar.gz
--hbase http://public-repo-1.hortonworks.com/HDP/centos6/1.x/updates/1.3.2.0/tars/hbase-0.94.6.1.3.2.0-111-security.tar.gz
--zookeeper http://public-repo-1.hortonworks.com/HDP/centos6/1.x/updates/1.3.2.0/tars/zookeeper-3.4.5.1.3.2.0-111.tar.gz
--hve true
The script downloads the files.
3 When the download finishes, explore the /opt/serengeti/www/distros directory, which includes the
following directories and files.
Item               Description
name               Directory that is named after the distribution. For example, apache.
manifest           The manifest file generated by config-distro.rb that is used to download the Hadoop distribution.
manifest.example   Example manifest file. This file is available before you perform the download. The manifest file is a JSON file with three sections: name, version, and packages.
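The following is a simplified sketch of what a manifest entry can look like; config-distro.rb generates the actual file from the values that you supply, so treat the distribution name, version, tarball path, and roles shown here as illustrative only.
[
  {
    "name": "hdp",
    "version": "1.3.2",
    "packages": [
      {
        "tarball": "hdp/hadoop-1.2.0.1.3.2.0-111.tar.gz",
        "roles": ["hadoop_namenode", "hadoop_datanode"]
      }
    ]
  }
]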
4 To enable Big Data Extensions to use the added distribution, restart the tomcat service.
sudo /sbin/service tomcat restart
The Serengeti Management Server reads the revised manifest file and adds the distribution to those from which you can create a cluster.
5 Return to the Big Data Extensions Plug-in for vSphere Web Client, and click Hadoop Distributions to
verify that the Hadoop distribution is available to use to create a cluster.
The distribution and the corresponding role appear.
The distribution is added to the Serengeti Management Server, but is not installed in the Hadoop Template virtual machine. An agent that is preinstalled on each virtual machine copies the distribution components that you specify from the Serengeti Management Server to the nodes during the Hadoop cluster creation process.
What to do next
You can add datastore and network resources for the Hadoop clusters that you will create. See Chapter 6,
“Managing vSphere Resources for Hadoop and HBase Clusters,” on page 69.
You can create and deploy Hadoop or HBase clusters using your chosen Hadoop distribution. See “Create a
Hadoop or HBase Cluster in the vSphere Web Client,” on page 78.

Configuring Yum and Yum Repositories

You can deploy Cloudera CDH4, Intel, MapR, and Pivotal PHD Hadoop distributions using Yellowdog Updater, Modified (Yum). Yum enables automatic updates and package management of RPM-based software distributions. To deploy a Hadoop distribution using Yum, you must create and configure a Yum repository.
-  Yum Repository Configuration Values on page 47
   You use the Yum repository configuration values in a file that you create to update the Yum repositories used to install or update Hadoop software on CentOS and other operating systems that use Red Hat Package Manager (RPM).
-  Create a Local Yum Repository for Apache Bigtop, Cloudera, Hortonworks, and MapR Hadoop Distributions on page 50
   Although publicly available Yum repositories exist for Apache Bigtop, Cloudera, Hortonworks, and MapR distributions, creating your own Yum repository can result in faster download times and greater control over the repository.
-  Create a Local Yum Repository for the Intel Hadoop Distribution on page 52
   Intel does not provide a publicly available Yum repository. Creating your own Yum repository for Intel provides you with better access and control over installing and updating your Intel Hadoop distribution software.
-  Create a Local Yum Repository for the Pivotal Hadoop Distribution on page 54
   Pivotal does not provide a publicly available Yum repository. Creating your own Yum repository for Pivotal provides you with better access and control over installing and updating your Pivotal HD distribution software.
-  Configure a Yum-Deployed Hadoop Distribution on page 55
   You can install Hadoop distributions that use Yum repositories (as opposed to tarballs) for use with Big Data Extensions. When you create a cluster for a Yum-deployed Hadoop distribution, the Hadoop nodes download and install Red Hat Package Manager (RPM) packages from the distribution's official Yum repositories or your local Yum repositories.

Yum Repository Configuration Values

You use the Yum repository configuration values in a file that you create to update the Yum repositories used to install or update Hadoop software on CentOS and other operating systems that use Red Hat Package Manager (RPM).
About Yum Repository Configuration Values
To create a local Yum repository, you create a configuration file that identifies a distribution's file and package names to download and deploy. When you create the configuration file, you replace a set of placeholder values with values that correspond to your Hadoop distribution. The table lists the values to use for the Apache Bigtop, Cloudera, Hortonworks, Intel, MapR, and Pivotal distributions.
NOTE If you copy-and-paste values from the table, be sure to include all required information. Some values appear on two lines in the table, for example, "maprtech maprecosystem", and they must be combined into a single line when you use them.
Apache Bigtop Yum Repository Configuration Values
Table 4-2. Apache Bigtop Yum Repository Placeholder Values
Placeholder Value
repo_file_name bigtop.repo
package_info [bigtop]
name=Bigtop
enabled=1
gpgcheck=1
type=NONE
baseurl=http://bigtop.s3.amazonaws.com/releases/0.7.0/redhat/6/x86_64
gpgkey=http://archive.apache.org/dist/bigtop/KEYS
NOTE If you use a version other than 0.7.0, use the exact version number of your Apache Bigtop distribution in the pathname.
mirror_cmds reposync -r bigtop
default_rpm_dir bigtop
target_rpm_dir bigtop
local_repo_info [bigtop]
name=Apache Bigtop
baseurl=http://ip_of_yum_repo_webserver/bigtop/
enabled=1
gpgcheck=0
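For example, using the Apache Bigtop values above, the /etc/yum.repos.d/bigtop.repo file that you create in the procedure that follows would contain content along these lines:
[bigtop]
name=Bigtop
enabled=1
gpgcheck=1
type=NONE
baseurl=http://bigtop.s3.amazonaws.com/releases/0.7.0/redhat/6/x86_64
gpgkey=http://archive.apache.org/dist/bigtop/KEYS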
Cloudera Yum Repository Configuration Values
Table 4-3. Cloudera Yum Repository Placeholder Values
Placeholder Value
repo_file_name cloudera-cdh.repo
package_info If you use CDH4, use the values below.
[cloudera-cdh]
name=Cloudera's Distribution for Hadoop
baseurl=http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4/
gpgkey=http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
If you use CDH5, use the values below.
[cloudera-cdh]
name=Cloudera's Distribution for Hadoop
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
mirror_cmds reposync -r cloudera-cdh4
default_rpm_dir cloudera-cdh/RPMS
target_rpm_dir cdh/version_number
local_repo_info [cloudera-cdh]
name=Cloudera's Distribution for Hadoop
baseurl=http://ip_of_yum_repo_webserver/cdh/version_number/
enabled=1
gpgcheck=0
Intel Yum Repository Configuration Values
Table 4-4. Intel Yum Repository Placeholder Values
Placeholder Value
repo_file_name intel.repo
package_info Not Applicable
mirror_cmds Not Applicable
default_rpm_dir intel
target_rpm_dir intel/2
local_repo_info [intel]
name=Intel Hadoop Distribution 2.x
baseurl=http://ip_of_yum_repo_webserver/intel/2/idh
enabled=1
gpgcheck=0
Hortonworks Yum Repository Configuration Values
Table 4-5. Hortonworks Yum Repository Placeholder Values
Placeholder Value
repo_file_name hdp.repo
package_info [hdp]
name=Hortonworks Data Platform Version - HDP-2.1.1.0
baseurl=http://public-repo-1.hortonworks.com/HDP/centos6/2.x/GA/2.1.1.0
gpgcheck=1
gpgkey=http://public-repo-1.hortonworks.com/HDP/centos6/2.x/GA/2.1.1.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1
NOTE If you use a version other than HDP 2.1.1.0, use the exact version number of your Hortonworks distribution in the pathname.
mirror_cmds reposync -r hdp
default_rpm_dir hdp
target_rpm_dir hdp/2
local_repo_info [hdp]
name=Hortonworks Data Platform Version - HDP-2.1.1.0
baseurl=http://ip_of_yum_repo_webserver/hdp/2/
enabled=1
gpgcheck=0
MapR Yum Repository Configuration Values
Table 4-6. MapR Yum Repository Placeholder Values
Placeholder Value
repo_file_name mapr.repo
package_info [maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/3.1.0/redhat/
enabled=1
gpgcheck=0
protect=1
[maprecosystem]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/ecosystem/redhat
enabled=1
gpgcheck=0
protect=1
NOTE If you use a version other than 3.1.0, use the exact version number of your MapR distribution in the pathname.
mirror_cmds reposync -r maprtech
reposync -r maprecosystem
default_rpm_dir maprtech maprecosystem
target_rpm_dir mapr/3
local_repo_info [mapr]
name=MapR Version 3
baseurl=http://ip_of_yum_repo_webserver/mapr/3/
enabled=1
gpgcheck=0
protect=1
Pivotal Yum Repository Configuration Values
Table 4-7. Pivotal Yum Repository Placeholder Values
Placeholder Value
repo_file_name phd.repo
package_info Not Applicable
mirror_cmds Not Applicable
default_rpm_dir pivotal
target_rpm_dir phd/1
local_repo_info [pivotalhd]
name=PHD Version 1.0
baseurl=http://ip_of_yum_repo_webserver/phd/1/
enabled=1
gpgcheck=0

Create a Local Yum Repository for Apache Bigtop, Cloudera, Hortonworks, and MapR Hadoop Distributions

Although publicly available Yum repositories exist for Apache Bigtop, Cloudera, Hortonworks, and MapR distributions, creating your own Yum repository can result in faster download times and greater control over the repository.
Prerequisites
-  High-speed Internet access.
-  CentOS 6.x 64-bit or Red Hat Enterprise Linux (RHEL) 6.x 64-bit.
-  An HTTP server with which to create the Yum repository. For example, the Apache HTTP Server or lighttpd.
-  If there is a firewall on your system, ensure that the firewall does not block the network port number used by your HTTP server proxy. Typically, this is port 80.
-  Refer to the Yum repository placeholder values to populate the variables required in the steps. See “Yum Repository Configuration Values,” on page 47.
Procedure
1 If your Yum server requires an HTTP proxy server to connect to the Internet, open a command shell,
such as Bash or PuTTY, log in to the Serengeti Management Server, and run the commands to export the http_proxy environment variable.
# switch to root user
sudo su
export http_proxy=http://proxy_server:port
Option          Description
proxy_server    The IP address or domain name of the proxy server.
port            The network port number to use with the proxy server.
2 Install the Web server that you want to use as a Yum server.
This example installs the Apache Web Server and enables the httpd server to start whenever the machine is restarted.
yum install -y httpd
/sbin/service httpd start
/sbin/chkconfig httpd on
3 If they are not installed, install the Yum utils and createrepo packages.
The Yum utils package includes the reposync command.
yum install -y yum-utils createrepo
4 Synchronize the Yum server with the official Yum repository of your preferred Hadoop vendor.
a Using a text editor, create the file /etc/yum.repos.d/$repo_file_name.
b Add the package_info content to the new file.
c Mirror the remote yum repository to the local machine by running the mirror_cmds for your
distribution packages.
It might take several minutes to download the RPMs from the remote repository. The RPMs are placed in the $default_rpm_dir directories.
5 Create the local Yum repository.
a Move the RPMs to a new directory under the Apache Web Server document root.
The default Apache document root is /var/www/html/. If you use the Serengeti Management Server as your Yum server machine, the document root is /opt/serengeti/www/.
doc_root=/var/www/html
mkdir -p $doc_root/$target_rpm_dir
mv $default_rpm_dir $doc_root/$target_rpm_dir/
For example, for the MapR Hadoop distribution, run the following mv command:
mv maprtech maprecosystem $doc_root/mapr/3/
b Create a Yum repository for the RPMs.
cd $doc_root/$target_rpm_dir
createrepo .
c Create a new file, $doc_root/$target_rpm_dir/$repo_file_name, and include the local_repo_info.
d From a different machine, ensure that you can download the repo file from
http://ip_of_webserver/target_rpm_dir/repo_file_name.
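For example, to verify the MapR repository used earlier in this procedure (substitute your own Web server address and repository path), you might run:
wget http://ip_of_yum_repo_webserver/mapr/3/mapr.repo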
6 (Optional) Configure HTTP proxy.
If the virtual machines created by the Serengeti Management Server do not need an HTTP proxy to connect to the local Yum repository, skip this step.
On the Serengeti Management Server, edit the /opt/serengeti/conf/serengeti.properties file and add the following content anywhere in the file or replace existing items:
# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>
# set the FQDNs (or IPs if no FQDN) of the Serengeti Management Server and the local yum
# repository servers for 'serengeti.no_proxy'. The wildcard for matching multi IPs doesn't work.
serengeti.no_proxy = serengeti_server_fqdn_or_ip.yourdomain.com, yum_server_fqdn_or_ip.yourdomain.com
What to do next
Configure your Apache Bigtop, Cloudera, Hortonworks, or MapR deployment for use with Big Data Extensions. See “Configure a Yum-Deployed Hadoop Distribution,” on page 55.

Create a Local Yum Repository for the Intel Hadoop Distribution

Intel does not provide a publicly available Yum repository. Creating your own Yum repository for Intel provides you with better access and control over installing and updating your Intel Hadoop distribution software.
Intel does not provide a publicly accessible Yum repository from which you can deploy and upgrade the Intel Hadoop software distribution. You might want to download the Intel software tarballs, and create your own Yum repository from which to deploy and configure the Intel Hadoop software.
Prerequisites
-  High-speed Internet access.
-  CentOS 6.x 64-bit operating system.
   NOTE Because the Intel Hadoop distribution requires CentOS 6.1 or later 64-bit version (x86_64), the Yum server that you create to deploy the distribution must also use a CentOS 6.x 64-bit operating system.
-  An HTTP server with which to create the Yum repository. For example, the Apache Web Server.
-  If there is a firewall on your system, ensure that the firewall does not block the network port number used by your HTTP server proxy. Typically, this is port 80.
Procedure
1 If your Yum server requires an HTTP proxy server, open a command shell, such as Bash or PuTTY, log
in to the Serengeti Management Server, and run the commands to export the http_proxy environment variable.
# switch to root user
sudo su
export http_proxy=http://proxy_server:port
Option          Description
proxy_server    The IP address or domain name of the proxy server.
port            The network port number to use with the proxy server.
2 Install the Web server that you want to use with a Yum server.
This example installs the Apache Web Server and enables the httpd server to start whenever the machine is restarted.
yum install -y httpd
/sbin/service httpd start
/sbin/chkconfig httpd on
3 If they are not installed, install the Yum utils and createrepo packages.
The Yum utils package includes the reposync command.
yum install -y yum-utils createrepo
4 Download Intel Hadoop 2.5.1 from the Intel Web site.
5 Extract the tarball that you downloaded.
tar -xzf intelhadoop-2.5.1-20659-zh.el6.x86_64.tar.gz
6 omit
7 Create and configure the local Yum repository.
a Move the RPMs to a new directory under the Apache Web Server document root.
The default Apache document root is /var/www/html/. If you use the Serengeti Management Server as your Yum server machine, the document root is /opt/serengeti/www/.
doc_root=/var/www/html
mkdir -p $doc_root/$target_rpm_dir
mv $default_rpm_dir $doc_root/$target_rpm_dir/
This example moves the RPMs for the Intel Hadoop distribution.
mv intelhadoop/idh/ $doc_root/intel/2/
b Create a file, $doc_root/$target_rpm_dir/intelhadoop.repo, and include the local_repo_info.
8 (Optional) Configure an HTTP proxy.
If the virtual machines created by the Serengeti Management Server do not need an HTTP proxy to connect to the local Yum repository, skip this step.
On the Serengeti Management Server, edit the /opt/serengeti/conf/serengeti.properties file, and add the following content anywhere in the file or replace existing items:
# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>
# set the FQDNs (or IPs if no FQDN) of the Serengeti Management Server and the local yum
# repository servers for 'serengeti.no_proxy'. The wildcard for matching multi IPs doesn't work.
serengeti.no_proxy = serengeti_server_fqdn_or_ip.yourdomain.com, yum_server_fqdn_or_ip.yourdomain.com
What to do next
After you create a Yum repository for your Intel Hadoop software, configure your Intel Hadoop deployment for use with Big Data Extensions. See “Configure a Yum-Deployed Hadoop Distribution,” on page 55.

Create a Local Yum Repository for the Pivotal Hadoop Distribution

Pivotal does not provide a publicly available Yum repository. Creating your own Yum repository for Pivotal provides you with better access and control over installing and updating your Pivotal HD distribution software.
Pivotal does not provide a publicly accessible Yum repository from which you can deploy and upgrade the Pivotal Hadoop software distribution. You might want to download the Pivotal software tarballs, and create your own Yum repository from which to deploy and configure the Pivotal Hadoop software.
Prerequisites
-  High-speed Internet access.
-  CentOS 6.x 64-bit operating system.
   NOTE Because the Pivotal Hadoop distribution requires CentOS 6.2 64-bit version or 6.4 64-bit version (x86_64), the Yum server that you create to deploy the distribution must also use a CentOS 6.x 64-bit operating system.
-  An HTTP server with which to create the Yum repository. For example, the Apache HTTP Server or lighttpd.
-  If there is a firewall on your system, ensure that the firewall does not block the network port number used by your HTTP server proxy. Typically, this is port 80.
Procedure
1 If your Yum server requires an HTTP proxy server, open a command shell, such as Bash or PuTTY, log
in to the Serengeti Management Server, and run the commands to export the http_proxy environment variable.
# switch to root user
sudo su
export http_proxy=http://proxy_server:port
Option          Description
proxy_server    The IP address or domain name of the proxy server.
port            The network port number to use with the proxy server.
2 Install the Web server that you want to use with a Yum server.
This example installs the Apache Web Server and enables the httpd server to start whenever the machine is restarted.
yum install -y httpd
/sbin/service httpd start
/sbin/chkconfig httpd on
3 If they are not installed, install the Yum utils and createrepo packages.
The Yum utils package includes the reposync command.
yum install -y yum-utils createrepo
4 Download the Pivotal HD 1.0 tarball from the Pivotal Web site.
5 Extract the tarball that you downloaded.
tar -xf phd_1.0.1.0-19_community.tar
6 Extract PHD_1.0.1_CE/PHD-1.0.1.0-19.tar to the default_rpm_dir directory.
For Pivotal Hadoop the default_rpm_dir directory is pivotal.
The version numbers of the tar that you extract might be different from those used in the example if an update has occurred.
tar -xf PHD_1.0.1_CE/PHD-1.0.1.0-19.tar -C pivotal
7 Create and configure the local Yum repository.
a Move the RPMs to a new directory under the Apache Web Server document root.
The default Apache document root is /var/www/html/. If you use the Serengeti Management Server as your Yum server machine, the document root is /opt/serengeti/www/.
doc_root=/var/www/html
mkdir -p $doc_root/$target_rpm_dir
mv $default_rpm_dir $doc_root/$target_rpm_dir/
This example moves the RPMs for the Pivotal Hadoop distribution.
mv pivotal $doc_root/phd/1/
b Create a Yum repository for the RPMs.
cd $doc_root/$target_rpm_dir
createrepo .
c Create a file, $doc_root/$target_rpm_dir/$repo_file_name, and include the local_repo_info.
d From a different machine, ensure that you can download the repository file from
http://ip_of_webserver/$target_rpm_dir/$repo_file_name.
8 (Optional) Configure an HTTP proxy.
If the virtual machines created by the Serengeti Management Server do not need an HTTP proxy to connect to the local Yum repository, skip this step.
On the Serengeti Management Server, edit the /opt/serengeti/conf/serengeti.properties file, and add the following content anywhere in the file or replace existing items:
# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>
# set the FQDNs (or IPs if no FQDN) of the Serengeti Management Server and the local yum
# repository servers for 'serengeti.no_proxy'. The wildcard for matching multi IPs doesn't work.
serengeti.no_proxy = serengeti_server_fqdn_or_ip.yourdomain.com, yum_server_fqdn_or_ip.yourdomain.com

Configure a Yum-Deployed Hadoop Distribution

You can install Hadoop distributions that use Yum repositories (as opposed to tarballs) for use with Big Data Extensions. When you create a cluster for a Yum-deployed Hadoop distribution, the Hadoop nodes download and install Red Hat Package Manager (RPM) packages from the distribution's official Yum repositories or your local Yum repositories.
Prerequisites
-  High-speed Internet access from the Serengeti Management Server to download RPM packages from your chosen Hadoop distribution's official Yum repository. If you do not have adequate Internet access to download the RPM packages from your Hadoop vendor's Yum repository, you can create a local Yum repository for your Hadoop distribution.
-  Review the different Hadoop distributions so that you know which distribution name, vendor abbreviation, and version number to use as an input parameter, and whether the distribution supports Hadoop Virtualization Extensions. See “Hadoop Distribution Deployment Types,” on page 43.
-  Create a local Yum repository for your Hadoop distribution. Creating your own repository can result in better access and more control over the repository. See “Configuring Yum and Yum Repositories,” on page 46.
Procedure
1 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as user
serengeti.
2 Run the /opt/serengeti/sbin/config-distro.rb Ruby script.
config-distro.rb --name distro_name --vendor vendor_abbreviation --version ver_number
--repos http://url_to_yum_repo/name.repo
Option      Description
--name      Name to identify the Hadoop distribution that you are downloading. For example, cdh4 for Cloudera CDH4. This name can include alphanumeric characters ([a-z], [A-Z], [0-9]) and underscores ("_").
--vendor    Abbreviation of the vendor name whose Hadoop distribution you want to use. For example, CDH.
--version   Version of the Hadoop distribution that you want to use. For example, 4.6.0.
--repos     URL from which to download the Hadoop distribution Yum package. This URL can be a local Yum repository that you create or a publicly accessible Yum repository hosted by the software vendor.
This example adds the Apache Bigtop Hadoop Distribution to Big Data Extensions.
config-distro.rb --name bigtop --vendor BIGTOP --version 0.8.0
--repos http://url_to_yum_repo/bigtop.repo
The example adds the Cloudera CDH4 Hadoop distribution to Big Data Extensions.
config-distro.rb --name cdh4 --vendor CDH --version 4.6.0 --repos http://url_to_yum_repo/cloudera-cdh4.repo
NOTE The config-distro.rb script downloads files only for tarball-deployed distributions. No files are downloaded for Yum-deployed distributions.
This example adds the Hortonworks Hadoop Distribution to Big Data Extensions.
config-distro.rb --name hdp --vendor HDP --version 2.1.1
--repos http://url_to_yum_repo/hdp.repo
The example adds the MapR Hadoop distribution to Big Data Extensions.
config-distro.rb --name mapr --vendor MAPR --version 3.1.0 --repos http://url_to_yum_repo/mapr.repo
This example adds the Intel Hadoop Distribution to Big Data Extensions.
config-distro.rb --name intel --vendor INTEL --version 2.5.1
--repos http://url_to_yum_repo/intel.repo
This example adds the Pivotal Hadoop Distribution to Big Data Extensions.
config-distro.rb --name phd --vendor PHD --version 2.0
--repos http://url_to_yum_repo/phd.repo
3 To enable Big Data Extensions to use the new distribution, restart the Tomcat service.
sudo /sbin/service tomcat restart
The Serengeti Management Server reads the revised manifest file, and adds the distribution to those from which you can create a cluster.
4 Return to the Big Data Extensions Plug-in for vSphere Web Client, and click Hadoop Distributions to
verify that the Hadoop distribution is available.
What to do next
You can create Hadoop and HBase clusters. See Chapter 7, “Creating Hadoop and HBase Clusters,” on page 75.

Create a Hadoop Template Virtual Machine using RHEL Server 6.x and VMware Tools

You can create a Hadoop Template virtual machine that has a customized version of the RHEL Server 6.x operating system that includes VMware Tools. Although only a few Hadoop distributions require a custom version of RHEL Server 6.x, you can customize RHEL Server 6.x for any Hadoop distribution.
You can create a Hadoop Template virtual machine that uses RHEL Server 6.1 or later as the guest operating system into which you can install VMware Tools for RHEL 6.x in combination with a supported Hadoop distribution. This allows you to create a Hadoop Template virtual machine that uses your organization's operating system configuration. When you provision Big Data clusters using the customized Hadoop template, the VMware Tools for RHEL 6.x will be in the virtual machines that are created from the Hadoop Template virtual machine.
If you create Hadoop Template virtual machines with multiple cores per socket, the number of vCPUs that you specify for the virtual machine must be a multiple of the number of cores per socket. For example, if the virtual machine uses two cores per socket, the vCPU setting must be an even number, such as 4, 8, or 12. If you specify an odd number, cluster provisioning or CPU resizing fails.
Prerequisites
-  Deploy the Big Data Extensions vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on page 23.
-  Obtain the IP address of the Serengeti Management Server.
-  Locate the VMware Tools version that corresponds to the ESXi version in your data center.
Procedure
1 Create a virtual machine template with a 20GB thin provisioned disk, and install RHEL 6.x.
a Download the RHEL Server 6.x installation ISO from www.redhat.com to a datastore.
b From vCenter Server, create a virtual machine template with a 20GB thin provision disk and select
Red Hat Enterprise Linux 6 (64-bit) as the Guest OS.
c Right-click the virtual machine, select CD Device, and select the datastore ISO file for the RHEL
ISO file.
d Under Device Status, select connected and connect at power on, and click OK.
e From the console window, install the RHEL Server 6.x operating system using the default settings.
See the Red Hat Enterprise Linux Installation Guide, available on the Red Hat website, for more information.
You can select the language and time zone you want the operating system to use, and you can specify that the swap partition use a smaller size to save disk space (for example, 500MB). The swap partition is not used by Big Data Extensions, so you can safely reduce the size.
From the Package Installation Defaults screen, select Minimal.
The new operating system installs into the virtual machine template.
2 Run the ifconfig command to ensure that the virtual machine has a valid IP and Internet connectivity.
This task assumes the use of Dynamic Host Configuration Protocol (DHCP).
-  If IP address information appears, skip to Step 4.
-  If no IP address information appears, which can happen when the network interface is not yet configured for DHCP, continue with Step 3.
3 Configure the network.
a Using a text editor open the /etc/sysconfig/network-scripts/ifcfg-eth0 file.
b Locate the following parameters and specify the following configuration.
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=dhcp
c Save your changes and close the file.
d Restart the network service.
sudo /sbin/service network restart
e Run the ifconfig command to ensure that the virtual machine has a valid IP and Internet
connectivity.
4 Install the latest JDK 7 RPM.
a From the Oracle® Java SE 7 Downloads page, download the latest JDK 7 Linux x64 RPM and copy
it to the virtual machine template's root folder.
b Install the RPM.
rpm -Uvh jdk-7uxx-linux-x64.rpm
c Delete the RPM file.
rm -f jdk-7uxx-linux-x64.rpm
d Edit /etc/environment and add the following line: JAVA_HOME=/usr/java/default
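To confirm that the JDK is installed on the expected path (a quick check, not one of the original steps), you can run:
/usr/java/default/bin/java -version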
5 Install VMware Tools for RHEL 6.x.
a Right-click the RHEL 6 virtual machine in Big Data Extensions, then select Guest > Install/Upgrade
VMware Tools.
b Log in to the virtual machine and mount the CD-ROM to access the VMware Tools installation
package.
mkdir /mnt/cdrom
mount /dev/cdrom /mnt/cdrom
mkdir /tmp/vmtools
cd /tmp/vmtools
c Run the tar xf command to extract the VMware Tools package tar file.
tar xf /mnt/cdrom/VMwareTools-*.tar.gz
d Make vmware-tools-distrib your working directory, and run the vmware-install.pl script.
./vmware-install.pl
Press Enter to finish the installation.
e Remove the vmtools temporary (temp) file that is created as an artifact of the installation process.
rm -rf /tmp/vmtools
6 (Optional) In the vSphere Web Client, right-click the virtual machine and select Snapshot > Take
Snapshot.
Create a snapshot to use for recovery operations.
7 Deploy the Big Data Extensions vApp.
8 Run the installation scripts to customize the virtual machine.
a Register the RHEL operating system to enable the RHEL Yum repositories. This allows the
installation script to download packages from the Yum repository. See "Registering from the Command Line" in the Red Hat Enterprise Linux 6 Deployment Guide, available on the Red Hat website.
b Download the scripts from https://deployed_serengeti_server_IP/custos/custos.tar.gz.
c Create the directory /tmp/custos, make this your working directory, and run tar xf to uncompress
the tar file.
mkdir /tmp/custos
cd /tmp/custos
tar xf /tmp/custos/custos.tar.gz
d Run the installer.sh script specifying the /usr/java/default directory path.
./installer.sh /usr/java/default
You must use the same version of the installer.sh script as your Big Data Extensions deployment.
9 Remove the /etc/udev/rules.d/70-persistent-net.rules file to prevent increasing the eth number
during the clone operation.
If you do not remove this file, virtual machines cloned from the template cannot get IP addresses. If you power on the Hadoop Template virtual machine to make changes, remove this file before shutting down this virtual machine.
10 Shut down virtual machine.
11 If you created a snapshot as described in Step 6, delete it. In the vSphere Web Client, right-click the
virtual machine, select Snapshot > Snapshot Manager, select the serengeti-snapshot, and click Delete.
12 Synchronize the Hadoop Template virtual machine's time with vCenter Server.
a In the vSphere Web Client, right-click the Hadoop Template virtual machine and select Edit
Settings.
b On the VM Options tab, click VMware Tools and select Synchronize guest time with host.
13 In the vSphere Web Client, edit the template settings, and deselect (uncheck) all devices.
14 Replace the original Hadoop Template virtual machine with the customized virtual machine that you
created.
a Move the original Hadoop Template virtual machine out of the vApp.
b Drag the new template virtual machine that you just created into the vApp.
15 Log in to the Serengeti Management Server as the user serengeti, and restart the Tomcat service.
sudo /sbin/service tomcat restart
Restarting the Tomcat service enables the custom RHEL virtual machine template, making it your Hadoop Template virtual machine.
What to do next
To modify or update the Hadoop Template virtual machine operating system, remove the serengeti-snapshot that is created each time you shut down and restart the virtual machine. See “Maintain a Customized Hadoop Template Virtual Machine,” on page 60.

Maintain a Customized Hadoop Template Virtual Machine

You can modify or update the Hadoop Template virtual machine operating system. When you make updates, you must remove the snapshot that is created by the virtual machine.
If you create a custom Hadoop Template virtual machine that uses a version of RHEL 6.x, or modify the operating system, you must remove the serengeti-snapshot that Big Data Extensions creates. If you do not remove the serengeti-snapshot, changes you made to the Hadoop Template virtual machine will not take effect.
Prerequisites
■ Deploy the Big Data Extensions vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on page 23.
■ Create a customized Hadoop Template virtual machine using RHEL 6.x. See “Create a Hadoop Template Virtual Machine using RHEL Server 6.x and VMware Tools,” on page 57.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Power on the Hadoop Template virtual machine, and apply changes or updates.
3 Remove the /etc/udev/rules.d/70-persistent-net.rules file to prevent increasing the eth number
during the clone operation.
If you do not remove this file, virtual machines cloned from the template cannot get IP addresses. If you power on the Hadoop Template virtual machine to make changes, remove this file before shutting down this virtual machine.
4 From the vSphere Web Client, shut down the Hadoop Template virtual machine.
5 Delete the snapshot labeled serengeti-snapshot from the customized Hadoop Template virtual machine.
a In the vSphere Web Client, right-click the Hadoop Template virtual machine and select Snapshot >
Snapshot Manager
b Select the serengeti-snapshot, and click Delete.
The generated snapshot is removed.
6 Synchronize the Hadoop Template virtual machine's time with vCenter Server.
a In the vSphere Web Client, right-click the Hadoop Template virtual machine and select Edit
Settings.
b On the VM Options tab, click VMware Tools and select Synchronize guest time with host.
Chapter 5 Managing the Big Data Extensions Environment
After you install Big Data Extensions, you can stop and start the Serengeti services, create user accounts, manage passwords, update SSL certificates, and log in to cluster nodes to perform troubleshooting.
■ Add Specific User Names to Connect to the Serengeti Management Server on page 64
  You can add specific user names with which to log in to the Serengeti Management Server. The user names you add are the only users who can connect to the Serengeti Management Server using the Serengeti Command-Line Interface or the Big Data Extensions user interface for use with the vSphere Web Client.
■ Change the Password for the Serengeti Management Server on page 64
  When you power on the Serengeti Management Server for the first time, it generates a random password that is used for the root and serengeti users. If you want an easier to remember password, you can use the virtual machine console to change the random password for the root and serengeti users.
■ Configure vCenter Single Sign-On Settings for the Serengeti Management Server on page 65
  If the Big Data Extensions Single Sign-On (SSO) authentication settings are not configured or if they change after you install the Big Data Extensions plug-in, you can use the Serengeti Management Server Administration Portal to enable SSO, update the certificate, and register the plug-in so that you can connect to the Serengeti Management Server and continue managing clusters.
■ Create a User Name and Password for the Serengeti Command-Line Interface on page 66
  The Serengeti Command-Line Interface Client uses the vCenter Server login credentials with read permissions on the Serengeti Management Server. If you do not create a user name and password for the Serengeti Command-Line Interface Client, it uses the default vCenter Server administrator credentials. However, for security reasons, it is best to create a user account specifically for use with the Serengeti Command-Line Interface Client.
■ Stop and Start Serengeti Services on page 66
  You can stop and start Serengeti services to make a reconfiguration take effect, or to recover from an operational anomaly.

Add Specific User Names to Connect to the Serengeti Management Server

You can add specific user names with which to log in to the Serengeti Management Server. The user names you add are the only users who can connect to the Serengeti Management Server using the Serengeti Command-Line Interface or the Big Data Extensions user interface for use with the vSphere Web Client.
Prerequisites
■ Deploy the Serengeti vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on page 23.
■ Use the vSphere Web Client to log in to vCenter Server, and verify that the Serengeti Management Server virtual machine is running.
Procedure
1 Right-click the Serengeti Management Server virtual machine and select Open Console.
The password for the Serengeti Management Server appears.
NOTE If the password scrolls off the console screen, press Ctrl+D to return to the command prompt.
2 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as user
serengeti.
Use the IP address that appears in the Summary tab and the current password.
3 Edit the /opt/serengeti/conf/Users.xml file to add additional user names.
vi /opt/serengeti/conf/Users.xml
4 Edit the <user name="*" /> attribute, replacing the asterisk (*) wildcard character with the user name
you wish to use. You can add multiple user names by adding a new <user name="name" /> attribute on its own line. The Users.xml file supports multiple lines.
<user name="jsmith" />
<user name="sjones" />
<user name="jlydon" />
5 Restart the Tomcat service.
/sbin/service tomcat restart
Only the user names you add to the Users.xml file can be used to log in to the Serengeti Management Server using the Serengeti Command-Line Interface or the Big Data Extensions user interface for use with vSphere Web Client.
What to do next
You can create a new user name and password for the Serengeti Command-Line Interface. See “Create a
User Name and Password for the Serengeti Command-Line Interface,” on page 66.

Change the Password for the Serengeti Management Server

When you power on the Serengeti Management Server for the first time, it generates a random password that is used for the root and serengeti users. If you want an easier to remember password, you can use the virtual machine console to change the random password for the root and serengeti users.
NOTE You can change the password for any node's virtual machine by using this procedure.
Prerequisites
■ Deploy the Serengeti vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on page 23.
■ Use the vSphere Web Client to log in to vCenter Server, and verify that the Serengeti Management Server virtual machine is running.
Procedure
1 Right-click the Serengeti Management Server virtual machine and select Open Console.
The password for the Serengeti Management Server appears.
NOTE If the password scrolls off the console screen, press Ctrl+D to return to the command prompt.
2 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as user
serengeti.
Use the IP address that appears in the Summary tab and the current password.
3 Use the /opt/serengeti/sbin/set-password command to change the password for the root user and the
serengeti user.
sudo /opt/serengeti/sbin/set-password -u
4 Enter a new password, and enter it again to confirm.
The next time you log in to the Serengeti Management Server, use the new password.
What to do next
You can create a new user name and password for the Serengeti Command-Line Interface Client. See
“Create a User Name and Password for the Serengeti Command-Line Interface,” on page 66.

Configure vCenter Single Sign-On Settings for the Serengeti Management Server

If the Big Data Extensions Single Sign-On (SSO) authentication settings are not configured or if they change after you install the Big Data Extensions plug-in, you can use the Serengeti Management Server Administration Portal to enable SSO, update the certificate, and register the plug-in so that you can connect to the Serengeti Management Server and continue managing clusters.
The SSL certificate for the Big Data Extensions plug-in can change for many reasons. For example, you install a custom certificate or replace an expired certificate.
Prerequisites
■ Ensure that you know the IP address of the Serengeti Management Server to which you want to connect.
■ Ensure that you have login credentials for the Serengeti Management Server root user.
Procedure
1 Open a Web browser and go the URL of the Serengeti Management Server Administration Portal.
https://management-server-ip-address:5480
2 Type root for the user name, type the password, and click Login.
3 Select the SSO tab.
4 Do one of the following.
Option                          Description
Update the certificate          Click Update Certificate.
Enable SSO for the first time   Type the Lookup Service URL, and click Enable SSO.
The Big Data Extensions and vCenter SSO server certificates are synchronized.
What to do next
Reregister the Big Data Extensions plug-in with the Serengeti Management Server. See “Connect to a Serengeti Management Server,” on page 28.

Create a User Name and Password for the Serengeti Command-Line Interface

The Serengeti Command-Line Interface Client uses the vCenter Server login credentials with read permissions on the Serengeti Management Server. If you do not create a user name and password for the Serengeti Command-Line Interface Client, it uses the default vCenter Server administrator credentials. However, for security reasons, it is best to create a user account specifically for use with the Serengeti Command-Line Interface Client.
Prerequisites
■ Deploy the Big Data Extensions vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on page 23.
■ Install the Serengeti Command-Line Interface Client. See “Install the Serengeti Remote Command-Line Interface Client,” on page 29.
Procedure
1 Open a Web browser and go to: https://vc-hostname:port/vsphere-client.
The vc-hostname can be either the DNS host name or IP address of vCenter Server. By default the port is 9443, but this can change during the installation of the vSphere Web Client.
2 Type the user name and password of a user that has administrative privileges on vCenter Server, and click Login.
NOTE vCenter Server 5.5 users must use a local domain to perform SSO related operations.
3 From the vSphere Web Client Navigator panel, select Administration > SSO Users and Groups.
4 Change the login credentials.
The login credentials are updated. The next time you access the Serengeti Command-Line Interface, use the new login credentials.
What to do next
You can change the password of the Serengeti Management Server. See “Change the Password for the
Serengeti Management Server,” on page 64.

Stop and Start Serengeti Services

You can stop and start Serengeti services to make a reconfiguration take effect, or to recover from an operational anomaly.
Procedure
1 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as user
serengeti.
2 Run the serengeti-stop-services.sh script to stop the Serengeti services.
serengeti-stop-services.sh
3 Run the serengeti-start-services.sh script to start the Serengeti services.
serengeti-start-services.sh
Chapter 6 Managing vSphere Resources for Hadoop and HBase Clusters
Big Data Extensions lets you manage the resource pools, datastores, and networks that you use in the Hadoop and HBase clusters that you create.
■ Add a Resource Pool with the Serengeti Command-Line Interface on page 70
  You add resource pools to make them available for use by Hadoop clusters. Resource pools must be located at the top level of a cluster. Nested resource pools are not supported.
■ Remove a Resource Pool with the Serengeti Command-Line Interface on page 70
  You can remove resource pools from Serengeti that are not in use by a Hadoop cluster. You remove resource pools when you do not need them or if you want the Hadoop clusters you create in the Serengeti Management Server to be deployed under a different resource pool. Removing a resource pool removes its reference in vSphere. The resource pool is not deleted.
■ Add a Datastore in the vSphere Web Client on page 70
  You can add datastores to Big Data Extensions to make them available to Hadoop and HBase clusters. Big Data Extensions supports both shared datastores and local datastores.
■ Remove a Datastore in the vSphere Web Client on page 71
  You remove a datastore from Big Data Extensions when you no longer want the Hadoop clusters you create to use that datastore.
■ Add a Network in the vSphere Web Client on page 72
  You add networks to Big Data Extensions to make the IP addresses contained by those networks available to Hadoop and HBase clusters.
■ Reconfigure a Static IP Network in the vSphere Web Client on page 72
  You can reconfigure a Big Data Extensions static IP network by adding IP address segments to it. You might need to add IP address segments so that there is enough capacity for a cluster that you want to create.
■ Remove a Network in the vSphere Web Client on page 73
  You can remove an existing network from Big Data Extensions when you no longer need it. Removing an unused network frees the IP addresses for use by other services.

Add a Resource Pool with the Serengeti Command-Line Interface

You add resource pools to make them available for use by Hadoop clusters. Resource pools must be located at the top level of a cluster. Nested resource pools are not supported.
When you add a resource pool to Big Data Extensions it symbolically represents the actual vSphere resource pool as recognized by vCenter Server. This symbolic representation lets you use the Big Data Extensions resource pool name, instead of the full path of the resource pool in vCenter Server, in cluster specification files.
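For example, a cluster specification file might then refer to the resource pool by its Big Data Extensions name only (a minimal sketch; the rpNames attribute name is an assumption based on typical Serengeti cluster specification files and should be verified against a specification file exported from your own deployment):
"rpNames": [ "myRP" ],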
Prerequisites
Deploy Big Data Extensions.
Procedure
1 Access the Serengeti Command-Line Interface client.
2 Run the resourcepool add command.
The --vcrp parameter is optional.
This example adds a Serengeti resource pool named myRP to the vSphere rp1 resource pool that is contained by the cluster1 vSphere cluster.
resourcepool add --name myRP --vccluster cluster1 --vcrp rp1
What to do next
After you add a resource pool to Big Data Extensions, do not rename the resource pool in vSphere. If you rename it, you cannot perform Serengeti operations on clusters that use that resource pool.

Remove a Resource Pool with the Serengeti Command-Line Interface

You can remove resource pools from Serengeti that are not in use by a Hadoop cluster. You remove resource pools when you do not need them or if you want the Hadoop clusters you create in the Serengeti Management Server to be deployed under a different resource pool. Removing a resource pool removes its reference in vSphere. The resource pool is not deleted.
Procedure
1 Access the Serengeti Command-Line Interface client.
2 Run the resourcepool delete command.
If the command fails because the resource pool is referenced by a Hadoop cluster, you can use the
resourcepool list command to see which cluster is referencing the resource pool.
This example deletes the resource pool named myRP.
resourcepool delete --name myRP
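For example, if the delete fails, you can list the resource pools known to Serengeti and the clusters that reference them before retrying:
resourcepool list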

Add a Datastore in the vSphere Web Client

You can add datastores to Big Data Extensions to make them available to Hadoop and HBase clusters. Big Data Extensions supports both shared datastores and local datastores.
Prerequisites
Install Big Data Extensions.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 From the Inventory Lists, select Resources.
4 Expand the Inventory Lists, and select Datastores.
5 Click the Add (+) icon.
6 In the Name text box, type a name with which to identify the datastore in Big Data Extensions.
7 From the Type list, select the datastore type in vSphere.
Type     Description
Shared   Recommended for master nodes. Enables you to leverage vMotion, HA, and Fault Tolerance.
         NOTE If you do not specify shared storage and try to provision a cluster using vMotion, HA, or Fault Tolerance, the provisioning fails.
Local    Recommended for worker nodes. Throughput is scalable and the cost of storage is lower.
8 Select one or more vSphere datastores to make available to the Big Data Extensions datastore that you
are adding.
9 Click OK to save your changes.
The vSphere datastores are available for use by Hadoop and HBase clusters deployed within Big Data Extensions.

Remove a Datastore in the vSphere Web Client

You remove a datastore from Big Data Extensions when you no longer want the Hadoop clusters you create to use that datastore.
Prerequisites
Remove all Hadoop clusters associated with the datastore. See “Delete a Hadoop Cluster in the vSphere
Web Client,” on page 88.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 From the Inventory Lists, select Resources.
4 Expand Resources, select Inventory Lists, and select Datastores.
5 Select the datastore that you want to remove, right-click, and select Remove.
6 Click Yes to confirm.
If you did not remove the cluster that uses the datastore, you receive an error message indicating that the datastore cannot be removed because it is currently in use.
The datastore is removed from Big Data Extensions.

Add a Network in the vSphere Web Client

You add networks to Big Data Extensions to make the IP addresses contained by those networks available to Hadoop and HBase clusters.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add the network.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 From the Inventory Lists, select Resources.
4 Expand Resources, select Inventory Lists, and select Networks.
5 Click the Add (+) icon.
6 In the Name text box, type a name with which to identify the network resource in Big Data Extensions.
7 From the Port group name list, select the vSphere port group that you want to add to Big Data
Extensions.
8 Choose the type of addressing to use for the network: Use DHCP to obtain IP addresses or Use static
IP addresses.
9 If you chose Use static IP addresses in Step 8, enter one or more IP address ranges.
10 Click OK to save your changes.
The IP addresses of the network are available to Hadoop and HBase clusters that you create within Big Data Extensions.

Reconfigure a Static IP Network in the vSphere Web Client

You can reconfigure a Big Data Extensions static IP network by adding IP address segments to it. You might need to add IP address segments so that there is enough capacity for a cluster that you want to create.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add the network.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 From the Inventory Lists, select Resources.
4 Expand Resources, select Inventory Lists, and select Networks.
5 Select the static IP network to reconfigure, right-click, and select Add IP Range.
6 Click Add IP range, and enter the IP address information.
7 Click OK to save your changes.
IP address segments are added to the network.

Remove a Network in the vSphere Web Client

You can remove an existing network from Big Data Extensions when you no longer need it. Removing an unused network frees the IP addresses for use by other services.
Prerequisites
Remove clusters assigned to the network. See “Delete a Hadoop Cluster in the vSphere Web Client,” on page 88.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 From the Inventory Lists, click Resources.
4 Expand Resources, select Inventory Lists, and select Networks.
5 Select the network to remove, right-click, and select Remove.
6 Click Yes to confirm.
If you have not removed the cluster that uses the network, you receive an error message indicating that the network cannot be removed because it is currently in use.
The network is removed, and the IP addresses are available for use.

Chapter 7 Creating Hadoop and HBase Clusters

Big Data Extensions lets you create and deploy Hadoop and HBase clusters. A Hadoop or HBase cluster is a special type of computational cluster designed specifically for storing and analyzing large amounts of unstructured data in a distributed computing environment.
The resource requirements are different for clusters created with the Serengeti Command-Line Interface and the Big Data Extensions plug-in for the vSphere Web Client because the clusters use different default templates. The default clusters created through the Serengeti Command-Line Interface are targeted for Project Serengeti users and proof-of-concept applications, and are smaller than the Big Data Extensions plug-in templates, which are targeted for larger deployments for commercial use.
Additionally, some deployment configurations require more resources than other configurations. For example, if you create a Greenplum HD 1.2 cluster, you cannot use the SMALL size virtual machine. If you create a default MapR or Greenplum HD cluster through the Serengeti Command-Line Interface, at least 550GB of storage and 55GB of memory are recommended. For other Hadoop distributions, at least 350GB of storage and 35GB of memory are recommended.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the cluster's virtual machine automatic migration. Although this prevents vSphere from automatically migrating the virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing such management functions outside of the Big Data Extensions environment can make it impossible for you to perform some Big Data Extensions operations, such as disk failure recovery.
■ About Hadoop and HBase Cluster Deployment Types on page 76
  Big Data Extensions lets you deploy several types of Hadoop and HBase clusters. You need to know about the types of clusters that you can create.
■ About Cluster Topology on page 77
  You can improve workload balance across your cluster nodes, and improve performance and throughput, by specifying how Hadoop virtual machines are placed using topology awareness. For example, you can have separate data and compute nodes, and improve performance and throughput by placing the nodes on the same set of physical hosts.
■ About HBase Database Access on page 77
  Serengeti supports several methods of HBase database access.
■ Create a Hadoop or HBase Cluster in the vSphere Web Client on page 78
  After you complete deployment of the Hadoop distribution, you can create Hadoop and HBase clusters to process data. You can create multiple clusters in your Big Data Extensions environment, but your environment must meet all prerequisites and have adequate resources.
■ Create a Cluster with Topology Awareness with the Serengeti Command-Line Interface on page 80
  To achieve a balanced workload or to improve performance and throughput, you can control how Hadoop virtual machines are placed by adding topology awareness to the Hadoop clusters. For example, you can have separate data and compute nodes, and improve performance and throughput by placing the nodes on the same set of physical hosts.

About Hadoop and HBase Cluster Deployment Types

Big Data Extensions lets you deploy several types of Hadoop and HBase clusters. You need to know about the types of clusters that you can create.
You can create the following types of clusters.
■ Basic Hadoop Cluster. You can create a simple Hadoop deployment for proof of concept projects and other small scale data processing tasks using the basic Hadoop cluster.
■ HBase Cluster. You can create an HBase cluster. To run HBase MapReduce jobs, configure the HBase cluster to include JobTracker or TaskTracker nodes.
■ Data-Compute Separated Hadoop Cluster. You can separate the data and compute nodes in a Hadoop cluster, and you can control how nodes are placed on your environment's vSphere ESXi hosts.
■ Compute-Only Hadoop Cluster. You can create a compute-only cluster to run MapReduce jobs. Compute-only clusters run only MapReduce services that read data from external HDFS clusters and that do not need to store data.
■ Customized Cluster. You can use an existing cluster specification file to create clusters using the same configuration as your previously created clusters. You can also edit the file to customize the cluster configuration.
Hadoop Distributions Supporting MapReduce v1 and MapReduce v2 (YARN)
If the Hadoop distribution you use supports both MapReduce v1 and MapReduce v2 (YARN), the default Hadoop cluster configuration creates a MapReduce v2 cluster.
In addition, if you are using two different versions of a vendor's Hadoop distribution, and both versions support MapReduce v1 and MapReduce v2, the cluster you create using the latest version with the default Hadoop cluster will use MapReduce v2. Clusters you create with the earlier Hadoop version will use MapReduce v1. For example, if you have both Cloudera CDH 5 and CDH 4 installed within Big Data Extensions, clusters you create with CDH 5 will use MapReduce v2, and clusters you create with CDH 4 will use MapReduce v1.

About Cluster Topology

You can improve workload balance across your cluster nodes, and improve performance and throughput, by specifying how Hadoop virtual machines are placed using topology awareness. For example, you can have separate data and compute nodes, and improve performance and throughput by placing the nodes on the same set of physical hosts.
To get maximum performance out of your Hadoop or HBase cluster, configure your cluster so that it has awareness of the topology of your environment's host and network information. Hadoop performs better when it uses within-rack transfers, where more bandwidth is available, to off-rack transfers when assigning MapReduce tasks to nodes. HDFS can place replicas more intelligently to trade off performance and resilience. For example, if you have separate data and compute nodes, you can improve performance and throughput by placing the nodes on the same set of physical hosts.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the cluster's virtual machine automatic migration. Although this prevents vSphere from migrating the virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing such management functions outside of the Big Data Extensions environment might break the cluster's placement policy, such as the number of instances per host and the group associations. Even if you do not specify a placement policy, using vCenter Server to migrate clusters can break the default ROUNDROBIN placement policy constraints.
You can specify the following topology awareness configurations.
■ Hadoop Virtualization Extensions (HVE). Enhanced cluster reliability and performance provided by refined Hadoop replica placement, task scheduling, and balancer policies. Hadoop clusters implemented on a virtualized infrastructure have full awareness of the topology on which they are running when using HVE. To use HVE, your Hadoop distribution must support HVE and you must create and upload a topology rack-hosts mapping file.
■ RACK_AS_RACK. Standard topology for Apache Hadoop distributions. Only rack and host information are exposed to Hadoop. To use RACK_AS_RACK, create and upload a server topology file.
■ HOST_AS_RACK. Simplified topology for Apache Hadoop distributions. To avoid placing all HDFS data block replicas on the same physical host, each physical host is treated as a rack. Because data block replicas are never all placed on the same rack, this avoids the worst case scenario of a single host failure causing the complete loss of any data block. Use HOST_AS_RACK if your cluster uses a single rack, or if you do not have rack information with which to decide about topology configuration options.
■ None. No topology is specified.

About HBase Database Access

Serengeti supports several methods of HBase database access.
■ Log in to the client node virtual machine and run hbase shell commands.
■ Log in to the client node virtual machine and run HBase jobs by using the hbase command.
  hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 3
  The default Serengeti-deployed HBase cluster does not contain Hadoop JobTracker or Hadoop TaskTracker daemons. To run an HBase MapReduce job, you must deploy a customized cluster that includes JobTracker and TaskTracker nodes.
■ Use the HBase cluster’s client node RESTful Web Services, which listen on port 8080, by using the curl command.
  curl -I http://client_node_ip:8080/status/cluster
■ Use the HBase cluster’s client node Thrift gateway, which listens on port 9090.

Create a Hadoop or HBase Cluster in the vSphere Web Client

After you complete deployment of the Hadoop distribution, you can create Hadoop and HBase clusters to process data. You can create multiple clusters in your Big Data Extensions environment, but your environment must meet all prerequisites and have adequate resources.
Prerequisites
■ Deploy the Big Data Extensions vApp. See “Getting Started with Big Data Extensions,” on page 11.
■ Install the Big Data Extensions plug-in. See “Install the Big Data Extensions Plug-In,” on page 27.
■ Connect to a Serengeti Management Server. See “Connect to a Serengeti Management Server,” on page 28.
■ Configure one or more Hadoop distributions. See “Configure a Tarball-Deployed Hadoop Distribution,” on page 44 or “Configuring Yum and Yum Repositories,” on page 46.
■ Understand the topology configuration options that you want to use with your cluster.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions > Big Data Clusters.
3 In the Objects tab, click New Big Data Cluster.
4 Follow the prompts to create the new cluster. The table describes the information to enter for the cluster
that you want to create.
Option Description
Hadoop cluster name
Type a name to identify the cluster. The only valid characters for cluster names are alphanumeric and underscores. When you choose the cluster name, also consider the applicable vApp name. Together, the vApp and cluster names must be < 80 characters.
Hadoop distro
Select the Hadoop distribution. The list contains the default Apache Hadoop distribution for Big Data Extensions and the distributions that you added to your Big Data Extensions environment. The distribution names match the --name parameter's value that was passed to the config-distro.rb script when the Hadoop distribution was configured. For example, cdh4 and mapr.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS and FQDN settings, the cluster creation process might fail or the cluster is created but does not function.
Deployment type
Select the type of cluster you want to create.
■ Basic Hadoop Cluster
■ HBase Cluster
■ Compute-only Hadoop Cluster
■ Data/Compute Separation Hadoop Cluster
■ Customize
The type of cluster you create determines the available node group selections.
If you select Customize, you can load an existing cluster specification file.
DataMaster Node Group
The DataMaster node is a virtual machine that runs the Hadoop NameNode service. This node manages HDFS data and assigns tasks to Hadoop TaskTracker services deployed in the worker node group.
Select a resource template from the drop-down menu, or select Customize to customize a resource template.
For the master node, use shared storage so that you protect this virtual machine with vSphere HA and vSphere FT.
ComputeMaster Node Group
The ComputeMaster node is a virtual machine that runs the Hadoop JobTracker service. This node assigns tasks to Hadoop TaskTracker services deployed in the worker node group.
Select a resource template from the drop-down menu, or select Customize to customize a resource template.
For the master node, use shared storage so that you protect this virtual machine with vSphere HA and vSphere FT.
HBaseMaster Node Group (HBase cluster only)
The HBaseMaster node is a virtual machine that runs the HBase master service. This node orchestrates a cluster of one or more RegionServer slave nodes.
Select a resource template from the drop-down menu, or select Customize to customize a resource template.
For the master node, use shared storage so that you protect this virtual machine with vSphere HA and vSphere FT.
Worker Node Group
Worker nodes are virtual machines that run the Hadoop DataNode, TaskTracker, and HBase HRegionServer services. These nodes store HDFS data and execute tasks.
Select the number of nodes and the resource template from the drop-down menu, or select Customize to customize a resource template.
For worker nodes, use local storage.
NOTE You can add nodes to the worker node group by using Scale Out Cluster. You cannot reduce the number of nodes.
Client Node Group
A client node is a virtual machine that contains Hadoop client components. From this virtual machine you can access HDFS, submit MapReduce jobs, run Pig scripts, run Hive queries, and HBase commands.
Select the number of nodes and a resource template from the drop-down menu, or select Customize to customize a resource template.
NOTE You can add nodes to the client node group by using Scale Out Cluster. You cannot reduce the number of nodes.
Hadoop Topology
Select the topology configuration that you want the cluster to use.
■ RACK_AS_RACK
■ HOST_AS_RACK
■ HVE
■ NONE
If you do not see the topology configuration that you want, define it in a topology rack-hosts mapping file, and use the Serengeti Command-Line Interface to upload the file to the Serengeti Management Server. See
“About Cluster Topology,” on page 77
Network
Select one or more networks for the cluster to use.
■ To use one network for all traffic, select the network from the Network list.
■ To use separate networks for the management, HDFS, and MapReduce traffic, select Customize the HDFS network and MapReduce network, and select a network from each network list.
For optimal performance, use the same network for HDFS and MapReduce traffic in Hadoop and Hadoop+HBase clusters. HBase clusters use the HDFS network for traffic related to the HBase Master and HBase RegionServer services.
IMPORTANT You cannot configure multiple networks for clusters that use the MapR Hadoop distribution.
Resource Pools
Select one or more resource pools that you want the cluster to use.
VM Password
Choose how initial administrator passwords are assigned to the cluster's virtual machine nodes.
■ Use random password.
■ Set password.
To assign a custom initial administrator password to all the nodes in the cluster, choose Set password, and type and confirm the initial password. Passwords are from 8 to 128 characters, and include only alphanumeric characters ([0-9, a-z, A-Z]) and the following special characters: _ @ # $ % ^ & *
IMPORTANT If you set an initial administrator password, it is used for nodes that are created by future scaling and disk failure recovery operations. If you use the random password, nodes that are created by future scaling and disk failure recovery operations will use new, random passwords.
The Serengeti Management Server clones the template virtual machine to create the nodes in the cluster. When each virtual machine starts, the agent on that virtual machine pulls the appropriate Big Data Extensions software components to that node and deploys the software.

Create a Cluster with Topology Awareness with the Serengeti Command-Line Interface

To achieve a balanced workload or to improve performance and throughput, you can control how Hadoop virtual machines are placed by adding topology awareness to the Hadoop clusters. For example, you can have separate data and compute nodes, and improve performance and throughput by placing the nodes on the same set of physical hosts.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti Command-Line Interface.
2 (Optional) Run the topology list command to view the list of available topologies.
topology list
3 (Optional) If you want the cluster to use HVE or RACK_AS_RACK topologies, create a topology rack-hosts mapping file and upload the file to the Serengeti Management Server.
topology upload --fileName name_of_rack_hosts_mapping_file
4 Run the cluster create command to create the cluster.
cluster create --name cluster-name ... --topology {HVE|RACK_AS_RACK|HOST_AS_RACK}
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS and FQDN settings, the cluster creation process might fail or the cluster is created but does not function.
This example creates an HVE topology.
cluster create --name cluster-name --topology HVE --distro name_of_HVE-supported_distro
5 View the allocated nodes on each rack.
cluster list --name cluster-name --detail
Chapter 8 Managing Hadoop and HBase Clusters
You can use the vSphere Web Client to start and stop your Hadoop or HBase cluster and to modify cluster configuration. You can also manage a cluster using the Serengeti Command-Line Interface.
CAUTION Do not use vSphere management functions such as migrating cluster nodes to other hosts for clusters that you create with Big Data Extensions. Performing such management functions outside of the Big Data Extensions environment can make it impossible for you to perform some Big Data Extensions operations, such as disk failure recovery.
■ Stop and Start a Hadoop Cluster in the vSphere Web Client on page 84
  You can stop a running Hadoop cluster and start a stopped Hadoop cluster from the vSphere Web Client.
■ Scale Out a Hadoop Cluster in the vSphere Web Client on page 84
  You specify the number of nodes to use when you create Hadoop clusters. You can scale out the cluster by increasing the number of worker nodes and client nodes.
■ Scale CPU and RAM in the vSphere Web Client on page 85
  You can increase or decrease a cluster’s compute capacity to prevent CPU or memory resource contention among running jobs.
■ Reconfigure a Hadoop or HBase Cluster with the Serengeti Command-Line Interface on page 86
  You can reconfigure any Hadoop or HBase cluster that you create with Big Data Extensions.
■ Delete a Hadoop Cluster in the vSphere Web Client on page 88
  You can delete a cluster using the vSphere Web Client. Deleting a cluster removes it from both the inventory and the datastore.
■ About Resource Usage and Elastic Scaling on page 88
  Scaling lets you adjust the compute capacity of Hadoop data-compute separated clusters. When you enable elastic scaling for a Hadoop cluster, the Serengeti Management Server can stop and start compute nodes to match resource requirements to available resources. You can use manual scaling for more explicit cluster control.
■ Use Disk I/O Shares to Prioritize Cluster Virtual Machines in the vSphere Web Client on page 92
  You can set the disk I/O shares for the virtual machines running a cluster. Disk shares distinguish high-priority virtual machines from low-priority virtual machines.
■ About vSphere High Availability and vSphere Fault Tolerance on page 93
  The Serengeti Management Server leverages vSphere HA to protect the Hadoop master node virtual machine, which can be monitored by vSphere.
■ Recover from Disk Failure with the Serengeti Command-Line Interface Client on page 93
  If there is a disk failure in a Hadoop cluster, and the disk does not perform management roles such as NameNode, JobTracker, ResourceManager, HMaster, or ZooKeeper, you can recover by running the Serengeti cluster fix command.
■ Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client on page 94
  To perform troubleshooting or to run your management automation scripts, log in to Hadoop master, worker, and client nodes with password-less SSH from the Serengeti Management Server using SSH client tools such as SSH, PDSH, ClusterSSH, and Mussh.
■ Change the User Password on All of a Cluster's Nodes on page 94
  You can change the user password for all nodes in a cluster. The user password that you can change includes the serengeti and root users.

Stop and Start a Hadoop Cluster in the vSphere Web Client

You can stop a running Hadoop cluster and start a stopped Hadoop cluster from the vSphere Web Client.
Prerequisites
■ To stop a cluster, it must be running.
■ To start a cluster, it must be stopped.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 From the Inventory Lists, click Big Data Clusters.
4 Select the cluster to stop or start from the Hadoop Cluster Name column, and right-click to display the
Actions menu.
5 Select Shut Down Big Data Cluster to stop a running cluster, or select Start Big Data Cluster to start a
cluster.

Scale Out a Hadoop Cluster in the vSphere Web Client

You specify the number of nodes to use when you create Hadoop clusters. You can scale out the cluster by increasing the number of worker nodes and client nodes.
You can scale the cluster by using the vSphere Web Client or the Serengeti Command-Line Interface Client. The command-line interface provides more configuration options than the vSphere Web Client. See the VMware vSphere Big Data Extensions Command-Line Interface Guide.
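For example, from the Serengeti Command-Line Interface Client you might scale out the worker node group with a command along the following lines (a minimal sketch; the cluster resize command and its --nodeGroup and --instanceNum parameters are assumptions to verify against the Command-Line Interface Guide):
cluster resize --name cluster_name --nodeGroup worker --instanceNum 10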
You cannot decrease the number of worker and client nodes from the vSphere Web Client.
IMPORTANT Even if you changed the user password on the cluster's nodes, the changed password is not used for the new nodes that are created when you scale out a cluster. If you set the cluster's initial administrator password when you created the cluster, that initial administrator password is used for the new nodes. If you did not set the cluster's initial administrator password when you created the cluster, new random passwords are used for the new nodes.
Prerequisites
■ Verify that the cluster is running. See “Stop and Start a Hadoop Cluster in the vSphere Web Client,” on page 84.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 From the Inventory Lists, select Big Data Clusters.
4 From the Hadoop Cluster Name column, select the cluster to scale out.
5 Click the All Actions icon, and select Scale Out.
6 From the Node group list, select the worker or client node group to scale out.
If a node group has 0 nodes, it does not appear in the Node group list.
7 In the Instance number text box, type the target number of node instances to add, and click OK.
You cannot decrease the number of nodes. If you specify an instance number that is less than or equal to the current number of instances, a Scale Out Failed error occurs.
The cluster is updated to include the specified number of nodes.

Scale CPU and RAM in the vSphere Web Client

You can increase or decrease a cluster’s compute capacity to prevent CPU or memory resource contention among running jobs.
You can adjust compute resources without increasing the workload on the Master node. If increasing or decreasing the cluster's CPU or RAM is unsuccessful for a node, which is commonly because of insufficient resources being available, the node is returned to its original CPU or RAM setting.
All node types support CPU and RAM scaling, but do not scale a cluster's master node CPU or RAM because Big Data Extensions powers down the virtual machine during the scaling process.
When you scale your cluster’s CPU and RAM, the number of CPUs must be a multiple of the number of cores per socket, and you must scale the amount of RAM as a multiple of 4, allowing a minimum of 3748 MB.
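For example, on a host with two cores per socket, valid vCPU counts for a node are 2, 4, 6, and so on; for RAM, 3748 MB is the smallest allowed value, and larger values are specified in multiples of 4 MB, such as 4096 MB or 8192 MB. (This is an illustrative reading of the rule above; confirm the limits that apply to your environment.)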
Prerequisites
■ Verify that the cluster that you want to scale is running. See “Stop and Start a Hadoop Cluster in the vSphere Web Client,” on page 84.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 From the Inventory Lists, select Big Data Clusters.
4 From the Hadoop Cluster Name column, select the cluster that you want to scale up or down.
5 Click the All Actions icon, and select Scale Up/Down.
6 From the Node group drop-down menu, select the ComputeMaster, DataMaster, Worker, Client, or
Customized node group whose CPU or RAM you want to scale up or down.
7 Enter the number of vCPUs to use and the amount of RAM and click OK.
After applying new values for CPU and RAM, the cluster is placed into Maintenance mode as it applies the new values. You can monitor the status of the cluster as the new values are applied.

Reconfigure a Hadoop or HBase Cluster with the Serengeti Command-Line Interface

You can reconfigure any Hadoop or HBase cluster that you create with Big Data Extensions.
The cluster configuration is specified by attributes in Hadoop distribution configuration files such as core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, yarn-env.sh, yarn-site.xml, and hadoop-metrics.properties.
NOTE Always use the cluster config command to change the parameters specified by these configuration files. If you manually modify these files, your changes will be erased if the virtual machine is rebooted, or you use the cluster config, cluster start, cluster stop, or cluster resize commands.
Procedure
1 Use the cluster export command to export the cluster specification file for the cluster that you want to
reconfigure.
cluster export --name cluster_name --specFile file_path/cluster_spec_file_name
Option                   Description
cluster_name             Name of the cluster that you want to reconfigure.
file_path                The file system path at which to export the specification file.
cluster_spec_file_name   The name with which to label the exported cluster specification file.
2 Edit the configuration information located near the end of the exported cluster specification file.
If you are modeling your configuration file on existing Hadoop XML configuration files, use the
convert-hadoop-conf.rb conversion tool to convert Hadoop XML configuration files to the required
JSON format.
…
"configuration": {
  "hadoop": {
    "core-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
      // note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a sample:
      // "io.file.buffer.size": "4096"
    },
    "hdfs-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
    },
    "mapred-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
    },
    "hadoop-env.sh": {
      // "HADOOP_HEAPSIZE": "",
      // "HADOOP_NAMENODE_OPTS": "",
      // "HADOOP_DATANODE_OPTS": "",
      // "HADOOP_SECONDARYNAMENODE_OPTS": "",
      // "HADOOP_JOBTRACKER_OPTS": "",
      // "HADOOP_TASKTRACKER_OPTS": "",
      // "HADOOP_CLASSPATH": "",
      // "JAVA_HOME": "",
      // "PATH": "",
    },
    "log4j.properties": {
      // "hadoop.root.logger": "DEBUG, DRFA ",
      // "hadoop.security.logger": "DEBUG, DRFA ",
    },
    "fair-scheduler.xml": {
      // check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
      // "text": "the full content of fair-scheduler.xml in one line"
    },
    "capacity-scheduler.xml": {
      // check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
    }
  }
}
…
3 (Optional) If your Hadoop distribution’s JAR files are not in the $HADOOP_HOME/lib directory, add the
full path of the JAR file in $HADOOP_CLASSPATH to the cluster specification file.
This action lets the Hadoop daemons locate the distribution JAR files.
For example, the Cloudera CDH3 Hadoop Fair Scheduler JAR files are in /usr/lib/hadoop/contrib/fairscheduler/. Add the following to the cluster specification file to enable Hadoop to use the JAR files.
…
"configuration": {
  "hadoop": {
    "hadoop-env.sh": {
      "HADOOP_CLASSPATH": "/usr/lib/hadoop/contrib/fairscheduler/*:$HADOOP_CLASSPATH"
    },
    "mapred-site.xml": {
      "mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler"
      …
    },
    "fair-scheduler.xml": {
      …
    }
  }
}
…
4 Access the Serengeti Command-Line Interface.
5 Run the cluster config command to apply the new Hadoop configuration.
cluster config --name cluster_name --specFile file_path/cluster_spec_file_name
6 (Optional) Reset an existing configuration attribute to its default value.
a Remove the attribute from the cluster configuration file’s configuration section, or comment out the
attribute using double forward slashes (//).
b Re-run the cluster config command.

Delete a Hadoop Cluster in the vSphere Web Client

You can delete a cluster using the vSphere Web Client. Deleting a cluster removes it from both the inventory and the datastore.
When you create a cluster, Big Data Extensions creates a folder and a resource pool for each cluster, as well as resource pools for each node group within the cluster. When you delete a cluster all of these organizational folders and resource pools are also removed.
When you delete a cluster, it is removed from both the inventory and datastore.
You can delete a running cluster, a stopped cluster, or a cluster in an error state.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 In the object navigator, select Big Data Extensions.
3 In Inventory Lists, select Big Data Clusters.
4 From the Hadoop Cluster Name column, select the cluster to delete.
5 Click the All Actions icon, and select Delete Big Data Cluster.
The cluster and all the virtual machines it contains are removed from your Big Data Extensions environment.

About Resource Usage and Elastic Scaling

Scaling lets you adjust the compute capacity of Hadoop data-compute separated clusters. When you enable elastic scaling for a Hadoop cluster, the Serengeti Management Server can stop and start compute nodes to match resource requirements to available resources. You can use manual scaling for more explicit cluster control.
Manual scaling is appropriate for static environments where capacity planning can predict resource availability for workloads. Elastic scaling is best suited for mixed workload environments where resource requirements and availability fluctuate.
When you select manual scaling, Big Data Extensions disables elastic scaling. You can configure the target number of compute nodes for manual scaling. If you do not configure the target number of compute nodes, Big Data Extensions sets the number of active compute nodes to the current number of active compute nodes. If nodes become unresponsive, they remain in the cluster and the cluster operates with fewer functional nodes. In contrast, when you enable elastic scaling, Big Data Extensions manages the number of active TaskTracker nodes according to the range that you specify, replacing unresponsive or faulty nodes with live, responsive nodes.
For both manual and elastic scaling, Big Data Extensions, not vCenter Server, controls the number of active nodes. However, vCenter Server applies the usual reservations, shares, and limits to the cluster's resource pool according to the cluster's vSphere configuration. vSphere DRS operates as usual, allocating resources between competing workloads, which in turn influences how Big Data Extensions dynamically adjusts the number of active nodes in competing Hadoop clusters while elastic scaling is in effect.
Big Data Extensions also lets you adjust cluster nodes' access priority for datastores by using the vSphere Storage I/O Control feature. Clusters configured for HIGH I/O shares receive higher priority access than clusters with NORMAL priority. Clusters configured for NORMAL I/O shares receive higher priority access than clusters with LOW priority. In general, higher priority provides better disk I/O performance.
Scaling Modes
To change between manual and elastic scaling, you change the scaling mode.
■ MANUAL. Big Data Extensions disables elastic scaling. When you change to manual scaling, you can configure the target number of compute nodes. If you do not configure the target number of compute nodes, Big Data Extensions sets the number of active compute nodes to the current number of active compute nodes.
■ AUTO. Enables elastic scaling. Big Data Extensions manages the number of active compute nodes, maintaining the number of compute nodes in the range from the configured minimum to the configured maximum number of compute nodes in the cluster. If the minimum number of compute nodes is undefined, the lower limit is 0. If the maximum number of compute nodes is undefined, the upper limit is the number of available compute nodes.
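For example, the scaling mode and compute node limits are typically adjusted from the Serengeti Command-Line Interface with a command along the following lines (a minimal sketch; the cluster setParam command name and the --elasticityMode parameter are assumptions to verify against the Command-Line Interface Guide, while minComputeNodeNum and maxComputeNodeNum correspond to the parameter names described later in this section):
cluster setParam --name cluster_name --elasticityMode AUTO --minComputeNodeNum 3 --maxComputeNodeNum 10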
Elastic scaling operates on a per-host basis, at a node-level granularity. That is, the more compute nodes a Hadoop cluster has on a host, the finer the control that Big Data Extensions elasticity can exercise. The tradeoff is that the more compute nodes you have, the higher the overhead in terms of runtime resource cost, disk footprint, I/O requirements, and so on.
When resources are overcommitted, elastic scaling reduces the number of powered on compute nodes. Conversely, if the cluster receives all the resources it requested from vSphere, and Big Data Extensions determines that the cluster can make use of additional capacity, elastic scaling powers on additional compute nodes.
Resources can become overcommitted for many reasons, such as:
■ The compute nodes have lower resource entitlements than a competing workload, according to how vCenter Server applies the usual reservations, shares, and limits as configured for the cluster.
■ Physical resources are configured to be available, but another workload is consuming those resources.
In elastic scaling, Big Data Extensions has two different behaviors for deciding how many active compute nodes to maintain. In both behaviors, Big Data Extensions replaces unresponsive or faulty nodes with live, responsive nodes.
■ Variable. The number of active, healthy TaskTracker compute nodes is maintained from the configured minimum number of compute nodes to the configured maximum number of compute nodes. The number of active compute nodes varies as resource availability fluctuates.
■ Fixed. The number of active, healthy TaskTracker compute nodes is maintained at a fixed number when the same value is configured for the minimum and maximum number of compute nodes.
Default Cluster Scaling Parameter Values
When you create a cluster, its scaling configuration is as follows.
■ The cluster's scaling mode is MANUAL, for manual scaling.
■ The cluster's minimum number of compute nodes is -1. It appears as "Unset" in the Serengeti CLI displays. Big Data Extensions elastic scaling treats a minComputeNodeNum value of -1 as if it were zero (0).
■ The cluster's maximum number of compute nodes is -1. It appears as "Unset" in the Serengeti CLI displays. Big Data Extensions elastic scaling treats a maxComputeNodeNum value of -1 as if it were unlimited.
■ The cluster's target number of nodes is not applicable. Its value is -1. Big Data Extensions manual scaling operations treat a targetComputeNodeNum value of -1 as if it were unspecified upon a change to manual scaling.
Interactions Between Scaling and Other Cluster Operations
Some cluster operations cannot be performed while Big Data Extensions is actively scaling a cluster.
If you try to perform the following operations while Big Data Extensions is scaling a cluster in MANUAL mode, Big Data Extensions warns you that in the cluster's current state, the operation cannot be performed.
n  Concurrent attempt at manual scaling
n  Switch to AUTO mode while manual scaling operations are in progress
If a cluster is in AUTO mode for elastic scaling when you perform the following cluster operations on it, Big Data Extensions changes the scaling mode to MANUAL and changes the cluster to manual scaling. You can re-enable the AUTO mode for elastic scaling after the cluster operation finishes, except if you delete the cluster.
n  Delete the cluster
n  Repair the cluster
n  Stop the cluster
If a cluster is in AUTO mode for elastic scaling when you perform the following cluster operations on it, Big Data Extensions temporarily switches the cluster to MANUAL mode. When the cluster operation finishes, Big Data Extensions returns the scaling mode to AUTO, which re-enables elastic scaling.
n  Resize the cluster
n  Reconfigure the cluster
If Big Data Extensions is scaling a cluster when you perform an operation that changes the scaling mode to MANUAL, your requested operation waits until the scaling finishes, and then the requested operation begins.

Optimize Cluster Resource Usage with Elastic Scaling in the vSphere Web Client

You can specify the scaling mode of a cluster. Scaling lets you specify the number of nodes that the cluster can use, and whether it adds nodes or uses nodes within a targeted range.
When you enable elastic scaling for a cluster, Big Data Extensions optimizes cluster performance and use of nodes that have a Hadoop TaskTracker role.
When you set a cluster's scaling mode to AUTO, configure the minimum and maximum number of compute nodes. If you do not configure the minimum and maximum number of compute nodes, the previous settings are retained. When you set a cluster's scaling mode to MANUAL, configure the target number of compute nodes. If you do not configure the target number of compute nodes, Big Data Extensions sets the number of active compute nodes to the current number of active compute nodes.
In elastic scaling, Big Data Extensions has two different behaviors for deciding how many active compute nodes to maintain. In both behaviors, Big Data Extensions replaces unresponsive or faulty nodes with live, responsive nodes.
n  Variable. The number of active, healthy TaskTracker compute nodes is maintained from the configured minimum number of compute nodes to the configured maximum number of compute nodes. The number of active compute nodes varies as resource availability fluctuates.
n  Fixed. The number of active, healthy TaskTracker compute nodes is maintained at a fixed number when the same value is configured for the minimum and maximum number of compute nodes.
Prerequisites
n  Understand how elastic scaling and resource usage work. See “About Resource Usage and Elastic Scaling,” on page 88.
n  Verify that the cluster you want to optimize is data-compute separated. See “About Hadoop and HBase Cluster Deployment Types,” on page 76.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 In the object navigator select Big Data Extensions.
3 Under Inventory Lists click Big Data Clusters.
4 Select the cluster whose elasticity mode you want to set from the Hadoop Cluster Name column.
5 Click the All Actions icon, and select Set Elasticity Mode.
6 Specify the elasticity settings for the cluster that you want to modify.
Option Description
Elasticity mode Select the type of elasticity mode you want to use. You can choose manual or automatic.
Target compute nodes Specify the number of compute nodes the cluster should target for use. This option is applicable only to manual scaling (manual elasticity mode). If you do not specify the target number of compute nodes, the node setting remains unconfigured, and Big Data Extensions sets the number of active compute nodes to the current number of active compute nodes. NOTE A value of "Unset" or "-1" means that the node setting has not been configured and is not applicable.
Min compute nodes Specify the minimum number (the lower limit) of active compute nodes to maintain in the cluster. This option is applicable only to elastic scaling (automatic elasticity mode). To ensure that under contention elasticity keeps a cluster operating with more than a cluster’s initial default setting of zero compute nodes, configure the minimum number of compute nodes to a nonzero number.
Max compute nodes Specify the maximum number (the upper limit) of active compute nodes to maintain in the cluster. This option is applicable only to elastic scaling (automatic elasticity mode).
What to do next
Specify the cluster's access priority for datastores. See “Use Disk I/O Shares to Prioritize Cluster Virtual
Machines in the vSphere Web Client,” on page 92.

Schedule Fixed Elastic Scaling for a Hadoop Cluster

You can enable fixed, elastic scaling according to a preconfigured schedule. Scheduled fixed, elastic scaling provides more control than variable, elastic scaling while still improving efficiency, allowing explicit changes in the number of active compute nodes during periods of predictable usage.
For example, in an office with typical workday hours, there is likely a reduced load on a VMware View resource pool after the office staff goes home. You could configure scheduled fixed, elastic scaling to specify a greater number of compute nodes from 8 PM to 4 AM, when you know that the workload would otherwise be very light.
Prerequisites
From the Serengeti Command-Line Interface, enable the cluster for elastic scaling, and set the minComputeNodeNum and maxComputeNodeNum parameters to the same value: the number of active TaskTracker nodes that you want during the period of scheduled fixed elasticity. An example command follows.
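For example, for a cluster named cluster_A that should hold a fixed size of 10 TaskTracker nodes, the prerequisite setting might look like the following. This is a sketch only: the cluster name and node count are placeholders, and the exact cluster setParam syntax should be confirmed against the Command-Line Interface Guide for your release.
cluster setParam --name cluster_A --elasticityMode AUTO --minComputeNodeNum 10 --maxComputeNodeNum 10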
Procedure
1 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as user
serengeti.
2 Use any scheduling mechanism that you want to call
the /opt/serengeti/sbin/set_compute_node_num.sh script to set the number of active TaskTracker compute nodes that you want.
/opt/serengeti/sbin/set_compute_node_num.sh --name cluster_name --computeNodeNum num_TT_to_maintain
After the scheduling mechanism calls the set_compute_node_num.sh script, fixed, elastic scaling remains in effect with the configured number of active TaskTracker compute nodes until the next scheduling mechanism change or until a user changes the scaling mode or parameters in either the vSphere Web Client or the Serengeti Command-Line Interface.
This example shows how to use a crontab file on the Serengeti Management Server to schedule specific numbers of active TaskTracker compute nodes.
# cluster_A: use 20 active TaskTracker compute nodes from 11:00 to 16:00, and 30 compute nodes the rest of the day
00 11 * * * /opt/serengeti/sbin/set_compute_node_num.sh --name cluster_A --computeNodeNum 20 >> $HOME/schedule_elasticity.log 2>&1
00 16 * * * /opt/serengeti/sbin/set_compute_node_num.sh --name cluster_A --computeNodeNum 30 >> $HOME/schedule_elasticity.log 2>&1
# cluster_B: use 3 active TaskTracker compute nodes beginning at 10:00 every weekday
0 10 * * 1-5 /opt/serengeti/sbin/set_compute_node_num.sh --name cluster_B --computeNodeNum 3 >> $HOME/schedule_elasticity.log 2>&1
# cluster_C: reset the number of active TaskTracker compute nodes every 6 hours to 15
0 */6 * * * /opt/serengeti/sbin/set_compute_node_num.sh --name cluster_C --computeNodeNum 15 >> $HOME/schedule_elasticity.log 2>&1

Use Disk I/O Shares to Prioritize Cluster Virtual Machines in the vSphere Web Client

You can set the disk I/O shares for the virtual machines running a cluster. Disk shares distinguish high-priority virtual machines from low-priority virtual machines.
Disk shares represent a relative metric for controlling disk bandwidth to all virtual machines. The values are compared to the sum of all shares of all virtual machines on the server and, on an ESXi host, the service console. Big Data Extensions can adjust disk shares for all virtual machines in a cluster. Using disk shares, you can change a cluster's I/O bandwidth to improve the cluster's I/O performance.
For more information about using disk shares to prioritize virtual machines, see the VMware vSphere ESXi and vCenter Server documentation.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 In the object navigator select Big Data Extensions.
3 In the Inventory Lists click Big Data Clusters.
4 Select the cluster whose disk IO shares you want to set from the Hadoop Cluster Name column.
5 Click the Actions icon, and select Set Disk IO Share.
6 Specify a value to allocate a number of shares of disk bandwidth to the virtual machine running the
cluster.
Clusters configured for HIGH I/O shares receive higher priority access than those with NORMAL and LOW priorities, which provides better disk I/O performance. Disk shares are commonly set LOW for compute virtual machines and NORMAL for data virtual machines. The master node virtual machine is commonly set to NORMAL.
7 Click OK to save your changes.

About vSphere High Availability and vSphere Fault Tolerance

The Serengeti Management Server leverages vSphere HA to protect the Hadoop master node virtual machine, which can be monitored by vSphere.
When a Hadoop NameNode or JobTracker service stops unexpectedly, vSphere restarts the Hadoop virtual machine in another host, reducing unplanned downtime. If vSphere Fault Tolerance is configured and the master node virtual machine stops unexpectedly because of host failover or loss of network connectivity, the secondary node is used, without downtime.

Recover from Disk Failure with the Serengeti Command-Line Interface Client

If there is a disk failure in a Hadoop cluster, and the disk does not belong to a node performing management roles such as NameNode, JobTracker, ResourceManager, HMaster, or ZooKeeper, you can recover by running the Serengeti cluster fix command.
Big Data Extensions uses a large number of inexpensive disk drives for data storage (configured as Just a Bunch of Disks). If several disks fail, the Hadoop data node might shut down. Big Data Extensions lets you recover from disk failures.
Serengeti supports recovery from swap and data disk failure on all supported Hadoop distributions. Disks are recovered and started in sequence to avoid the temporary loss of multiple nodes at once. A new disk matches the corresponding failed disk’s storage type and placement policies.
The MapR distribution does not support recovery from disk failure by using the cluster fix command.
IMPORTANT Even if you changed the user password on the cluster's nodes, the changed password is not used for the new nodes that are created by the disk recovery operation. If you set the cluster's initial administrator password when you created the cluster, that initial administrator password is used for the new nodes. If you did not set the cluster's initial administrator password when you created the cluster, new random passwords are used for the new nodes.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster fix command.
The nodeGroup parameter is optional.
cluster fix --name cluster_name --disk [--nodeGroup nodegroup_name]
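For example, to repair failed data disks only in the worker node group of a cluster named mycluster (both names are placeholders for illustration):
cluster fix --name mycluster --disk --nodeGroup worker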

Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client

To perform troubleshooting or to run your management automation scripts, log in to Hadoop master, worker, and client nodes with password-less SSH from the Serengeti Management Server using SSH client tools such as SSH, PDSH, ClusterSSH, and Mussh.
You can use a user name and password authenticated login to connect to Hadoop cluster nodes over SSH. All deployed nodes are password-protected with either a random password or a user-specified password that was assigned when the cluster was created.
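For example, assuming a worker node with the placeholder IP address 10.1.1.15 and that the serengeti user account is configured on the node, you might connect from the Serengeti Management Server with standard SSH tools; pdsh can run the same check across several nodes at once. These commands are a sketch, not Big Data Extensions-specific syntax.
ssh serengeti@10.1.1.15
pdsh -w 10.1.1.15,10.1.1.16 'jps'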
Prerequisites
Use the vSphere Web Client to log in to vCenter Server, and verify that the Serengeti Management Server virtual machine is running.
Procedure
1 Right-click the Serengeti Management Server virtual machine and select Open Console.
The password for the Serengeti Management Server appears.
NOTE If the password scrolls off the console screen, press Ctrl+D to return to the command prompt.
2 Use the vSphere Web Client to log in to the Hadoop node.
The password for the root user appears on the virtual machine console in the vSphere Web Client.
3 Change the Hadoop node’s password by running the set-password -u command.
sudo /opt/serengeti/sbin/set-password -u

Change the User Password on All of a Cluster's Nodes

You can change the user password for all nodes in a cluster. The user passwords that you can change include those for the serengeti and root users.
You can change a user's password on all nodes within a given cluster.
IMPORTANT If you scale out or perform disk recovery operations on a cluster after you change the user password for the cluster's original nodes, the changed password is not used for the new cluster nodes that are created by the scale out or disk recovery operation. If you set the cluster's initial administrator password when you created the cluster, that initial administrator password is used for the new nodes. If you did not set the cluster's initial administrator password when you created the cluster, new random passwords are used for the new nodes.
Prerequisites
n  Deploy the Big Data Extensions vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on page 23.
n  Configure a Hadoop distribution to use with Big Data Extensions.
n  Create a cluster. See Chapter 7, “Creating Hadoop and HBase Clusters,” on page 75.
Procedure
1 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as user
serengeti.
2 Run the serengeti-ssh.sh script.
serengeti-ssh.sh cluster_name 'echo new_password | sudo passwd username --stdin'
This example changes the password for all nodes in the cluster labeled mycluster for the user serengeti to mypassword.
serengeti-ssh.sh mycluster 'echo mypassword | sudo passwd serengeti --stdin'
The password for the user account that you specify changes on all the nodes in the cluster.
Monitoring the Big Data Extensions Environment 9
You can monitor the status of Serengeti-deployed clusters, including their datastores, networks, and resource pools through the Serengeti Command-Line Interface. You can also view a list of available Hadoop distributions. Monitoring capabilities are also in the vSphere Web Client.
n  View Serengeti Management Server Initialization Status on page 97
You can view the initialization status of the Serengeti Management Server services, view error messages to help troubleshoot problems, and recover services that may not have successfully started.
n  View Provisioned Clusters in the vSphere Web Client on page 98
You can view the clusters deployed within Big Data Extensions, including information about whether the cluster is running, the type of Hadoop distribution used by a cluster, and the number and type of nodes in the cluster.
n  View Cluster Information in the vSphere Web Client on page 99
Use the vSphere Web Client to view virtual machines running each node, resource allocation, IP addresses, and storage information for each node in the Hadoop cluster.
n  Monitor the Hadoop Distributed File System Status in the vSphere Web Client on page 100
When you configure a Hadoop distribution to use with Big Data Extensions, the Hadoop software includes the Hadoop Distributed File System (HDFS). You can monitor the health and status of HDFS from the vSphere Web Client. The HDFS page lets you browse the Hadoop file system, view NameNode logs, and view cluster information including live, dead, and decommissioning nodes, and NameNode storage information.
n  Monitor MapReduce Status in the vSphere Web Client on page 101
The Hadoop software includes MapReduce, a software framework for distributed data processing. You can monitor MapReduce status in the vSphere Web Client. The MapReduce Web page includes information about scheduling, running jobs, retired jobs, and log files.
n  Monitor HBase Status in the vSphere Web Client on page 101
HBase is the Hadoop database. You can monitor the health and status of your HBase cluster, as well as the tables that it hosts, from the vSphere Web Client.

View Serengeti Management Server Initialization Status

You can view the initialization status of the Serengeti Management Server services, view error messages to help troubleshoot problems, and recover services that may not have successfully started.
Big Data Extensions may not successfully start for many reasons. The Serengeti Management Server Administration Portal lets you view the initialization status of the Serengeti services, view error messages for individual services to help troubleshoot problems, and recover services that may not have successfully started.
Prerequisites
n  Ensure that you know the IP address of the Serengeti Management Server to which you want to connect.
n  Ensure that you have login credentials for the Serengeti Management Server root user.
Procedure
1 Open a Web browser and go to the URL of the Serengeti Management Server Administration Portal.
https://management-server-ip-address:5480
2 Type root for the user name, type the password, and click Login.
3 Click the Summary tab.
The Serengeti Management Server services and their operational status are displayed in the Summary page.
4 Do one of the following.
Option Description
View Initialize Status Click Details. The Serengeti Server Setup dialog box lets you view the initialization status of the Serengeti Management Server. If the Serengeti Management Server fails to initialize, an error message with troubleshooting information displays. Once you resolve the error, a Retry button lets you restart the failed service.
View Chef Server Services Click the Chef Server tree control to expand the list of Chef services.
Recover a Stopped or Failed Service Click Recover to restart a stopped or failed service. If a service fails due to a configuration error, you must first resolve the problem that caused the service to fail before you can successfully recover the failed service.
Refresh Click Refresh to update the information displayed in the Summary page.
What to do next
If there is an error that you need to resolve, the troubleshooting topics provide solutions to problems you might encounter when using Big Data Extensions. See Chapter 12, “Troubleshooting,” on page 111.

View Provisioned Clusters in the vSphere Web Client

You can view the clusters deployed within Big Data Extensions, including information about whether the cluster is running, the type of Hadoop distribution used by a cluster, and the number and type of nodes in the cluster.
Prerequisites
n  Create one or more Hadoop or HBase clusters whose information you can view. See Chapter 7, “Creating Hadoop and HBase Clusters,” on page 75.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 In the Inventory Lists, select Big Data Clusters.
4 Select Big Data Clusters.
Information about all provisioned clusters appears in the right pane.
Table 9-1. Cluster Information
Option Description
Name Name of the cluster.
Status Status of the cluster.
Distribution Hadoop distribution in use by the cluster.
Elasticity Mode The elasticity mode in use by the cluster.
Disk IO Shares The disk I/O shares in use by the cluster.
Resources The resource pool or vCenter Server cluster in use by the Big Data cluster.
Information Number and type of nodes in the cluster.
Progress Status messages of actions being performed on the cluster.

View Cluster Information in the vSphere Web Client

Use the vSphere Web Client to view virtual machines running each node, resource allocation, IP addresses, and storage information for each node in the Hadoop cluster.
Prerequisites
n  Create one or more Hadoop clusters.
n  Start the Hadoop cluster.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 From the Inventory Lists, click Big Data Clusters.
4 Click a Big Data cluster.
Information about the cluster appears in the right pane, in the Nodes tab.
Table 9-2. Cluster Information
Column Description
Node Group Lists all nodes by type in the cluster.
VM Name Name of the virtual machine on which a node is running.
Management Network IP address of the virtual machine.
Host Host name, IP address, or Fully Qualified Domain Name (FQDN) of the ESXi host on which the virtual machine is running.
Status The virtual machine reports the following status types:
n  Not Exist. Status before you create a virtual machine instance in vSphere.
n  Powered On. The virtual machine is powered on after virtual disks and network are configured.
n  VM Ready. A virtual machine is started and IP is ready.
n  Service Ready. Services inside the virtual machine have been provisioned.
n  Bootstrap Failed. A service inside the virtual machine failed to provision.
n  Powered Off. The virtual machine is powered off.
Task Status of in-progress Serengeti operations.
5 From the Nodes tab, select a node group.
Information about the node group appears in the Node details panel of the Nodes tab.
Table 9-3. Cluster Node Details
Field Description
Node Group Name of the selected node group.
VM Name Name of the node group's virtual machine.
Management network Network used for management traffic.
HDFS Network Network used for HDFS traffic.
MapReduce Network Network used for MapReduce traffic.
Host Host name, IP address, or Fully Qualified Domain Name (FQDN) of the ESXi host on which the virtual machine is running.
vCPU Number of virtual CPUs assigned to the node.
RAM Amount of RAM used by the node. NOTE The RAM size that appears for each node shows the allocated RAM, not the RAM that is in use.
Storage The amount of storage allocated for use by the virtual machine running the node.
Error Indicates a node failure.

Monitor the Hadoop Distributed File System Status in the vSphere Web Client

When you configure a Hadoop distribution to use with Big Data Extensions, the Hadoop software includes the Hadoop Distributed File System (HDFS). You can monitor the health and status of HDFS from the vSphere Web Client. The HDFS page lets you browse the Hadoop file system, view NameNode logs, and view cluster information including live, dead, and decommissioning nodes, and NameNode storage information.
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
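In addition to the vSphere Web Client view, a quick health summary, including live and dead DataNodes and capacity, can be obtained from any cluster node that has the Hadoop client installed. This is a generic Hadoop command, not a Big Data Extensions feature, and its output varies by distribution and version.
hadoop dfsadmin -report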
Prerequisites
n  Create one or more Hadoop clusters. See Chapter 7, “Creating Hadoop and HBase Clusters,” on page 75.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 In the Inventory Lists, select Big Data Clusters.
4 Select the cluster whose HDFS status you want to view from the Big Data Cluster List tab.