This document supports the version of each product listed and
supports all subsequent versions until the document is
replaced by a new edition. To check for more recent editions
of this document, see http://www.vmware.com/support/pubs.
EN-001514-02
VMware vSphere Big Data Extensions Administrator's and User's Guide
You can find the most up-to-date technical documentation on the VMware Web site at:
http://www.vmware.com/support/
The VMware Web site also provides the latest product updates.
If you have comments about this documentation, submit your feedback to:
3401 Hillview Ave.
Palo Alto, CA 94304
www.vmware.com
Contents
About This Book
Updated Information
1 About VMware vSphere Big Data Extensions
    Getting Started with Big Data Extensions
    Big Data Extensions and Project Serengeti
    About Big Data Extensions Architecture
    Big Data Extensions Support for Hadoop Features By Distribution
    Hadoop Feature Support By Distribution
2 Installing Big Data Extensions
    System Requirements for Big Data Extensions
    Internationalization and Localization
    Deploy the Big Data Extensions vApp in the vSphere Web Client
    Install RPMs in the Serengeti Management Server Yum Repository
    Install the Big Data Extensions Plug-In
    Connect to a Serengeti Management Server
    Install the Serengeti Remote Command-Line Interface Client
    Access the Serengeti Command-Line Interface Using the Remote Command-Line Interface Client
3 Upgrading Big Data Extensions
    Prepare to Upgrade Big Data Extensions
    Upgrade Big Data Extensions Virtual Appliance
    Upgrade the Big Data Extensions Plug-in
    Upgrade the Serengeti Command-Line Interface
    Upgrade the CentOS 6.x Template
    Upgrade Big Data Extensions Virtual Machine Components
4 Managing Hadoop Distributions
    Hadoop Distribution Deployment Types
    Configure a Tarball-Deployed Hadoop Distribution
    Configuring Yum and Yum Repositories
    Create a Hadoop Template Virtual Machine using RHEL Server 6.x and VMware Tools
    Maintain a Customized Hadoop Template Virtual Machine
5 Managing the Big Data Extensions Environment
    Add Specific User Names to Connect to the Serengeti Management Server
    Change the Password for the Serengeti Management Server
    Configure vCenter Single Sign-On Settings for the Serengeti Management Server
    Create a User Name and Password for the Serengeti Command-Line Interface
    Stop and Start Serengeti Services
6 Managing vSphere Resources for Hadoop and HBase Clusters
    Add a Resource Pool with the Serengeti Command-Line Interface
    Remove a Resource Pool with the Serengeti Command-Line Interface
    Add a Datastore in the vSphere Web Client
    Remove a Datastore in the vSphere Web Client
    Add a Network in the vSphere Web Client
    Reconfigure a Static IP Network in the vSphere Web Client
    Remove a Network in the vSphere Web Client
7 Creating Hadoop and HBase Clusters
    About Hadoop and HBase Cluster Deployment Types
    About Cluster Topology
    About HBase Database Access
    Create a Hadoop or HBase Cluster in the vSphere Web Client
    Create a Cluster with Topology Awareness with the Serengeti Command-Line Interface
8 Managing Hadoop and HBase Clusters
    Stop and Start a Hadoop Cluster in the vSphere Web Client
    Scale Out a Hadoop Cluster in the vSphere Web Client
    Scale CPU and RAM in the vSphere Web Client
    Reconfigure a Hadoop or HBase Cluster with the Serengeti Command-Line Interface
    Delete a Hadoop Cluster in the vSphere Web Client
    About Resource Usage and Elastic Scaling
    Use Disk I/O Shares to Prioritize Cluster Virtual Machines in the vSphere Web Client
    About vSphere High Availability and vSphere Fault Tolerance
    Recover from Disk Failure with the Serengeti Command-Line Interface Client
    Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client
    Change the User Password on All of a Cluster's Nodes
9 Monitoring the Big Data Extensions Environment
    View Serengeti Management Server Initialization Status
    View Provisioned Clusters in the vSphere Web Client
    View Cluster Information in the vSphere Web Client
    Monitor the Hadoop Distributed File System Status in the vSphere Web Client
    Monitor MapReduce Status in the vSphere Web Client
    Monitor HBase Status in the vSphere Web Client
10 Using Hadoop Clusters from the Serengeti Command-Line Interface
    Run HDFS Commands with the Serengeti Command-Line Interface
    Run MapReduce Jobs with the Serengeti Command-Line Interface
    Run Pig and PigLatin Scripts with the Serengeti Command-Line Interface
    Run Hive and Hive Query Language Scripts with the Serengeti Command-Line Interface
11 Accessing Hive Data with JDBC or ODBC
    Configure Hive to Work with JDBC
    Configure Hive to Work with ODBC
12 Troubleshooting
    Log Files for Troubleshooting
    Configure Serengeti Logging Levels
    Collect Log Files for Troubleshooting
    Big Data Extensions Virtual Appliance Upgrade Fails
    Troubleshooting Cluster Creation Failures
    Cannot Restart or Reconfigure a Cluster After Changing Its Distribution
    Cannot Restart or Reconfigure a Cluster Whose Time Is Not Synchronized
    Virtual Machine Cannot Get IP Address
    vCenter Server Connections Fail to Log In
    SSL Certificate Error When Connecting to Non-Serengeti Server with the vSphere Console
    Serengeti Operations Fail After You Rename a Resource in vSphere
    A New Plug-In Instance with the Same or Earlier Version Number as a Previous Plug-In Instance Does Not Load
    MapReduce Job Fails to Run and Does Not Appear In the Job History
    Cannot Submit MapReduce Jobs for Compute-Only Clusters with External Isilon HDFS
    MapReduce Job Stops Responding on a PHD or CDH4 YARN Cluster
    Unable to Connect the Big Data Extensions Plug-In to the Serengeti Server
    Cannot Perform Serengeti Operations after Deploying Big Data Extensions
    Host Name and FQDN Do Not Match for Serengeti Management Server
    Upgrade Cluster Error When Using Cluster Created in Earlier Version of Big Data Extensions
Index
About This Book
VMware vSphere Big Data Extensions Administrator's and User's Guide describes how to install
Big Data Extensions within your vSphere environment, and how to manage and monitor Hadoop and
HBase clusters using the Big Data Extensions plug-in for vSphere Web Client.
VMware vSphere Big Data Extensions Administrator's and User's Guide also describes how to perform Hadoop
and HBase operations using the Serengeti Command-Line Interface Client, which provides a greater degree
of control for certain system management and Big Data cluster creation tasks.
Intended Audience
This guide is for system administrators and developers who want to use Big Data Extensions to deploy and
manage Hadoop clusters. To successfully work with Big Data Extensions, you should be familiar with
VMware® vSphere® and Hadoop and HBase deployment and operation.
VMware Technical Publications Glossary
VMware Technical Publications provides a glossary of terms that might be unfamiliar to you. For definitions
of terms as they are used in VMware technical documentation, go to
http://www.vmware.com/support/pubs.
Updated Information
This VMware vSphere Big Data Extensions Administrator's and User's Guide is updated with each release of the
product or when necessary.
This table provides the update history of the VMware vSphere Big Data Extensions Administrator's and User's Guide.

| Revision | Description |
|---|---|
| EN-001514-02 | Enhanced the description of how to access the upgrade source in the topic “Download the Upgrade Source and Accept the License Agreement,” on page 36. |
| EN-001514-01 | Expanded the information on preparing to upgrade and upgrading the Big Data Extensions vApp in Chapter 3, “Upgrading Big Data Extensions,” on page 33. |
| EN-001514-00 | Initial release. |
Chapter 1 About VMware vSphere Big Data Extensions
VMware® vSphere™ Big Data Extensions lets you deploy and centrally operate Hadoop and HBase clusters
running on VMware vSphere. Big Data Extensions simplifies the Hadoop and HBase deployment and
provisioning process, and gives you a real-time view of the running services and the status of their virtual
hosts. It provides a central place from which to manage and monitor your Hadoop and HBase clusters, and
incorporates a full range of tools to help you optimize cluster performance and utilization.
• Getting Started with Big Data Extensions on page 11
  VMware vSphere Big Data Extensions lets you deploy Hadoop and HBase clusters. The tasks in this section describe how to set up vSphere for use with Big Data Extensions, deploy the Big Data Extensions vApp, access the vCenter Server and command-line interface (CLI) administrative consoles, and configure a Hadoop distribution for use with Big Data Extensions.
• Big Data Extensions and Project Serengeti on page 13
  Big Data Extensions runs on top of Project Serengeti, the open source project initiated by VMware to automate the deployment and management of Hadoop and HBase clusters on virtual environments such as vSphere.
• About Big Data Extensions Architecture on page 14
  The Serengeti Management Server and Hadoop Template virtual machine work together to configure and provision Hadoop and HBase clusters.
• Big Data Extensions Support for Hadoop Features By Distribution on page 14
  Big Data Extensions provides different levels of feature support depending on the Hadoop distribution and version that you use.
• Hadoop Feature Support By Distribution on page 16
  Each Hadoop distribution and version provides differing feature support. Learn which Hadoop distributions support which features.
Getting Started with Big Data Extensions
VMware vSphere Big Data Extensions lets you deploy Hadoop and HBase clusters. The tasks in this section
describe how to set up vSphere for use with Big Data Extensions, deploy the Big Data Extensions vApp,
access the vCenter Server and command-line interface (CLI) administrative consoles, and configure a
Hadoop distribution for use with Big Data Extensions.
Prerequisites
• Understand what Project Serengeti and Big Data Extensions are so that you know how they fit into your Big Data workflow and vSphere environment. See “Big Data Extensions and Project Serengeti,” on page 13.
• Verify that the Big Data Extensions features that you want to use, such as data-compute separated clusters and elastic scaling, are supported by Big Data Extensions for the Hadoop distribution that you want to use. See “Big Data Extensions Support for Hadoop Features By Distribution,” on page 14.
• Understand which features are supported by your Hadoop distribution. See “Hadoop Feature Support By Distribution,” on page 16.
Procedure
1 Do one of the following.
  • Install Big Data Extensions for the first time. Review the system requirements, install vSphere, and install the Big Data Extensions components: Big Data Extensions vApp, Big Data Extensions plug-in for vCenter Server, and Serengeti Remote Command-Line Interface Client. See Chapter 2, “Installing Big Data Extensions,” on page 19.
  • Upgrade Big Data Extensions from a previous version. Perform the upgrade steps. See Chapter 3, “Upgrading Big Data Extensions,” on page 33.
2 (Optional) Install and configure a distribution other than Apache Hadoop for use with Big Data Extensions.
  Apache Hadoop is included in the Serengeti Management Server, but you can use any Hadoop distribution that Big Data Extensions supports. See Chapter 4, “Managing Hadoop Distributions,” on page 43.
What to do next
After you have successfully installed and configured your Big Data Extensions environment, you can perform the following additional tasks, in any order.
• Stop and start the Serengeti services, create user accounts, manage passwords, and log in to cluster nodes to perform troubleshooting. See Chapter 5, “Managing the Big Data Extensions Environment,” on page 63.
• Manage the vSphere resource pools, datastores, and networks that you use to create Hadoop and HBase clusters. See Chapter 6, “Managing vSphere Resources for Hadoop and HBase Clusters,” on page 69.
• Create, provision, and manage Hadoop and HBase clusters. See Chapter 7, “Creating Hadoop and HBase Clusters,” on page 75 and Chapter 8, “Managing Hadoop and HBase Clusters,” on page 83.
• Monitor the status of the clusters that you create, including their datastores, networks, and resource pools, through the vSphere Web Client and the Serengeti Command-Line Interface. See Chapter 9, “Monitoring the Big Data Extensions Environment,” on page 97.
• On your Big Data clusters, run HDFS commands, Hive and Pig scripts, and MapReduce jobs, and access Hive data. See Chapter 10, “Using Hadoop Clusters from the Serengeti Command-Line Interface,” on page 103.
• If you encounter any problems when using Big Data Extensions, see Chapter 12, “Troubleshooting,” on page 111.
Big Data Extensions and Project Serengeti
Big Data Extensions runs on top of Project Serengeti, the open source project initiated by VMware to
automate the deployment and management of Hadoop and HBase clusters on virtual environments such as
vSphere.
Big Data Extensions and Project Serengeti provide the following components.

Project Serengeti
An open source project initiated by VMware, Project Serengeti lets users deploy and manage Hadoop and Big Data clusters in a vCenter Server managed environment. The major components are the Serengeti Management Server, which provides cluster provisioning, software configuration, and management services; an elastic scaling framework; and a command-line interface. Project Serengeti is made available under the Apache 2.0 license, under which anyone can modify and redistribute Project Serengeti according to the terms of the license.

Serengeti Management Server
Provides the framework and services to run Big Data clusters on vSphere. The Serengeti Management Server performs resource management, policy-based virtual machine placement, cluster provisioning, software configuration management, and environment monitoring.

Serengeti Command-Line Interface Client
The command-line interface (CLI) client provides a comprehensive set of tools and utilities with which to monitor and manage your Big Data deployment. If you are using the open source version of Serengeti without Big Data Extensions, the CLI is the only interface through which you can perform administrative tasks.

Big Data Extensions
The commercial version of the open source Project Serengeti from VMware, Big Data Extensions, is delivered as a vCenter Server Appliance. Big Data Extensions includes all the Project Serengeti functions and the following additional features and components.
• Enterprise level support from VMware.
• Hadoop distribution from the Apache community.
  NOTE VMware provides the Hadoop distribution as a convenience but does not provide enterprise-level support. The Apache Hadoop distribution is supported by the open source community.
• The Big Data Extensions plug-in, a graphical user interface integrated with vSphere Web Client. This plug-in lets you perform common Hadoop infrastructure and cluster management administrative tasks.
• Elastic scaling lets you optimize cluster performance and utilization of physical compute resources in a vSphere environment. Elasticity-enabled clusters start and stop virtual machines, adjusting the number of active compute nodes based on configuration settings that you specify, to optimize resource consumption. Elasticity is ideal in a mixed workload environment to ensure that workloads can efficiently share the underlying physical resources while high-priority jobs are assigned sufficient resources.

About Big Data Extensions Architecture
The Serengeti Management Server and Hadoop Template virtual machine work together to configure and
provision Hadoop and HBase clusters.
Big Data Extensions performs the following steps to deploy a Hadoop or HBase cluster.
1 The Serengeti Management Server searches for ESXi hosts with sufficient resources to operate the cluster based on the configuration settings that you specify, and then selects the ESXi hosts on which to place Hadoop virtual machines.
2 The Serengeti Management Server sends a request to the vCenter Server to clone and configure virtual machines to use with the Hadoop or HBase cluster.
3 The Serengeti Management Server configures the operating system and network parameters for the new virtual machines.
4 Each virtual machine downloads the Hadoop software packages and installs them by applying the distribution and installation information from the Serengeti Management Server.
5 The Serengeti Management Server configures the Hadoop parameters for the new virtual machines based on the cluster configuration settings that you specify.
6 The Hadoop services are started on the new Hadoop virtual machines, at which point you have a running cluster based on your configuration settings.
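
The host selection in step 1 can be pictured with a small sketch. This is only an illustration of filtering hosts by free resources. It is not the placement algorithm that the Serengeti Management Server uses, and the Host class, its fields, and the select_hosts function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical view of an ESXi host's free capacity."""
    name: str
    free_cpu_mhz: int
    free_memory_mb: int


def select_hosts(hosts, node_count, cpu_mhz_per_node, memory_mb_per_node):
    """Return one host name per node, or raise if resources fall short."""
    # Track remaining capacity separately so the input objects are not modified
    # and one host can receive several nodes.
    remaining = {h.name: [h.free_cpu_mhz, h.free_memory_mb] for h in hosts}
    placements = []
    for _ in range(node_count):
        candidates = [
            name for name, (cpu, mem) in remaining.items()
            if cpu >= cpu_mhz_per_node and mem >= memory_mb_per_node
        ]
        if not candidates:
            raise RuntimeError("not enough ESXi host resources for the cluster")
        # Illustrative policy: prefer the candidate with the most free memory.
        best = max(candidates, key=lambda name: remaining[name][1])
        remaining[best][0] -= cpu_mhz_per_node
        remaining[best][1] -= memory_mb_per_node
        placements.append(best)
    return placements
```

A real placement decision also considers datastores, networks, and the cluster's placement policies, which this sketch leaves out.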
Big Data Extensions Support for Hadoop Features By Distribution
Big Data Extensions provides different levels of feature support depending on the Hadoop distribution and
version that you use.
Support for Hadoop MapReduce v1 Distribution Features
Big Data Extensions provides differing levels of feature support depending on the Hadoop distribution and
version that you configure for use. Table 1-1 lists the supported Hadoop MapReduce v1 distributions and
indicates which features are supported when using the distribution with Big Data Extensions.
Table 1‑1. Big Data Extensions Feature Support for Hadoop MapReduce v1 Distributions (Continued)

| Feature | Apache Hadoop | Cloudera | Hortonworks | Intel | MapR |
|---|---|---|---|---|---|
| Hadoop Topology Configuration | Yes | Yes | Yes | Yes | No |
| Run Hadoop Commands from the CLI | Yes | No | No | No | No |
| Hadoop Virtualization Extensions (HVE) | Yes | No | Yes | No | No |
| vSphere HA | Yes | Yes | Yes | Yes | Yes |
| Service Level vSphere HA | Yes | See “About Service Level vSphere HA for Cloudera,” on page 16 | Yes | Yes | No |
| vSphere FT | Yes | Yes | Yes | Yes | Yes |

Support for Hadoop MapReduce v2 (YARN) Distribution Features
Big Data Extensions provides differing levels of feature support depending on the Hadoop distribution and
version that you configure for use. Table 1-2 lists the supported Hadoop MapReduce v2 distributions and
indicates which features are supported when using the distribution with Big Data Extensions.
Table 1‑2. Big Data Extensions Feature Support for Hadoop MapReduce v2 (YARN) Distributions

| Feature | Apache Bigtop | Apache Hadoop | Cloudera | Cloudera | Hortonworks | Pivotal |
|---|---|---|---|---|---|---|
| Version | 0.7.0 | 2.0 | CDH4 | CDH5 | HDP 2.0, 2.1 | PHD 1.1, 2.0 |
| Automatic Deployment | Yes | Yes | Yes | Yes | Yes | Yes |
| Scale Out | Yes | Yes | Yes | Yes | Yes | Yes |
| Create Cluster with Multiple Networks | Yes | Yes | Yes | Yes | Yes | Yes |
| Data-Compute Separation | Yes | Yes | Yes | Yes | Yes | Yes |
| Compute-only | Yes | Yes | Yes | Yes | Yes | Yes |
| Elastic Scaling of Compute Nodes | Yes | Yes | No when using MapReduce 2 | No when using MapReduce 2 | Yes | No |
| Hadoop Configuration | Yes | Yes | Yes | Yes | Yes | Yes |
| Hadoop Topology Configuration | Yes | Yes | Yes | Yes | Yes | Yes |
| Run Hadoop Commands from the CLI | No | No | No | No | No | No |
| Hadoop Virtualization Extensions (HVE) | Support only for HDFS | Support only for HDFS | No | Support only for HDFS | Support only for HDFS. HDP 1.3 provides full support. | Yes |
| vSphere HA | No | No | No | No | No | No |
| Service Level vSphere HA | No | No | See “About Service Level vSphere HA for Cloudera,” on page 16 | See “About Service Level vSphere HA for Cloudera,” on page 16 | No | No |
| vSphere FT | No | No | No | No | No | No |

About Service Level vSphere HA for Cloudera
The Cloudera distributions offer the following support for Service Level vSphere HA.
• Cloudera using MapReduce v1 provides service level vSphere HA support for JobTracker.
• Cloudera provides its own service level HA support for NameNode through HDFS2.
Hadoop Feature Support By Distribution
Each Hadoop distribution and version provides differing feature support. Learn which Hadoop
distributions support which features.
Hadoop Features
The table illustrates which Hadoop distributions support which features.
Table 1‑3. Hadoop Feature Support

| Feature | Apache Hadoop | Apache Bigtop | Cloudera | Hortonworks | Hortonworks | Intel | MapR | Pivotal |
|---|---|---|---|---|---|---|---|---|
| Version | 1.2 | 2.0 | CDH4, CDH5 | HDP 1.3 | HDP 2.0-2.1 | 2.5.1, 3.0.2 | 3.0.2-3.1.0 | PHD 1.1, 2.0 |
| HDFS1 | Yes | Yes | No | Yes | Yes | Yes | No | No |
| HDFS2 | No | Yes | Yes | No | No | No | No | Yes |
| MapReduce v1 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No |
| MapReduce v2 (YARN) | No | Yes | Yes | No | Yes | No | No | Yes |
| Pig | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Hive | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Hive Server | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| HBase | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| ZooKeeper | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |

Chapter 2 Installing Big Data Extensions
To install Big Data Extensions so that you can create and provision Hadoop and HBase clusters, you must
install the Big Data Extensions components in the order described.
Procedure
1 System Requirements for Big Data Extensions on page 20
  Before you begin the Big Data Extensions deployment tasks, your system must meet all of the prerequisites for vSphere, clusters, networks, storage, hardware, and licensing.
2 Internationalization and Localization on page 22
  Big Data Extensions supports internationalization (I18N) level 1. However, there are resources you specify that do not provide UTF-8 support. You can use only ASCII attribute names consisting of alphanumeric characters and underscores (_) for these resources.
3 Deploy the Big Data Extensions vApp in the vSphere Web Client on page 23
  Deploying the Big Data Extensions vApp is the first step in getting your Hadoop cluster up and running with vSphere Big Data Extensions.
4 Install RPMs in the Serengeti Management Server Yum Repository on page 26
  Install the wsdl4j and mailx RPM packages within the Serengeti Management Server's internal Yum repository.
5 Install the Big Data Extensions Plug-In on page 27
  To enable the Big Data Extensions user interface for use with a vCenter Server Web Client, register the plug-in with the vSphere Web Client. The Big Data Extensions graphical user interface is supported only when using vSphere Web Client 5.1 and later. If you install Big Data Extensions on vSphere 5.0, perform all administrative tasks using the Serengeti Command-Line Interface Client.
6 Connect to a Serengeti Management Server on page 28
  To use the Big Data Extensions plug-in to manage and monitor Big Data clusters and Hadoop distributions, you must connect the Big Data Extensions plug-in to the Serengeti Management Server in your Big Data Extensions deployment.
7 Install the Serengeti Remote Command-Line Interface Client on page 29
  Although the Big Data Extensions Plug-in for vSphere Web Client supports basic resource and cluster management tasks, you can perform a greater number of the management tasks using the Serengeti Command-Line Interface Client.
8 Access the Serengeti Command-Line Interface Using the Remote Command-Line Interface Client on page 30
  You can access the Serengeti Command-Line Interface using the Serengeti Remote Command-Line Interface Client. The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management Server to deploy, manage, and use Hadoop.
What to do next
If you want to create clusters on any Hadoop distribution other than Apache Hadoop, which is included in
the Serengeti Management Server, install and configure the distribution for use with Big Data Extensions.
See Chapter 4, “Managing Hadoop Distributions,” on page 43.
System Requirements for Big Data Extensions
Before you begin the Big Data Extensions deployment tasks, your system must meet all of the prerequisites
for vSphere, clusters, networks, storage, hardware, and licensing.
Big Data Extensions requires that you install and configure vSphere and that your environment meets
minimum resource requirements. Make sure that you have licenses for the VMware components of your
deployment.

vSphere Requirements
Before you install Big Data Extensions, set up the following VMware products.
• Install vSphere 5.0 (or later) Enterprise or Enterprise Plus.
  NOTE The Big Data Extensions graphical user interface is supported only when using vSphere Web Client 5.1 and later. If you install Big Data Extensions on vSphere 5.0, perform all administrative tasks using the Serengeti Command-Line Interface.
• When installing Big Data Extensions on vSphere 5.1 or later, use VMware® vCenter™ Single Sign-On to provide user authentication. When logging in to vSphere 5.1 or later, you pass authentication to the vCenter Single Sign-On server, which you can configure with multiple identity sources such as Active Directory and OpenLDAP. On successful authentication, your user name and password are exchanged for a security token that is used to access vSphere components such as Big Data Extensions.
• Configure all ESXi hosts to use the same Network Time Protocol (NTP) server.
• On each ESXi host, add the NTP server to the host configuration, and from the host configuration's Startup Policy list, select Start and stop with host. The NTP daemon ensures that time-dependent processes occur in sync across hosts.

Cluster Settings
Configure your cluster with the following settings.
• Enable vSphere HA and VMware vSphere® Distributed Resource Scheduler™.
• Enable Host Monitoring.
• Enable Admission Control and set the policy you want. The default policy is to tolerate one host failure.
• Set the virtual machine restart priority to High.
• Set the virtual machine monitoring to virtual machine and Application Monitoring.
• Set the Monitoring sensitivity to High.
• Enable vMotion and Fault Tolerance Logging.
• All hosts in the cluster have Hardware VT enabled in the BIOS.
• The Management Network VMkernel Port has vMotion and Fault Tolerance Logging enabled.

Network Settings
Big Data Extensions can deploy clusters on a single network or use multiple networks. The environment determines how Port Groups that are attached to NICs are configured and which network backs each Port Group.
You can use either a vSwitch or vSphere Distributed Switch (vDS) to provide the Port Group backing a Serengeti cluster. vDS acts as a single virtual switch across all attached hosts, while a vSwitch is per-host and requires the Port Group to be configured manually.
When configuring your networks to use with Big Data Extensions, verify that the following ports are open as listening ports.
• Ports 8080 and 8443 are used by the Big Data Extensions plug-in user interface and the Serengeti Command-Line Interface Client.
• Port 5480 is used by vCenter Single Sign-On for monitoring and management.
• Port 22 is used by SSH clients.
• To avoid opening a network firewall port to access Hadoop services, log in to the Hadoop client node and access your cluster from that node.
• To connect to the Internet (for example, to create an internal Yum repository from which to install Hadoop distributions), you may use a proxy.
• To enable communications, be sure that firewalls and Web filters do not block the Serengeti Management Server or other Serengeti nodes.
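
One way to confirm that the listening ports named above are reachable from a client machine is a short TCP connection test. The following is a minimal sketch that uses only the Python standard library. The check_ports function and the BDE_PORTS constant are names made up for this example and are not part of Big Data Extensions.

```python
import socket

# Ports named in the network settings above.
BDE_PORTS = (8080, 8443, 5480, 22)


def check_ports(host, ports, timeout=3.0):
    """Return a dict mapping each port to True if a TCP connection succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, calling check_ports with the host name of your Serengeti Management Server and BDE_PORTS returns one entry per port. A False entry means only that the connection attempt failed from the machine that ran the check. It does not tell you whether a firewall, a Web filter, or a stopped service is the cause.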

Direct Attached Storage
Attach and configure Direct Attached Storage on the physical controller to present each disk separately to the operating system. This configuration is commonly described as Just A Bunch Of Disks (JBOD). Create VMFS Datastores on Direct Attached Storage using the following disk drive recommendations.
• 8-12 disk drives per host. The more disk drives per host, the better the performance.
• 1-1.5 disk drives per processor core.
• 7,200 RPM Serial ATA disk drives.

Resource Requirements for the vSphere Management Server and Templates
• Resource pool with at least 27.5GB RAM.
• 40GB or more (recommended) disk space for the management server and Hadoop template virtual disks.

Resource Requirements for the Hadoop Cluster
• Datastore free space is not less than the total size needed by the Hadoop cluster, plus swap disks for each Hadoop node that are equal to the memory size requested.
• Network configured across all relevant ESXi hosts, with connectivity to the network in use by the management server.
• vSphere HA is enabled for the master node if vSphere HA protection is needed. To use vSphere HA or vSphere FT to protect the Hadoop master node, you must use shared storage.
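
The datastore free-space rule for the Hadoop cluster is simple arithmetic: the disk space requested for every node, plus one swap disk per node sized to that node's requested memory. A minimal sketch follows. The function name and the sample node sizes are illustrative assumptions, not values from Big Data Extensions.

```python
def required_datastore_gb(node_disk_gb, node_memory_gb):
    """Minimum datastore free space in GB for a planned cluster.

    node_disk_gb and node_memory_gb are parallel lists with one entry per
    Hadoop node. Each node's swap disk is sized to its requested memory.
    """
    if len(node_disk_gb) != len(node_memory_gb):
        raise ValueError("need one disk size and one memory size per node")
    return sum(node_disk_gb) + sum(node_memory_gb)


# Illustrative cluster: one master (50GB disk, 8GB RAM) and three workers
# (100GB disk, 4GB RAM each). The disks total 350GB and the swap disks 20GB.
needed = required_datastore_gb([50, 100, 100, 100], [8, 4, 4, 4])
```

For this example, needed is 370, so the datastores that back the cluster should have at least 370GB free.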

Hardware Requirements for the vSphere and Big Data Extensions Environment
Host hardware is listed in the VMware Compatibility Guide. To run at optimal performance, install your vSphere and Big Data Extensions environment on the following hardware.
• Dual Quad-core CPUs or greater that have Hyper-Threading enabled. If you can estimate your computing workload, consider using a more powerful CPU.
• Use High Availability (HA) and dual power supplies for the master node's host machine.
• 4-8GB of memory for each processor core, with 6% overhead for virtualization.
• Use a 1 Gigabit Ethernet interface or faster to provide adequate network bandwidth.

Tested Host and Virtual Machine Support
The maximum host and virtual machine support that has been confirmed to successfully run with Big Data Extensions is 128 physical hosts running a total of 512 virtual machines.

vSphere Licensing
You must use a vSphere Enterprise license or above to use VMware vSphere HA and vSphere DRS.
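
The per-core memory and disk drive guidance above can be turned into a quick sizing estimate for a host. The sketch below assumes that the 6% virtualization overhead is added on top of the 4-8GB per core figure. That is one reading of the guidance, not a statement from VMware, and both function names are made up for this example.

```python
import math


def host_memory_range_gb(cores, overhead=0.06):
    """Return (low, high) host memory in GB: 4-8GB per core plus overhead.

    Assumption: the virtualization overhead is added on top of the per-core figure.
    """
    low = cores * 4 * (1 + overhead)
    high = cores * 8 * (1 + overhead)
    return round(low, 2), round(high, 2)


def disk_drive_range(cores):
    """Return (low, high) disk drive counts: 1-1.5 drives per processor core."""
    return cores, math.ceil(cores * 1.5)
```

For a dual quad-core host (8 cores), host_memory_range_gb(8) gives a range of about 34GB to 68GB, and disk_drive_range(8) gives 8 to 12 drives, which matches the 8-12 disk drives per host recommended for Direct Attached Storage.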
Internationalization and Localization
Big Data Extensions supports internationalization (I18N) level 1. However, there are resources you specify
that do not provide UTF-8 support. You can use only ASCII attribute names consisting of alphanumeric
characters and underscores (_) for these resources.
Big Data Extensions Supports Unicode UTF-8
vCenter Server resources you specify using both the CLI and vSphere Web Client can be expressed with
underscore (_), hyphen (-), blank spaces, and all letters and numbers from any language. For example, you
can specify resources such as datastores labeled using non-English characters.
When using a Linux operating system, VMware recommends configuring the system for use with UTF-8
encoding specific to your locale. For example, to use U.S. English, specify the following locale encoding:
en_US.UTF-8. See your vendor's documentation for information on configuring UTF-8 encoding for your
Linux environment.
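
To check whether a Linux session is already using a UTF-8 locale such as en_US.UTF-8, you can inspect the standard POSIX locale environment variables. The following sketch is an illustration only. The function name is hypothetical, and the check looks only at LC_ALL, LC_CTYPE, and LANG in their POSIX order of precedence.

```python
import os


def utf8_locale_configured(environ=None):
    """Return True if the effective character-type locale names a UTF-8 codeset."""
    env = os.environ if environ is None else environ
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    # An empty value is treated the same as an unset variable.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            # Locale names look like language_TERRITORY.codeset@modifier.
            codeset = value.partition(".")[2].partition("@")[0]
            return codeset.lower().replace("-", "") == "utf8"
    return False
```

Calling utf8_locale_configured() with no arguments inspects the current process environment. Passing a dictionary such as {"LANG": "en_US.UTF-8"} lets you test a setting before you apply it.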
Special Character Support
The following vCenter Server resources can have a period (.) in their name, letting you select them using
both the CLI and vSphere Web Client.
• portgroup name
• cluster name
• resource pool name
• datastore name
The use of a period is not allowed in the Serengeti resource name.
Resources Excluded From Unicode UTF-8 Support
The Serengeti cluster specification file, manifest file, and topology racks-hosts mapping file do not provide
UTF-8 support. When creating these files to define the nodes and resources for use by the cluster, use only
ASCII attribute names consisting of alphanumeric characters and underscores (_).
The following resource names:
■ cluster name
■ nodeGroup name
■ node name
■ virtual machine name
The following attributes in the Serengeti cluster specification file:
■ distro name
■ role
■ cluster configuration
■ storage type
■ haFlag
■ instanceType
■ groupAssociationsType
The rack name in the topology racks-hosts mapping file, and the placementPolicies field of the Serengeti
cluster specification file.
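The ASCII-only naming rule in this section can be checked before you write a cluster specification file. The following Python sketch is one possible check, not part of Big Data Extensions. The function names is_valid_ascii_name and invalid_names are hypothetical, and the sample names in the usage section are made up for illustration.

```python
import re

# ASCII letters, digits, and underscores only, as required for these resources.
_ASCII_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_ascii_name(name: str) -> bool:
    """Return True if name is non-empty and uses only A-Z, a-z, 0-9, and _."""
    return _ASCII_NAME.fullmatch(name) is not None


def invalid_names(names):
    """Return the names that break the ASCII-only rule, in their original order."""
    return [name for name in names if not is_valid_ascii_name(name)]


if __name__ == "__main__":
    # Hypothetical node group and rack names, for illustration only.
    candidates = ["master", "worker_group1", "rack.1", "données"]
    print(invalid_names(candidates))
```

An explicit character class is used instead of \w because \w also matches non-ASCII letters in Python, which this rule excludes.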
Deploy the Big Data Extensions vApp in the vSphere Web Client
Deploying the Big Data Extensions vApp is the first step in getting your Hadoop cluster up and running
with vSphere Big Data Extensions.
Prerequisites
■ Install and configure vSphere. See “System Requirements for Big Data Extensions,” on page 20.
■ Configure all ESXi hosts to use the same NTP server.
■ On each ESXi host, add the NTP server to the host configuration, and from the host configuration's
  Startup Policy list, select Start and stop with host. The NTP daemon ensures that time-dependent
  processes occur in sync across hosts.
■ When installing Big Data Extensions on vSphere 5.1 or later, use vCenter Single Sign-On to provide
  user authentication.
■ Verify that you have one vSphere Enterprise license for each host on which you deploy virtual Hadoop
  nodes. You manage your vSphere licenses in the vSphere Web Client or in vCenter Server.
■ Install the Client Integration plug-in for the vSphere Web Client. This plug-in enables OVF deployment
  on your local file system.
  NOTE Depending on the security settings of your browser, you might have to approve the plug-in
  when you use it the first time.
■ Download the Big Data Extensions OVA from the VMware download site.
■ Verify that you have at least 40GB disk space available for the OVA. You need additional resources for
  the Hadoop cluster.
■ Ensure that you know the vCenter Single Sign-On Look-up Service URL for your vCenter Single Sign-On
  service.
  If you are installing Big Data Extensions on vSphere 5.1 or later, ensure that your environment includes
  vCenter Single Sign-On. Use vCenter Single Sign-On to provide user authentication on vSphere 5.1 or
  later.
See “System Requirements for Big Data Extensions,” on page 20 for a complete list.
Procedure
1 In the vSphere Web Client vCenter Hosts and Clusters view, select Actions > All vCenter Actions >
Deploy OVF Template.
2 Choose the location where the Big Data Extensions OVA resides and click Next.
Option            Description
Deploy from File  Browse your file system for an OVF or OVA template.
Deploy from URL   Type a URL to an OVF or OVA template located on the internet. For
                  example: http://vmware.com/VMTN/appliance.ovf.
3 View the OVF Template Details page and click Next.
4 Accept the license agreement and click Next.
5 Specify a name for the vApp, select a target datacenter for the OVA, and click Next.
The only valid characters for Big Data Extensions vApp names are alphanumeric characters and underscores. The
vApp name must be fewer than 60 characters. When you choose the vApp name, also consider how you will
name your clusters. Together, the vApp and cluster names must be fewer than 80 characters.
6 Select a vSphere resource pool for the OVA and click Next.
Select a top-level resource pool. Child resource pools are not supported by Big Data Extensions even
though you can select a child resource pool. If you select a child resource pool, you will not be able to
create clusters from Big Data Extensions.
7 Select shared storage for the OVA and click Next.
If shared storage is not available, local storage is acceptable.
8 For each network specified in the OVF template, select a network in the Destination Networks column
in your infrastructure to set up the network mapping.
The first network lets the Management Server communicate with your Hadoop cluster. The second
network lets the Management Server communicate with vCenter Server. If your vCenter Server
deployment does not use IPv6, you can specify the same IPv4 destination network for use by both
source networks.
9 Configure the network settings for your environment, and click Next.
a Enter the network settings that let the Management Server communicate with your Hadoop
cluster.
Use a static IPv4 (IP) network. An IPv4 address is four numbers separated by dots as in
aaa.bbb.ccc.ddd, where each number ranges from 0 to 255. You must enter a netmask, such as
255.255.255.0, and a gateway address, such as 192.168.1.253.
If the vCenter Server or any ESXi host or Hadoop distribution repository is resolved using a fully
qualified domain name (FQDN), you must enter a DNS address. Enter the DNS server IP address
as DNS Server 1. If there is a secondary DNS server, enter its IP address as DNS Server 2.
NOTE You cannot use a shared IP pool with Big Data Extensions.
b (Optional) If you are using IPv6 between the Management Server and vCenter Server, select the
Enable Ipv6 Connection? checkbox.
Enter the IPv6 address, or FQDN, of the vCenter Server. The IPv6 address size is 128 bits. The
preferred IPv6 address representation is: xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx where each x is
a hexadecimal digit representing 4 bits. IPv6 addresses range from
0000:0000:0000:0000:0000:0000:0000:0000 to ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff. For convenience, an IPv6
address may be abbreviated to shorter notations by application of the following rules.
■ Remove one or more leading zeroes from any groups of hexadecimal digits. This is usually
  done to either all or none of the leading zeroes. For example, the group 0042 is converted to 42.
■ Replace consecutive sections of zeroes with a double colon (::). You may only use the double
  colon once in an address, as multiple uses would render the address indeterminate. RFC 5952
  recommends that a double colon not be used to denote an omitted single section of zeroes.
The following example demonstrates applying these rules to the address
2001:0db8:0000:0000:0000:ff00:0042:8329.
■ Removing all leading zeroes results in the address 2001:db8:0:0:0:ff00:42:8329.
■ Omitting consecutive sections of zeroes results in the address 2001:db8::ff00:42:8329.
See RFC 4291 for more information on IPv6 address notation.
10 Verify that the Initialize Resources check box is selected and click Next.
If the check box is unselected, the resource pool, datastore, and network connection assigned to the
vApp will not be added to Big Data Extensions.
If you do not add the resource pool, datastore, and network when you deploy the vApp, use the
vSphere Web Client or the Serengeti Command-Line Interface Client to specify the resource pool,
datastore, and network information before you create a Hadoop cluster.
11 Verify the vService bindings and click Next.
12 Verify the installation information and click Finish.
vCenter Server deploys the Big Data Extensions vApp. When deployment finishes, two virtual
machines are available in the vApp.
■ The Management Server virtual machine, management-server (also called the Serengeti
  Management Server), which is started as part of the OVA deployment.
■ The Hadoop Template virtual machine, hadoop-template, which is not started. Big Data Extensions
  clones Hadoop nodes from this template when provisioning a cluster. Do not start or stop this
  virtual machine without good reason. The template does not include a Hadoop distribution.
IMPORTANT Do not delete any files under the /opt/serengeti/.chef directory. If you delete any of these
files, such as the serengeti.pem file, subsequent upgrades to Big Data Extensions might fail without
displaying error notifications.
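The values entered in steps 5 and 9 of the procedure above can be checked before you start the deployment wizard. The following Python sketch uses only the standard ipaddress and re modules. It is not part of Big Data Extensions, and all function names are hypothetical. It assumes that "alphanumeric" means ASCII letters and digits. The address 192.168.1.10 is an illustrative value, while the netmask, gateway, and IPv6 address are the examples given in step 9.

```python
import ipaddress
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
_VAPP_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_names(vapp_name, cluster_name=""):
    """Return a list of naming problems from step 5 (empty if there are none)."""
    problems = []
    if not _VAPP_NAME.fullmatch(vapp_name):
        problems.append("vApp name: use only alphanumeric characters and underscores")
    if len(vapp_name) >= 60:
        problems.append("vApp name: must be fewer than 60 characters")
    if len(vapp_name) + len(cluster_name) >= 80:
        problems.append("vApp and cluster names: must total fewer than 80 characters")
    return problems


def parse_ipv4_settings(address, netmask, gateway):
    """Parse static IPv4 settings; raises ValueError if any value is malformed."""
    interface = ipaddress.IPv4Interface(f"{address}/{netmask}")
    return interface, ipaddress.IPv4Address(gateway)


def strip_leading_zeroes(address):
    """Apply only the first abbreviation rule: drop leading zeroes in each group."""
    groups = ipaddress.IPv6Address(address).exploded.split(":")
    return ":".join(format(int(group, 16), "x") for group in groups)


def abbreviate_ipv6(address):
    """Apply both rules: drop leading zeroes and collapse the zero run to '::'."""
    return ipaddress.IPv6Address(address).compressed


if __name__ == "__main__":
    print(check_names("bde_vapp", "hadoop_cluster"))
    interface, gateway = parse_ipv4_settings(
        "192.168.1.10", "255.255.255.0", "192.168.1.253"
    )
    print(interface.network, gateway in interface.network)
    full = "2001:0db8:0000:0000:0000:ff00:0042:8329"
    print(strip_leading_zeroes(full))
    print(abbreviate_ipv6(full))
```

The check that the gateway lies inside the configured subnet is an extra sanity test added for illustration. The procedure itself only requires that a netmask and gateway be entered.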
What to do next
Install the Big Data Extensions plug-in within the vSphere Web Client. See “Install the Big Data Extensions
Plug-In,” on page 27.
If the Initialize Resources check box is not selected, add resources to the Big Data Extensions server before
you create a Hadoop cluster.
Install RPMs in the Serengeti Management Server Yum Repository
Install the wsdl4j and mailx RPM packages within the Serengeti Management Server's internal Yum
repository.
Prerequisites
Deploy the Big Data Extensions vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web
Client,” on page 23.
Procedure
1 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as the
user serengeti.
2 Download and install the wsdl4j and mailx RPM packages.
■ If the Serengeti Management Server can connect to the Internet, run the commands as shown in the
  example below to download the RPMs, copy the files to the required directory, and create a
  repository.
cd /opt/serengeti/www/yum/repos/centos/6/base/RPMS/
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/mailx-12.4-7.el6.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/wsdl4j-1.5.2-7.8.el6.noarch.rpm
createrepo ..
■ If the Serengeti Management Server cannot connect to the Internet, you must manually download
  the RPMs, copy the files to the required directory, and create a repository.
a Download the RPM files as shown in the example below.
Install the Serengeti Remote Command-Line Interface client. See “Install the Serengeti Remote Command-
Line Interface Client,” on page 29.
Install the Big Data Extensions Plug-In
To enable the Big Data Extensions user interface for use with a vCenter Server Web Client, register the plug-in with the vSphere Web Client. The Big Data Extensions graphical user interface is supported only when
using vSphere Web Client 5.1 and later. If you install Big Data Extensions on vSphere 5.0, perform all
administrative tasks using the Serengeti Command-Line Interface Client.
The Big Data Extensions plug-in provides a graphical user interface that integrates with the
vSphere Web Client. Using the Big Data Extensions plug-in interface you can perform common Hadoop
infrastructure and cluster management tasks.
NOTE Use only the Big Data Extensions plug-in interface in the vSphere Web Client or the Serengeti
Command-Line Interface Client to monitor and manage your Big Data Extensions environment. Performing
management operations in vCenter Server might cause the Big Data Extensions management tools to
become unsynchronized and unable to accurately report the operational status of your Big Data Extensions
environment.
Prerequisites
■ Deploy the Big Data Extensions vApp. See “Deploy the Big Data Extensions vApp in the vSphere Web
  Client,” on page 23.
■ Ensure that you have login credentials with administrator privileges for the vCenter Server system with
  which you are registering Big Data Extensions.
  NOTE The user name and password you use to log in cannot contain characters whose UTF-8 encoding
  is greater than 0x8000.
■ If you want to use the vCenter Server IP address to access the vSphere Web Client, and your browser
  uses a proxy, add the vCenter Server IP address to the list of proxy exceptions.
Procedure
1 Open a Web browser and go to the URL of the vSphere Web Client 5.1 or later.
The hostname can be either the DNS hostname or IP address of vCenter Server. By default the port is
9443, but this might have been changed during installation of the vSphere Web Client.
2 Type the user name and password of an account that has administrative privileges on vCenter
Server, and click Login.
3 Using the vSphere Web Client Navigator panel, locate the Serengeti Management Server that you want
to register with the plug-in.
You can find the Serengeti Management Server under the datacenter and resource pool into which you
deployed it in the previous task.
4 From the inventory tree, select management-server to display information about the Serengeti
Management Server in the center pane.
Click the Summary tab in the center pane to access additional information.
5 Note the IP address of the Serengeti Management Server virtual machine.
6 Open a Web browser and go to the URL of the management-server virtual machine.
The management-server-ip-address is the IP address you noted in Step 5.
7 Enter the information to register the plug-in.
Option                                  Description
Register or Unregister Radio Button     Select the Install radio button to install the plug-in. Select
                                        Uninstall to uninstall the plug-in.
vCenter Server host name or IP address  Type the server host name or IP address of vCenter Server.
                                        NOTE Do not include http:// or https:// when you type the host
                                        name or IP address.
User Name and Password                  Type the user name and password with administrative privileges that
                                        you use to connect to vCenter Server. The user name and password
                                        cannot contain characters whose UTF-8 encoding is greater than
                                        0x8000.
Big Data Extensions Package URL         The URL with the IP address of the management-server virtual
                                        machine where the Big Data Extensions plug-in package is located:
The Big Data Extensions plug-in registers with vCenter Server and with the vSphere Web Client.
9 Log out of the vSphere Web Client, and log back in using your vCenter Server user name and
password.
The Big Data Extensions icon appears in the list of objects in the inventory.
10 From the Inventory pane, click Big Data Extensions.
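The note in step 7 about omitting http:// and https:// from the vCenter Server host entry can be handled in a script that prepares the registration values. The following Python sketch shows one way to do this. The function name normalize_host and the sample host names are hypothetical and are not part of Big Data Extensions.

```python
def normalize_host(entry: str) -> str:
    """Strip a leading http:// or https:// and any trailing slash from a host entry."""
    host = entry.strip()
    lowered = host.lower()
    for scheme in ("http://", "https://"):
        if lowered.startswith(scheme):
            # Drop the scheme but keep the original casing of the host itself.
            host = host[len(scheme):]
            break
    return host.rstrip("/")


if __name__ == "__main__":
    # Hypothetical entries, for illustration only.
    for raw in ("https://vcenter.example.com/", "10.0.0.5", " HTTP://vc01 "):
        print(repr(raw), "->", normalize_host(raw))
```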
What to do next
Connect the Big Data Extensions plug-in to the Big Data Extensions instance that you want to manage by
connecting to its Serengeti Management Server.
Connect to a Serengeti Management Server
To use the Big Data Extensions plug-in to manage and monitor Big Data clusters and Hadoop distributions,
you must connect the Big Data Extensions plug-in to the Serengeti Management Server in your Big Data
Extensions deployment.
You can deploy multiple instances of the Serengeti Management Server in your environment. However, you
can connect the Big Data Extensions plug-in with only one Serengeti Management Server instance at a time.
You can change which Serengeti Management Server instance the plug-in connects to using this procedure,
and use the Big Data Extensions plug-in interface to manage and monitor multiple Hadoop and HBase
distributions deployed in your environment.
IMPORTANT The Serengeti Management Server that you connect to is shared by all users of the
Big Data Extensions plug-in interface within the vSphere Web Client. If a user connects to a different
Serengeti Management Server, all other users are affected by this change.
Prerequisites
■ Verify that the Big Data Extensions vApp deployment was successful and that the Serengeti
  Management Server virtual machine is running. See “Deploy the Big Data Extensions vApp in the
  vSphere Web Client,” on page 23.
  The version of the Serengeti Management Server and the Big Data Extensions plug-in must be the same.
■ Ensure that vCenter Single Sign-On is enabled and configured for use by Big Data Extensions for
  vSphere 5.1 and later. See “Deploy the Big Data Extensions vApp in the vSphere Web Client,” on
  page 23.
■ Install the Big Data Extensions plug-in. See “Install the Big Data Extensions Plug-In,” on page 27.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 Click the Summary tab.
4 In the Connected Server dialog, click the Connect Server link.
5 Navigate to the Serengeti Management Server virtual machine within the Big Data Extensions vApp to
which you want to connect, select it, and click OK.
The Big Data Extensions plug-in communicates using SSL with the Serengeti Management Server.
When you connect to a Serengeti server instance, the plug-in verifies that the SSL certificate in use by
the server is installed, valid, and trusted.
The Serengeti server instance appears as the connected server in the Summary tab of the Big Data
Extensions Home.
What to do next
You can add additional resource pool, datastore, and network resources to your Big Data Extensions
deployment, and create Hadoop and HBase clusters that you can provision for use.
Install the Serengeti Remote Command-Line Interface Client
Although the Big Data Extensions Plug-in for vSphere Web Client supports basic resource and cluster
management tasks, you can perform a greater number of the management tasks using the Serengeti
Command-line Interface Client.
IMPORTANT You can run Hadoop commands from the Serengeti CLI only on a cluster running the Apache
Hadoop 1.2.1 distribution. To run Hadoop administrative commands, such as cfg, fs, mr, pig, and hive,
from the command line for clusters running other Hadoop distributions, use a Hadoop client node to run
these commands.
Prerequisites
■ Verify that the Big Data Extensions vApp deployment was successful and that the Management Server
  is running.
■ Verify that you have the correct user name and password to log in to the Serengeti Command-line
  Interface Client.
■ If you are deploying on vSphere 5.1 or later, the Serengeti Command-line Interface Client uses your
  vCenter Single Sign-On credentials.
■ If you are deploying on vSphere 5.0, the Serengeti Command-line Interface Client uses the default
  vCenter Server administrator credentials.
■ Verify that the Java Runtime Environment (JRE) is installed in your environment, and that its location is
  in your PATH environment variable.
Procedure
1 Use the vSphere Web Client to log in to vCenter Server.
2 Select Big Data Extensions.
3 Click the Getting Started tab, and click the Download Serengeti CLI Console link.
A ZIP file containing the Serengeti Command-line Interface Client downloads to your computer.
4 Unzip and examine the download, which includes the following components in the cli directory.
■ The serengeti-cli-version JAR file, which includes the Serengeti Command-line Interface Client.
■ The samples directory, which includes sample cluster configurations.
■ Libraries in the lib directory.
5 Open a command shell, and navigate to the directory where you unzipped the Serengeti Command-line
Interface Client download package.
6 Change to the cli directory, and run the following command to open the Serengeti Command-line
Interface Client:
java -jar serengeti-cli-version.jar
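The JRE prerequisite and the launch command above can be combined into a small pre-flight check. The following Python sketch is one possible approach and is not shipped with Big Data Extensions. The function names are hypothetical, and the glob pattern serengeti-cli-*.jar is an assumption based on the serengeti-cli-version.jar naming shown in step 6.

```python
import glob
import os
import shutil


def find_cli_jar(cli_dir: str):
    """Return the path of the serengeti-cli JAR in cli_dir, or None if absent."""
    # Assumption: the JAR is named serengeti-cli-<version>.jar.
    matches = sorted(glob.glob(os.path.join(cli_dir, "serengeti-cli-*.jar")))
    return matches[-1] if matches else None


def build_launch_command(cli_dir: str):
    """Return the argument list for 'java -jar <jar>', to be run from cli_dir."""
    java = shutil.which("java")  # searches the PATH environment variable
    if java is None:
        raise RuntimeError("java was not found on PATH; install a JRE first")
    jar = find_cli_jar(cli_dir)
    if jar is None:
        raise FileNotFoundError(f"no serengeti-cli JAR found in {cli_dir}")
    return [java, "-jar", os.path.basename(jar)]


if __name__ == "__main__":
    # Assumes the current directory is where the download was unzipped.
    print(build_launch_command("cli"))
```

The returned list can be passed to subprocess.run with cwd set to the cli directory, which mirrors changing to that directory before running the command.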
What to do next
To learn more about using the Serengeti Command-line Interface Client, see the VMware vSphere Big Data
Extensions Command-line Interface Guide.
Access the Serengeti Command-Line Interface Using the Remote Command-Line Interface Client
You can access the Serengeti Command-Line Interface using the Serengeti Remote Command-Line Interface
Client. The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management
Server to deploy, manage, and use Hadoop.
IMPORTANT You can run Hadoop commands from the Serengeti CLI only on a cluster running the Apache
Hadoop 1.2.1 distribution. To run Hadoop administrative commands, such as cfg, fs, mr, pig, and hive,
from the command line for clusters running other Hadoop distributions, use a Hadoop client node to run
these commands.
Prerequisites
■ Use the vSphere Web Client to log in to the vCenter Server instance on which you deployed the
  Serengeti vApp.
■ Verify that the Big Data Extensions vApp deployment was successful and that the Serengeti
  Management Server is running.
■ Verify that you have the correct password to log in to Serengeti Command-Line Interface Client. See
  “Create a User Name and Password for the Serengeti Command-Line Interface,” on page 66.
  The Serengeti Command-Line Interface Client uses its vCenter Server credentials.
■ Verify that the Java Runtime Environment (JRE) is installed in your environment and that its location is
  in your PATH environment variable.
Procedure
1 Open a Web browser to connect to the Serengeti Management Server cli directory.
http://ip_address/cli
2 Download the ZIP file for your version and build.
The filename is in the format VMware-Serengeti-cli-version_number-build_number.ZIP.
3 Unzip the download.
The download includes the following components.
■ The serengeti-cli-version_number JAR file, which includes the Serengeti Remote Command-Line
  Interface Client.