This document supports the version of each product listed and
supports all subsequent versions until the document is
replaced by a new edition. To check for more recent editions
of this document, see http://www.vmware.com/support/pubs.
EN-001536-00
VMware vSphere Big Data Extensions Command-Line Interface Guide
You can find the most up-to-date technical documentation on the VMware Web site at:
http://www.vmware.com/support/
The VMware Web site also provides the latest product updates.
If you have comments about this documentation, submit your feedback to:
Default HBase Cluster Configuration for Serengeti 21
About Cluster Topology 21
About HBase Clusters 24
About MapReduce Clusters 31
About Data Compute Clusters 34
About Customized Clusters 43

4 Managing Hadoop and HBase Clusters 51
Stop and Start a Hadoop or HBase Cluster with the Serengeti Command-Line Interface 51
Scale Out a Hadoop or HBase Cluster with the Serengeti Command-Line Interface 52
Scale CPU and RAM with the Serengeti Command-Line Interface 52
Reconfigure a Big Data Cluster with the Serengeti Command-Line Interface 53
About Resource Usage and Elastic Scaling 55
Delete a Cluster by Using the Serengeti Command-Line Interface 60
About vSphere High Availability and vSphere Fault Tolerance 60
Reconfigure a Node Group with the Serengeti Command-Line Interface 60
Recover from Disk Failure with the Serengeti Command-Line Interface Client 61

5 Monitoring the Big Data Extensions Environment 63
View List of Application Managers by Using the Serengeti Command-Line Interface 63
View Available Hadoop Distributions with the Serengeti Command-Line Interface 64
View Supported Distributions for All Application Managers by Using the Serengeti Command-Line Interface 64
View Configurations or Roles for Application Manager and Distribution by Using the Serengeti Command-Line Interface 64
View Provisioned Hadoop and HBase Clusters with the Serengeti Command-Line Interface 65
View Datastores with the Serengeti Command-Line Interface 65
View Networks with the Serengeti Command-Line Interface 65
View Resource Pools with the Serengeti Command-Line Interface 66

6 Cluster Specification Reference 67
Cluster Specification File Requirements 67
Cluster Definition Requirements 67
Annotated Cluster Specification File 68
Cluster Specification Attribute Definitions 72
White Listed and Black Listed Hadoop Attributes 74
Convert Hadoop XML Files to Serengeti JSON Files 75

7 Serengeti CLI Command Reference 77
appmanager Commands 77
cluster Commands 79
connect Command 85
datastore Commands 86
disconnect Command 86
distro list Command 87
fs Commands 87
hive script Command 92
mr Commands 92
network Commands 95
pig script Command 97
resourcepool Commands 97
topology Commands 98

Index 99
About This Book
VMware vSphere Big Data Extensions Command-Line Interface Guide describes how to use the Serengeti
Command-Line Interface (CLI) to manage the vSphere resources that you use to create Hadoop and HBase
clusters, and how to create, manage, and monitor Hadoop and HBase clusters with the VMware Serengeti™
CLI.
VMware vSphere Big Data Extensions Command-Line Interface Guide also describes how to perform Hadoop and
HBase operations with the Serengeti CLI, and provides cluster specification and Serengeti CLI command
references.
Intended Audience
This guide is for system administrators and developers who want to use Serengeti to deploy and manage
Hadoop clusters. To successfully work with Serengeti, you should be familiar with Hadoop and VMware
vSphere®.
VMware Technical Publications Glossary
VMware Technical Publications provides a glossary of terms that might be unfamiliar to you. For definitions
of terms as they are used in VMware technical documentation, go to
http://www.vmware.com/support/pubs.
Chapter 1 Using the Serengeti Remote Command-Line Interface Client
The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management Server to
deploy, manage, and use Hadoop.
This chapter includes the following topics:
"Access the Serengeti CLI By Using the Remote CLI Client," on page 7
"Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client," on page 8
Access the Serengeti CLI By Using the Remote CLI Client
You can access the Serengeti Command-Line Interface (CLI) to perform Serengeti administrative tasks with
the Serengeti Remote CLI Client.
IMPORTANT You can run Hadoop commands from the Serengeti CLI only on a cluster running the Apache Hadoop 1.2.1 distribution. For clusters running other Hadoop distributions, use a Hadoop client node to run Hadoop administrative commands such as cfg, fs, mr, pig, and hive.
Prerequisites
Use the VMware vSphere Web Client to log in to the VMware vCenter Server® on which you deployed the Serengeti vApp.
Verify that the Serengeti vApp deployment was successful and that the Management Server is running.
Verify that you have the correct password to log in to the Serengeti CLI. See the VMware vSphere Big Data Extensions Administrator's and User's Guide. The Serengeti CLI uses its vCenter Server credentials.
Verify that the Java Runtime Environment (JRE) is installed in your environment and that its location is in your path environment variable.
Procedure
1. Open a Web browser to connect to the Serengeti Management Server cli directory.
http://ip_address/cli
2. Download the ZIP file for your version and build.
The filename is in the format VMware-Serengeti-cli-version_number-build_number.ZIP.
3. Unzip the download.
The download includes the following components:
The serengeti-cli-version_number JAR file, which includes the Serengeti Remote CLI Client.
The samples directory, which includes sample cluster configurations.
Libraries in the lib directory.
4. Open a command shell, and change to the directory where you unzipped the package.
5. Change to the cli directory, and run the following command to enter the Serengeti CLI.
For any language other than French or German, run the following command:
java -jar serengeti-cli-version_number.jar
For French or German languages, which use code page 850 (CP 850) language encoding when running the Serengeti CLI from a Windows command console, run the following command:
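For example, assuming the standard JVM file.encoding system property controls the console encoding (verify this form against your CLI download):
java -Dfile.encoding=cp850 -jar serengeti-cli-version_number.jar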
6. Connect to the Serengeti server.
You must run the connect host command every time you begin a CLI session, and again after the 30-minute session timeout. If you do not run this command, you cannot run any other commands.
a. Run the connect command.
connect --host xx.xx.xx.xx:8443
b. At the prompt, type your user name, which might be different from your login credentials for the Serengeti Management Server.
NOTE If you do not create a user name and password for the Serengeti Command-Line Interface Client, you can use the default vCenter Server administrator credentials. The Serengeti Command-Line Interface Client uses the vCenter Server login credentials with read permissions on the Serengeti Management Server.
c. At the prompt, type your password.
A command shell opens, and the Serengeti CLI prompt appears. You can use the help command to get help with Serengeti commands and command syntax:
To display a list of available commands, type help.
To get help for a specific command, append the name of the command to the help command. For example: help cluster create
Press Tab to complete a command.
Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client
To perform troubleshooting or to run your management automation scripts, log in to Hadoop master,
worker, and client nodes with password-less SSH from the Serengeti Management Server using SSH client
tools such as SSH, PDSH, ClusterSSH, and Mussh.
You can use a user name and password authenticated login to connect to Hadoop cluster nodes over SSH.
All deployed nodes are password-protected with either a random password or a user-specified password
that was assigned when the cluster was created.
Prerequisites
Use the vSphere Web Client to log in to vCenter Server, and verify that the Serengeti Management Server
virtual machine is running.
Procedure
1. Right-click the Serengeti Management Server virtual machine and select Open Console.
The password for the Serengeti Management Server appears.
NOTE If the password scrolls off the console screen, press Ctrl+D to return to the command prompt.
2. Use the vSphere Web Client to log in to the Hadoop node.
The password for the root user appears on the virtual machine console in the vSphere Web Client.
3. Change the Hadoop node's password by running the set-password -u command.
sudo /opt/serengeti/sbin/set-password -u
Chapter 2 Managing the Big Data Extensions Environment by Using the Serengeti Command-Line Interface
You must manage your Big Data Extensions environment. If you choose not to add the resource pool, datastore, and network when you deploy the Serengeti vApp, you must add these vSphere resources before you create a Hadoop or HBase cluster. You must also add additional application managers if you want to use either Ambari or Cloudera Manager to manage your Hadoop clusters. You can remove resources that you no longer need.
This chapter includes the following topics:
"About Application Managers," on page 11
"Add a Resource Pool with the Serengeti Command-Line Interface," on page 14
"Remove a Resource Pool with the Serengeti Command-Line Interface," on page 15
"Add a Datastore with the Serengeti Command-Line Interface," on page 15
"Remove a Datastore with the Serengeti Command-Line Interface," on page 15
"Add a Network with the Serengeti Command-Line Interface," on page 16
"Remove a Network with the Serengeti Command-Line Interface," on page 16
"Reconfigure a Static IP Network with the Serengeti Command-Line Interface," on page 17
About Application Managers
You can use Cloudera Manager, Ambari, and the default application manager to provision and manage
clusters with VMware vSphere Big Data Extensions.
After you add a new Cloudera Manager or Ambari application manager to Big Data Extensions, you can
redirect your software management tasks, including monitoring and managing clusters, to that application
manager.
You can use an application manager to perform the following tasks:
List all available vendor instances, supported distributions, and configurations or roles for a specific application manager and distribution.
Create clusters.
Monitor and manage services from the application manager console.
Check the documentation for your application manager for tool-specific requirements.
Restrictions
The following restrictions apply to Cloudera Manager and Ambari application managers:
To add an application manager with HTTPS, use the FQDN instead of the URL.
You cannot rename a cluster that was created with a Cloudera Manager or Ambari application manager.
You cannot change services for a big data cluster from Big Data Extensions if the cluster was created with the Ambari or Cloudera Manager application manager.
To change services, configurations, or both, you must make the changes manually from the application manager on the nodes. If you install new services, Big Data Extensions starts and stops the new services together with the old services.
If you use an application manager to change services and big data cluster configurations, those changes cannot be synced from Big Data Extensions. The nodes that you created with Big Data Extensions do not contain the new services or configurations.
Add an Application Manager by Using the Serengeti Command-Line Interface
To use either Cloudera Manager or Ambari application managers, you must add the application manager
and add server information to Big Data Extensions.
NOTE If you want to add a Cloudera Manager or Ambari application manager with HTTPS, you should use
the FQDN in place of the URL.
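Procedure
1. Access the Serengeti CLI.
2. Run the appmanager add command.
A sketch of the command follows; the --type values and the parameter layout are assumptions based on the appmanager modify syntax shown later in this chapter, so verify them against "appmanager Commands," on page 77.
appmanager add --name application_manager_name --type [ClouderaManager|Ambari] --url <http[s]://server:port>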
Application manager names can include only alphanumeric characters ([0-9, a-z, A-Z]) and the following special characters: underscores, hyphens, and blank spaces.
You can use the optional description variable to include a description of the application manager
instance.
3. Enter your user name and password at the prompt.
4. If you specified SSL, enter the file path of the SSL certificate at the prompt.
What to do next
To verify that the application manager was added successfully, run the appmanager list command.
Modify an Application Manager by Using the Serengeti Command-Line Interface
You can modify the information for an application manager with the Serengeti CLI, for example, you can
change the manager server IP address if it is not a static IP, or you can upgrade the administrator account.
Prerequisites
Verify that you have at least one external application manager installed on your Big Data Extensions
environment.
Procedure
1. Access the Serengeti CLI.
2. Run the appmanager modify command.
appmanager modify --name application_manager_name --url <http[s]://server:port>
Additional parameters are available for this command. For more information about this command, see
“appmanager modify Command,” on page 78.
View Supported Distributions for All Application Managers by Using the
Serengeti Command-Line Interface
You can list the Hadoop distributions that are supported on the application managers in the Big Data Extensions environment to determine whether a particular distribution is available on an application manager in your Big Data Extensions environment.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1. Access the Serengeti CLI.
2. Run the appmanager list command.
appmanager list --name application_manager_name [--distros]
If you do not include the --name parameter, the command returns a list of all the Hadoop distributions that are supported on each of the application managers in the Big Data Extensions environment. Otherwise, the command returns a list of all distributions that are supported for the application manager that you specify.
View Configurations or Roles for Application Manager and Distribution by
Using the Serengeti Command-Line Interface
You can use the appmanager list command to list the Hadoop configurations or roles for a specific application manager and distribution.
The configuration list includes those configurations that you can use to configure the cluster in the cluster
specifications.
The role list contains the roles that you can use to create a cluster. You should not use unsupported roles to
create clusters in the application manager.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1Access the Serengeti CLI.
2Run the appmanager list command.
appmanager list --name application_manager_name [--distro distro_name
(--configurations | --roles) ]
The command returns a list of the Hadoop configurations or roles for a specific application manager and
distribution.
View List of Application Managers by Using the Serengeti Command-Line Interface
You can use the appmanager list command to list the application managers that are installed in the Big Data Extensions environment.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1. Access the Serengeti CLI.
2. Run the appmanager list command.
appmanager list
The command returns a list of all application managers that are installed in the Big Data Extensions environment.
Delete an Application Manager by Using the Serengeti Command-Line Interface
You can use the Serengeti CLI to delete an application manager when you no longer need it.
The application manager that you delete must not contain any clusters; otherwise, the deletion fails.
Prerequisites
Verify that you have at least one external application manager installed on your Big Data Extensions
environment.
Procedure
1. Access the Serengeti CLI.
2. Run the appmanager delete command.
appmanager delete --name application_manager_name
Add a Resource Pool with the Serengeti Command-Line Interface
You add resource pools to make them available for use by Hadoop clusters. Resource pools must be located
at the top level of a cluster. Nested resource pools are not supported.
When you add a resource pool to Big Data Extensions it symbolically represents the actual vSphere resource
pool as recognized by vCenter Server. This symbolic representation lets you use the Big Data Extensions
resource pool name, instead of the full path of the resource pool in vCenter Server, in cluster specification
files.
NOTE After you add a resource pool to Big Data Extensions, do not rename the resource pool in vSphere. If
you rename it, you cannot perform Serengeti operations on clusters that use that resource pool.
Procedure
1. Access the Serengeti Command-Line Interface client.
2. Run the resourcepool add command.
The --vcrp parameter is optional.
This example adds a Serengeti resource pool named myRP to the vSphere rp1 resource pool that is
contained by the cluster1 vSphere cluster.
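A sketch of the example command; the --vccluster and --vcrp parameter names are assumptions, so verify them against "resourcepool Commands," on page 97.
resourcepool add --name myRP --vccluster cluster1 --vcrp rp1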
Remove a Resource Pool with the Serengeti Command-Line Interface
You can remove resource pools from Serengeti that are not in use by a Hadoop cluster. You remove resource
pools when you do not need them or if you want the Hadoop clusters you create in the Serengeti
Management Server to be deployed under a different resource pool. Removing a resource pool removes its
reference in vSphere. The resource pool is not deleted.
Procedure
1. Access the Serengeti Command-Line Interface client.
2. Run the resourcepool delete command.
If the command fails because the resource pool is referenced by a Hadoop cluster, you can use the
resourcepool list command to see which cluster is referencing the resource pool.
This example deletes the resource pool named myRP.
resourcepool delete --name myRP
Add a Datastore with the Serengeti Command-Line Interface
You can add shared and local datastores to the Serengeti server to make them available to Hadoop clusters.
Procedure
1. Access the Serengeti CLI.
2. Run the datastore add command.
This example adds a new, local storage datastore named myLocalDS. The --spec parameter’s value,
local*, is a wildcard specifying a set of vSphere datastores. All vSphere datastores whose names begin
with “local” are added and managed as a whole by Serengeti.
datastore add --name myLocalDS --spec local* --type LOCAL
What to do next
After you add a datastore to Big Data Extensions, do not rename the datastore in vSphere. If you rename it,
you cannot perform Serengeti operations on clusters that use that datastore.
Remove a Datastore with the Serengeti Command-Line Interface
You can remove any datastore from Serengeti that is not referenced by any Hadoop clusters. Removing a
datastore removes only the reference to the vCenter Server datastore. The datastore itself is not deleted.
You remove datastores if you do not need them or if you want to deploy the Hadoop clusters that you create
in the Serengeti Management Server under a different datastore.
Procedure
1. Access the Serengeti CLI.
2. Run the datastore delete command.
If the command fails because the datastore is referenced by a Hadoop cluster, you can use the datastore list command to see which cluster is referencing the datastore.
This example deletes the myDS datastore.
datastore delete --name myDS
Add a Network with the Serengeti Command-Line Interface
You add networks to Serengeti to make their IP addresses available to Hadoop clusters. A network is a port
group, as well as a means of accessing the port group through an IP address.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add the
network.
Procedure
1. Access the Serengeti CLI.
2. Run the network add command.
This example adds a network named myNW to the 10PG vSphere port group. Virtual machines that use
this network use DHCP to obtain the IP addresses.
network add --name myNW --portGroup 10PG --dhcp
This example adds a network named myNW to the 10PG vSphere port group. Hadoop nodes use
addresses in the 192.168.1.2-100 IP address range, the DNS server IP address is 10.111.90.2, the gateway
address is 192.168.1.1, and the subnet mask is 255.255.255.0.
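A sketch of this example command; the --ip, --dns, --gateway, and --mask parameter names are assumptions, so verify them against "network Commands," on page 95.
network add --name myNW --portGroup 10PG --ip 192.168.1.2-100 --dns 10.111.90.2 --gateway 192.168.1.1 --mask 255.255.255.0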
After you add a network to Big Data Extensions, do not rename it in vSphere. If you rename the network,
you cannot perform Serengeti operations on clusters that use that network.
Remove a Network with the Serengeti Command-Line Interface
You can remove networks from Serengeti that are not referenced by any Hadoop clusters. Removing an
unused network frees the IP addresses for reuse.
Procedure
1. Access the Serengeti CLI.
2. Run the network delete command.
network delete --name network_name
If the command fails because the network is referenced by a Hadoop cluster, you can use the network list --detail command to see which cluster is referencing the network.
Reconfigure a Static IP Network with the Serengeti Command-Line Interface
You can reconfigure a Serengeti static IP network by adding IP address segments to it. You might need to
add IP address segments so that there is enough capacity for a cluster that you want to create.
If the IP range that you specify includes IP addresses that are already in the network, Serengeti ignores the
duplicated addresses. The remaining addresses in the specified range are added to the network. If the
network is already used by a cluster, the cluster can use the new IP addresses after you add them to the
network. If only part of the IP range is used by a cluster, the unused IP addresses can be used when you create
a new cluster.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add the
network.
Procedure
1. Access the Serengeti CLI.
2. Run the network modify command.
This example adds IP addresses from 192.168.1.2 to 192.168.1.100 to a network named myNetwork.
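A sketch of the command; the --addIP parameter name is an assumption, so verify it against "network Commands," on page 95.
network modify --name myNetwork --addIP 192.168.1.2-100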
Chapter 3 Creating Hadoop and HBase Clusters
With Big Data Extensions, you can create and deploy Hadoop and HBase clusters. A big data cluster is a type of computational cluster designed for storing and analyzing large amounts of unstructured data in a distributed computing environment.
Restrictions
When you create an HBase only cluster, you must use the default application manager because the other application managers do not support HBase only clusters.
You cannot rename a cluster that was created with the Cloudera Manager or Ambari application manager.
Requirements
The resource requirements are different for clusters created with the Serengeti Command-Line Interface and
the Big Data Extensions plug-in for the vSphere Web Client because the clusters use different default
templates. The default clusters created by using the Serengeti CLI are targeted for Project Serengeti users
and proof-of-concept applications, and are smaller than the Big Data Extensions plug-in templates, which
are targeted for larger deployments for commercial use.
Some deployment configurations require more resources than other configurations. For example, if you
create a Greenplum HD 1.2 cluster, you cannot use the small size virtual machine. If you create a default
MapR or Greenplum HD cluster by using the Serengeti CLI, at least 550 GB of storage and 55 GB of memory
are recommended. For other Hadoop distributions, at least 350 GB of storage and 35 GB of memory are
recommended.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the cluster's
virtual machine automatic migration. Although this prevents vSphere from automatically migrating the
virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using
the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters.
Performing such management functions outside of the Big Data Extensions environment can make it
impossible for you to perform some Big Data Extensions operations, such as disk failure recovery.
Passwords must be from 8 to 128 characters, and include only alphanumeric characters ([0-9, a-z, A-Z]) and the following special characters: _ @ # $ % ^ & *.
This chapter includes the following topics:
"About Hadoop and HBase Cluster Deployment Types," on page 20
"Serengeti's Default Hadoop Cluster Configuration," on page 20
"Default HBase Cluster Configuration for Serengeti," on page 21
"About Cluster Topology," on page 21
"About HBase Clusters," on page 24
"About MapReduce Clusters," on page 31
"About Data Compute Clusters," on page 34
"About Customized Clusters," on page 43
About Hadoop and HBase Cluster Deployment Types
With Big Data Extensions, you can create and use several types of big data clusters.
You can create the following types of clusters.
Basic Hadoop Cluster: Simple Hadoop deployment for proof of concept projects and other small-scale data processing tasks. The Basic Hadoop cluster contains HDFS and the MapReduce framework. The MapReduce framework processes problems in parallel across huge datasets in the HDFS.

HBase Cluster: Runs on top of HDFS and provides a fault-tolerant way of storing large quantities of sparse data.

Data and Compute Separation Cluster: Separates the data and compute nodes. In this type of cluster, the data node and compute node are not on the same virtual machine.

Compute Only Cluster: Contains only compute nodes, for example JobTracker, TaskTracker, ResourceManager, and NodeManager nodes, but not NameNode and DataNodes. A compute only cluster is used to run MapReduce jobs on an external HDFS cluster.

Compute Workers Only Cluster: Contains only compute worker nodes, for example TaskTracker and NodeManager nodes, but not NameNodes and DataNodes. A compute workers only cluster is used to add more compute worker nodes to an existing Hadoop cluster.

HBase Only Cluster: Contains HBase Master, HBase RegionServer, and ZooKeeper nodes, but not NameNodes or DataNodes. Multiple HBase only clusters can use the same external HDFS cluster.

Customized Cluster: Uses a cluster specification file to create clusters using the same configuration as your previously created clusters. You can edit the cluster specification file to customize the cluster configuration.
Serengeti’s Default Hadoop Cluster Configuration
For basic Hadoop deployments, such as proof of concept projects, you can use Serengeti’s default Hadoop
cluster configuration for clusters that are created with the Command-Line Interface.
The resulting cluster deployment consists of the following nodes and virtual machines:
One master node virtual machine with NameNode and JobTracker services.
Three worker node virtual machines, each with DataNode and TaskTracker services.
One client node virtual machine containing the Hadoop client environment: the Hadoop client shell, Pig, and Hive.
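For example, a minimal sketch of creating such a default cluster; the command applies the defaults described above, and the cluster name is hypothetical (verify the syntax against "cluster Commands," on page 79).
cluster create --name myDefaultCluster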
Hadoop Distributions Supporting MapReduce v1 and MapReduce v2 (YARN)
If you use either Cloudera CDH4 or CDH5 Hadoop distributions, which support both MapReduce v1 and
MapReduce v2 (YARN), the default Hadoop cluster configurations are different. The default Hadoop cluster configuration for CDH4 is a MapReduce v1 cluster. The default Hadoop cluster configuration for CDH5 is a MapReduce v2 cluster. All other distributions support either MapReduce v1 or MapReduce v2 (YARN), but not both.
Default HBase Cluster Configuration for Serengeti
HBase is an open source distributed columnar database that uses MapReduce and HDFS to manage data.
You can use HBase to build big table applications.
To run HBase MapReduce jobs, configure the HBase cluster to include JobTracker nodes or TaskTracker nodes. When you create an HBase cluster with the CLI, according to the default Serengeti HBase template, the resulting cluster consists of the following nodes:
One master node, which runs the NameNode and HBaseMaster services.
Three ZooKeeper nodes, each running the ZooKeeper service.
Three data nodes, each running the DataNode and HBase RegionServer services.
One client node, from which you can run Hadoop or HBase jobs.
The default HBase cluster deployed by Serengeti does not contain Hadoop JobTracker or Hadoop TaskTracker daemons. To run an HBase MapReduce job, deploy a customized, nondefault HBase cluster.
About Cluster Topology
You can improve workload balance across your cluster nodes, and improve performance and throughput,
by specifying how Hadoop virtual machines are placed using topology awareness. For example, you can
have separate data and compute nodes, and improve performance and throughput by placing the nodes on
the same set of physical hosts.
To get maximum performance out of your big data cluster, configure your cluster so that it has awareness of the topology of your environment's host and network information. Hadoop performs better when it uses within-rack transfers, where more bandwidth is available, rather than off-rack transfers when assigning MapReduce tasks to nodes. HDFS can place replicas more intelligently to trade off performance and resilience. For example, if you have separate data and compute nodes, you can improve performance and throughput by placing the nodes on the same set of physical hosts.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the virtual
machine automatic migration of the cluster. Although this prevents vSphere from migrating the virtual
machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the
vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing
such management functions outside of the Big Data Extensions environment might break the placement
policy of the cluster, such as the number of instances per host and the group associations. Even if you do not
specify a placement policy, using vCenter Server to migrate clusters can break the default ROUNDROBIN
placement policy constraints.
You can specify the following topology awareness configurations.

Hadoop Virtualization Extensions (HVE): Enhanced cluster reliability and performance provided by refined Hadoop replica placement, task scheduling, and balancer policies. Hadoop clusters implemented on a virtualized infrastructure have full awareness of the topology on which they are running when using HVE. To use HVE, your Hadoop distribution must support HVE and you must create and upload a topology rack-hosts mapping file.

RACK_AS_RACK: Standard topology for Apache Hadoop distributions. Only rack and host information are exposed to Hadoop. To use RACK_AS_RACK, create and upload a server topology file.

HOST_AS_RACK: Simplified topology for Apache Hadoop distributions. To avoid placing all HDFS data block replicas on the same physical host, each physical host is treated as a rack. Because data block replicas are never placed on the same rack, this avoids the worst case scenario of a single host failure causing the complete loss of any data block. Use HOST_AS_RACK if your cluster uses a single rack, or if you do not have rack information with which to decide about topology configuration options.

None: No topology is specified.
Topology Rack-Hosts Mapping File
Rack-hosts mapping files are plain text files that associate logical racks with physical hosts. These files are
required to create clusters with HVE or RACK_AS_RACK topology.
The format for every line in a topology rack-hosts mapping file is:
rackname: hostname1, hostname2 ...
For example, to assign physical hosts a.b.foo.com and a.c.foo.com to rack1, and physical host c.a.foo.com to
rack2, include the following lines in your topology rack-hosts mapping file.
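rack1: a.b.foo.com, a.c.foo.com
rack2: c.a.foo.com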
The placementPolicies field in the cluster specification file controls how nodes are placed in the cluster.
If you specify values for both instancePerHost and groupRacks, there must be a sufficient number of
available hosts. To display the rack hosts information, use the topology list command.
The code shows an example placementPolicies field in a cluster specification file.
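The following sketch is reconstructed from the attribute definitions below; the rack names and the hadoop-master node group reference are hypothetical.
"placementPolicies": {
  "instancePerHost": 2,
  "groupRacks": {
    "type": "ROUNDROBIN",
    "racks": ["rack1", "rack2"]
  },
  "groupAssociations": [
    {
      "reference": "hadoop-master",
      "type": "STRICT"
    }
  ]
}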
instancePerHost (Optional): Number of virtual machine nodes to place for each physical ESXi host. This constraint is aimed at balancing the workload.

groupRacks (Optional): Method of distributing virtual machine nodes among the cluster's physical racks. Specify the following JSON strings:
type: Specify ROUNDROBIN, which selects candidates fairly and without priority.
racks: Which racks in the topology map to use.

groupAssociations (Optional): One or more target node groups with which this node group associates. Specify the following JSON strings:
reference: Target node group name.
type: Either STRICT or WEAK. STRICT places the node group on the target group's set or subset of ESXi hosts; if STRICT placement is not possible, the operation fails. WEAK attempts to place the node group on the target group's set or subset of ESXi hosts, but if that is not possible, uses an extra ESXi host.
Create a Cluster with Topology Awareness with the Serengeti Command-Line Interface
To achieve a balanced workload or to improve performance and throughput, you can control how Hadoop
virtual machines are placed by adding topology awareness to the Hadoop clusters. For example, you can
have separate data and compute nodes, and improve performance and throughput by placing the nodes on
the same set of physical hosts.
Prerequisites
Start the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Access the Serengeti CLI.
2. (Optional) Run the topology list command to view the list of available topologies.
topology list
3. (Optional) If you want the cluster to use HVE or RACK_AS_RACK topologies, create a topology rack-hosts mapping file and upload the file to the Serengeti Management Server.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD 1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail, or the cluster is created but does not function.
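To upload the mapping file and create the cluster with a topology, commands like the following can be used; this is a sketch, and the exact parameters are assumptions, so verify them against "topology Commands," on page 98, and "cluster Commands," on page 79.
topology upload --fileName topology_file_name
cluster create --name cluster_name --topology HVE --distro distro_name

About HBase Clusters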
HBase runs on top of HDFS and provides a fault-tolerant way of storing large quantities of sparse data.
Create a Default HBase Cluster with the Serengeti Command-Line Interface
You can use the Serengeti CLI to deploy HBase clusters on HDFS.
This task creates a default HBase cluster that does not contain the MapReduce framework. To run HBase MapReduce jobs, add JobTracker and TaskTracker or ResourceManager and NodeManager nodes to the default HBase cluster sample specification file, /opt/serengeti/samples/default_hbase_cluster.json, and then create a cluster using this specification file.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster create command, and specify the --type parameter's value as hbase.
cluster create --name cluster_name --type hbase
What to do next
After you deploy the cluster, you can access an HBase database by using several methods. See the VMware
vSphere Big Data Extensions Administrator's and User's Guide.
Create an HBase Only Cluster in Big Data Extensions
With Big Data Extensions, you can create an HBase only cluster, which contains only HBase Master, HBase RegionServer, and ZooKeeper nodes, but not NameNodes and DataNodes. The advantage of an HBase only cluster is that multiple HBase clusters can use the same external HDFS.
Procedure
1. Prerequisites for Creating an HBase Only Cluster on page 25
Before you can create an HBase only cluster, you must verify that your system meets all of the prerequisites.
2. Prepare the EMC Isilon OneFS as the External HDFS Cluster on page 25
If you use EMC Isilon OneFS as the external HDFS cluster to the HBase only cluster, you must create and configure users and user groups, and prepare your Isilon OneFS environment.
3. Create an HBase Only Cluster by Using the Serengeti Command-Line Interface on page 26
You can use the Serengeti CLI to create an HBase only cluster.
Prerequisites for Creating an HBase Only Cluster
Before you can create an HBase only cluster, you must verify that your system meets all of the prerequisites.
Prerequisites
Verify that you started the Serengeti vApp.
Verify that you have more than one distribution if you want to use a distribution other than the default distribution.
Verify that you have an existing HDFS cluster to use as the external HDFS cluster. To avoid conflicts between the HBase only cluster and the external HDFS cluster, the clusters should use the same Hadoop distribution and version.
If the external HDFS cluster was not created using Big Data Extensions, verify that the HDFS directory /hadoop/hbase, the group hadoop, and the following users exist in the external HDFS cluster: hdfs, hbase, and serengeti.
If you use the EMC Isilon OneFS as the external HDFS cluster, verify that your Isilon environment is prepared. For information about how to prepare your environment, see "Prepare the EMC Isilon OneFS as the External HDFS Cluster," on page 25.
Prepare the EMC Isilon OneFS as the External HDFS Cluster
If you use EMC Isilon OneFS as the external HDFS cluster to the HBase only cluster, you must create and
configure users and user groups, and prepare your Isilon OneFS environment.
Procedure
1. Log in to one of the Isilon HDFS nodes as user root.
2. Create the following users:
hdfs
hbase
serengeti
mapred
The yarn and mapred users should have write, read, and execute permissions to the entire exported HDFS directory.
3. Create the user group hadoop.
4. Create the directory tmp under the root HDFS directory.
5. Set the owner as hdfs:hadoop with the read and write permissions set as 777.
6. Create the directory hadoop under the root HDFS directory.
7. Set the owner as hdfs:hadoop with the read and write permissions set as 775.
8. Create the directory hbase under the directory hadoop.
9. Set the owner as hbase:hadoop with the read and write permissions set as 775.
10. Set the owner of the root HDFS directory as hdfs:hadoop.
Example: Configuring the EMC Isilon OneFS Environment
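A minimal sketch of steps 4 through 10, assuming the OneFS HDFS root is exported at /ifs (a hypothetical path; adjust it for your environment) and that the users and group from steps 2 and 3 already exist. User creation syntax varies by OneFS version and is omitted here.
# Assumes /ifs is the HDFS root directory exported by Isilon OneFS.
cd /ifs
mkdir tmp
chown hdfs:hadoop tmp
chmod 777 tmp
mkdir hadoop
chown hdfs:hadoop hadoop
chmod 775 hadoop
mkdir hadoop/hbase
chown hbase:hadoop hadoop/hbase
chmod 775 hadoop/hbase
chown hdfs:hadoop /ifs

Create an HBase Only Cluster by Using the Serengeti Command-Line Interface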
The /opt/serengeti/samples/hbase_only_cluster.json file is a sample specification file for HBase only
clusters. It contains the zookeeper, hbase_master, and hbase_regionserver roles, but not the
hadoop_namenode/hadoop_datanode role.
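To create the cluster, run the cluster create command with this sample specification file. A sketch follows; the exact parameters are assumptions, so verify them against "cluster Commands," on page 79.
cluster create --name cluster_name --distro distro_name --specFile /opt/serengeti/samples/hbase_only_cluster.json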
5. To verify that the cluster was created, run the cluster list command.
cluster list --name name
After the cluster is created, the system returns Cluster clustername created.
Create an HBase Cluster with vSphere HA Protection with the Serengeti Command-Line Interface
You can create HBase clusters with separated Hadoop NameNode and HBase Master roles. You can
configure vSphere HA protection for the Master roles.
Prerequisites
Start the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Create a cluster specification file to define the characteristics of the cluster, including the node group roles and vSphere HA protection.
In this example, the cluster has JobTracker and TaskTracker nodes, which let you run HBase MapReduce jobs. The Hadoop NameNode and HBase Master roles are separated, and both are protected by vSphere HA.
"instanceType" : "SMALL",
"storage" : {
"type" : "shared",
"sizeGB" : 50
},
"cpuNum" : 1,
"memCapacityMB" : 3748,
"haFlag" : "off",
"configuration" : {
}
}
],
// we suggest running convert-hadoop-conf.rb to generate the "configuration" section and paste the output here
"configuration" : {
"hadoop": {
"core-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
// note: any value (int, float, boolean, string) must be enclosed in double quotes; here is a sample:
// "io.file.buffer.size": "4096"
},
"hdfs-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
},
"mapred-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
},
"hadoop-env.sh": {
// "HADOOP_HEAPSIZE": "",
// "HADOOP_NAMENODE_OPTS": "",
// "HADOOP_DATANODE_OPTS": "",
// "HADOOP_SECONDARYNAMENODE_OPTS": "",
// "HADOOP_JOBTRACKER_OPTS": "",
// "HADOOP_TASKTRACKER_OPTS": "",
// "HADOOP_CLASSPATH": "",
// "JAVA_HOME": "",
// "PATH": ""
},
"log4j.properties": {
// "hadoop.root.logger": "DEBUG,DRFA",
// "hadoop.security.logger": "DEBUG,DRFA"
},
"fair-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
// "text": "the full content of fair-scheduler.xml in one line"
},
"capacity-scheduler.xml": {
// check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
},
Create an HBase Only Cluster with External Namenode HA HDFS Cluster
You can create an HBase only cluster with two namenodes in an active-passive HA configuration. The HA namenode provides a hot standby namenode that, in the event of a failure, can perform the role of the active namenode with no downtime.
Worker only clusters are not supported on the Ambari and Cloudera Manager application managers.