This document supports the version of each product listed and
supports all subsequent versions until the document is
replaced by a new edition. To check for more recent editions
of this document, see http://www.vmware.com/support/pubs.
EN-001536-00
VMware vSphere Big Data Extensions Command-Line Interface Guide
You can find the most up-to-date technical documentation on the VMware Web site at:
http://www.vmware.com/support/
The VMware Web site also provides the latest product updates.
If you have comments about this documentation, submit your feedback to:
Default HBase Cluster Configuration for Serengeti
About Cluster Topology
About HBase Clusters
About MapReduce Clusters
About Data Compute Clusters
About Customized Clusters
VMware, Inc.
4 Managing Hadoop and HBase Clusters
Stop and Start a Hadoop or HBase Cluster with the Serengeti Command-Line Interface
Scale Out a Hadoop or HBase Cluster with the Serengeti Command-Line Interface
Scale CPU and RAM with the Serengeti Command-Line Interface
Reconfigure a Big Data Cluster with the Serengeti Command-Line Interface
About Resource Usage and Elastic Scaling
Delete a Cluster by Using the Serengeti Command-Line Interface
About vSphere High Availability and vSphere Fault Tolerance
Reconfigure a Node Group with the Serengeti Command-Line Interface
Recover from Disk Failure with the Serengeti Command-Line Interface Client
5 Monitoring the Big Data Extensions Environment
View List of Application Managers by using the Serengeti Command-Line Interface
View Available Hadoop Distributions with the Serengeti Command-Line Interface
View Supported Distributions for All Application Managers by Using the Serengeti Command-Line Interface
View Configurations or Roles for Application Manager and Distribution by Using the Serengeti Command-Line Interface
View Provisioned Hadoop and HBase Clusters with the Serengeti Command-Line Interface
View Datastores with the Serengeti Command-Line Interface
View Networks with the Serengeti Command-Line Interface
View Resource Pools with the Serengeti Command-Line Interface
6 Cluster Specification Reference
Cluster Specification File Requirements
Cluster Definition Requirements
Annotated Cluster Specification File
Cluster Specification Attribute Definitions
White Listed and Black Listed Hadoop Attributes
Convert Hadoop XML Files to Serengeti JSON Files
7 Serengeti CLI Command Reference
appmanager Commands
cluster Commands
connect Command
datastore Commands
disconnect Command
distro list Command
fs Commands
hive script Command
mr Commands
network Commands
pig script Command
resourcepool Commands
topology Commands
Index
About This Book
VMware vSphere Big Data Extensions Command-Line Interface Guide describes how to use the Serengeti
Command-Line Interface (CLI) to manage the vSphere resources that you use to create Hadoop and HBase
clusters, and how to create, manage, and monitor Hadoop and HBase clusters with the VMware Serengeti™
CLI.
VMware vSphere Big Data Extensions Command-Line Interface Guide also describes how to perform Hadoop and
HBase operations with the Serengeti CLI, and provides cluster specification and Serengeti CLI command
references.
Intended Audience
This guide is for system administrators and developers who want to use Serengeti to deploy and manage
Hadoop clusters. To successfully work with Serengeti, you should be familiar with Hadoop and VMware
vSphere®.
VMware Technical Publications Glossary
VMware Technical Publications provides a glossary of terms that might be unfamiliar to you. For definitions
of terms as they are used in VMware technical documentation, go to
http://www.vmware.com/support/pubs.
1 Using the Serengeti Remote Command-Line Interface Client
The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management Server to
deploy, manage, and use Hadoop.
This chapter includes the following topics:
- “Access the Serengeti CLI By Using the Remote CLI Client”
- “Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client”
Access the Serengeti CLI By Using the Remote CLI Client
You can access the Serengeti Command-Line Interface (CLI) to perform Serengeti administrative tasks with
the Serengeti Remote CLI Client.
IMPORTANT You can run Hadoop commands from the Serengeti CLI only on a cluster running the Apache
Hadoop 1.2.1 distribution. To run Hadoop administrative commands, such as cfg, fs, mr, pig, and hive,
from the command line for clusters running other Hadoop distributions, use a Hadoop client node to run
these commands.
Prerequisites
- Use the VMware vSphere Web Client to log in to the VMware vCenter Server® on which you deployed
  the Serengeti vApp.
- Verify that the Serengeti vApp deployment was successful and that the Management Server is running.
- Verify that you have the correct password to log in to Serengeti CLI. See the VMware vSphere Big Data
  Extensions Administrator's and User's Guide.
  The Serengeti CLI uses its vCenter Server credentials.
- Verify that the Java Runtime Environment (JRE) is installed in your environment and that its location is
  in your path environment variable.
Procedure
1. Open a Web browser to connect to the Serengeti Management Server cli directory.
   http://ip_address/cli
2. Download the ZIP file for your version and build.
   The filename is in the format VMware-Serengeti-cli-version_number-build_number.ZIP.
3. Unzip the download.
   The download includes the following components.
   - The serengeti-cli-version_number JAR file, which includes the Serengeti Remote CLI Client.
   - The samples directory, which includes sample cluster configurations.
   - Libraries in the lib directory.
4. Open a command shell, and change to the directory where you unzipped the package.
5. Change to the cli directory, and run the following command to enter the Serengeti CLI.
   - For any language other than French or German, run the following command.
     java -jar serengeti-cli-version_number.jar
   - For French or German languages, which use code page 850 (CP 850) language encoding when
     running the Serengeti CLI from a Windows command console, run the following command.
You must run the connect host command every time you begin a CLI session, and again after the 30
minute session timeout. If you do not run this command, you cannot run any other commands.
a. Run the connect command.
   connect --host xx.xx.xx.xx:8443
b. At the prompt, type your user name, which might be different from your login credentials for the
   Serengeti Management Server.
   NOTE If you do not create a user name and password for the Serengeti Command-Line Interface
   Client, you can use the default vCenter Server administrator credentials. The Serengeti
   Command-Line Interface Client uses the vCenter Server login credentials with read permissions on
   the Serengeti Management Server.
c. At the prompt, type your password.
A command shell opens, and the Serengeti CLI prompt appears. You can use the help command to get help
with Serengeti commands and command syntax.
- To display a list of available commands, type help.
- To get help for a specific command, append the name of the command to the help command.
  help cluster create
- Press Tab to complete a command.
Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client
To perform troubleshooting or to run your management automation scripts, log in to Hadoop master,
worker, and client nodes with password-less SSH from the Serengeti Management Server using SSH client
tools such as SSH, PDSH, ClusterSSH, and Mussh.
You can use a user name and password authenticated login to connect to Hadoop cluster nodes over SSH.
All deployed nodes are password-protected with either a random password or a user-specified password
that was assigned when the cluster was created.
Prerequisites
Use the vSphere Web Client to log in to vCenter Server, and verify that the Serengeti Management Server
virtual machine is running.
Procedure
1. Right-click the Serengeti Management Server virtual machine and select Open Console.
   The password for the Serengeti Management Server appears.
   NOTE If the password scrolls off the console screen, press Ctrl+D to return to the command prompt.
2. Use the vSphere Web Client to log in to the Hadoop node.
   The password for the root user appears on the virtual machine console in the vSphere Web Client.
3. Change the Hadoop node’s password by running the set-password -u command.
   sudo /opt/serengeti/sbin/set-password -u
2 Managing the Big Data Extensions Environment by Using the Serengeti Command-Line Interface
You must manage your Big Data Extensions environment. If you chose not to add the resource pool,
datastore, and network when you deployed the Serengeti vApp, add these vSphere resources before you
create a Hadoop or HBase cluster. You must also add application managers if you want to use either
Ambari or Cloudera Manager to manage your Hadoop clusters. You can remove resources that you no
longer need.
This chapter includes the following topics:
- “About Application Managers”
- “Add a Resource Pool with the Serengeti Command-Line Interface”
- “Remove a Resource Pool with the Serengeti Command-Line Interface”
- “Add a Datastore with the Serengeti Command-Line Interface”
- “Remove a Datastore with the Serengeti Command-Line Interface”
- “Add a Network with the Serengeti Command-Line Interface”
- “Remove a Network with the Serengeti Command-Line Interface”
- “Reconfigure a Static IP Network with the Serengeti Command-Line Interface”
About Application Managers
You can use Cloudera Manager, Ambari, and the default application manager to provision and manage
clusters with VMware vSphere Big Data Extensions.
After you add a new Cloudera Manager or Ambari application manager to Big Data Extensions, you can
redirect your software management tasks, including monitoring and managing clusters, to that application
manager.
You can use an application manager to perform the following tasks:
- List all available vendor instances, supported distributions, and configurations or roles for a specific
  application manager and distribution.
- Create clusters.
- Monitor and manage services from the application manager console.
Check the documentation for your application manager for tool-specific requirements.
Restrictions
The following restrictions apply to Cloudera Manager and Ambari application managers:
- To add an application manager with HTTPS, use the FQDN instead of the URL.
- You cannot rename a cluster that was created with a Cloudera Manager or Ambari application
  manager.
- You cannot change services for a big data cluster from Big Data Extensions if the cluster was created
  with an Ambari or Cloudera Manager application manager.
- To change services, configurations, or both, you must make the changes manually from the application
  manager on the nodes.
  If you install new services, Big Data Extensions starts and stops the new services together with old
  services.
- If you use an application manager to change services and big data cluster configurations, those changes
  cannot be synced from Big Data Extensions. The nodes that you created with Big Data Extensions do
  not contain the new services or configurations.
Add an Application Manager by Using the Serengeti Command-Line Interface
To use either Cloudera Manager or Ambari application managers, you must add the application manager
and add server information to Big Data Extensions.
NOTE If you want to add a Cloudera Manager or Ambari application manager with HTTPS, you should use
the FQDN in place of the URL.
Application manager names can include only alphanumeric characters ([0-9, a-z, A-Z]) and the
following special characters: underscores, hyphens, and blank spaces.
You can use the optional description variable to include a description of the application manager
instance.
3. Enter your username and password at the prompt.
4. If you specified SSL, enter the file path of the SSL certificate at the prompt.
What to do next
To verify that the application manager was added successfully, run the appmanager list command.
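The naming rule for application managers given in this procedure can be checked before you run the command. The following Python sketch is illustrative only: the function name is hypothetical, and the pattern is an assumption derived from the rule as stated in this guide (ASCII letters, digits, underscores, hyphens, and blank spaces). The CLI itself might apply additional checks.

```python
import re

# Assumed pattern, derived from the naming rule stated in this guide:
# ASCII letters, digits, underscores, hyphens, and blank spaces only.
_NAME_PATTERN = re.compile(r"[0-9A-Za-z_\- ]+")


def is_valid_appmanager_name(name: str) -> bool:
    """Return True if the name uses only the permitted characters.

    Hypothetical helper; it is not part of the Serengeti CLI.
    """
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("cm_server-01", "my ambari", "ambari@lab"):
        print(candidate, is_valid_appmanager_name(candidate))
```

Running the check on a name before you pass it to the CLI avoids a rejected command caused by an unsupported character such as @ or a period.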
Modify an Application Manager by Using the Serengeti Command-Line Interface
You can modify the information for an application manager with the Serengeti CLI. For example, you can
change the manager server IP address if it is not a static IP, or you can upgrade the administrator account.
Prerequisites
Verify that you have at least one external application manager installed on your Big Data Extensions
environment.
Procedure
1. Access the Serengeti CLI.
2. Run the appmanager modify command.
   appmanager modify --name application_manager_name
   --url <http[s]://server:port>
Additional parameters are available for this command. For more information about this command, see
“appmanager modify Command,” on page 78.
View Supported Distributions for All Application Managers by Using the
Serengeti Command-Line Interface
You can list the Hadoop distributions that are supported on the application managers in the
Big Data Extensions environment to determine if a particular distribution is available on an application
manager in your Big Data Extensions environment.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1. Access the Serengeti CLI.
2. Run the appmanager list command.
   appmanager list --name application_manager_name [--distros]
If you do not include the --name parameter, the command returns a list of all the Hadoop distributions
that are supported on each of the application managers in the Big Data Extensions environment.
The command returns a list of all distributions that are supported for the application manager of the name
that you specify.
View Configurations or Roles for Application Manager and Distribution by
Using the Serengeti Command-Line Interface
You can use the appmanager list command to list the Hadoop configurations or roles for a specific
application manager and distribution.
The configuration list includes those configurations that you can use to configure the cluster in the cluster
specifications.
The role list contains the roles that you can use to create a cluster. You should not use unsupported roles to
create clusters in the application manager.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1. Access the Serengeti CLI.
2. Run the appmanager list command.
   appmanager list --name application_manager_name [--distro distro_name
   (--configurations | --roles) ]
The command returns a list of the Hadoop configurations or roles for a specific application manager and
distribution.
View List of Application Managers by Using the Serengeti Command-Line Interface
You can use the appmanager list command to list the application managers that are installed on the
Big Data Extensions environment.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1. Access the Serengeti CLI.
2. Run the appmanager list command.
   appmanager list
The command returns a list of all application managers that are installed on the Big Data Extensions
environment.
Delete an Application Manager by Using the Serengeti Command-Line Interface
You can use the Serengeti CLI to delete an application manager when you no longer need it.
The application manager that you delete must not contain clusters. If it does, the deletion fails.
Prerequisites
Verify that you have at least one external application manager installed on your Big Data Extensions
environment.
Procedure
1. Access the Serengeti CLI.
2. Run the appmanager delete command.
   appmanager delete --name application_manager_name
Add a Resource Pool with the Serengeti Command-Line Interface
You add resource pools to make them available for use by Hadoop clusters. Resource pools must be located
at the top level of a cluster. Nested resource pools are not supported.
When you add a resource pool to Big Data Extensions, it symbolically represents the actual vSphere resource
pool as recognized by vCenter Server. This symbolic representation lets you use the Big Data Extensions
resource pool name, instead of the full path of the resource pool in vCenter Server, in cluster specification
files.
NOTE After you add a resource pool to Big Data Extensions, do not rename the resource pool in vSphere. If
you rename it, you cannot perform Serengeti operations on clusters that use that resource pool.
Procedure
1. Access the Serengeti Command-Line Interface client.
2. Run the resourcepool add command.
The --vcrp parameter is optional.
This example adds a Serengeti resource pool named myRP to the vSphere rp1 resource pool that is
contained by the cluster1 vSphere cluster.
Remove a Resource Pool with the Serengeti Command-Line Interface
You can remove resource pools from Serengeti that are not in use by a Hadoop cluster. You remove resource
pools when you do not need them or if you want the Hadoop clusters you create in the Serengeti
Management Server to be deployed under a different resource pool. Removing a resource pool removes its
reference in vSphere. The resource pool is not deleted.
Procedure
1. Access the Serengeti Command-Line Interface client.
2. Run the resourcepool delete command.
If the command fails because the resource pool is referenced by a Hadoop cluster, you can use the
resourcepool list command to see which cluster is referencing the resource pool.
This example deletes the resource pool named myRP.
resourcepool delete --name myRP
Add a Datastore with the Serengeti Command-Line Interface
You can add shared and local datastores to the Serengeti server to make them available to Hadoop clusters.
Procedure
1. Access the Serengeti CLI.
2. Run the datastore add command.
This example adds a new, local storage datastore named myLocalDS. The --spec parameter’s value,
local*, is a wildcard specifying a set of vSphere datastores. All vSphere datastores whose names begin
with “local” are added and managed as a whole by Serengeti.
datastore add --name myLocalDS --spec local* --type LOCAL
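To preview which vSphere datastores a wildcard such as local* would select, you can apply the pattern to a list of datastore names. The following Python sketch assumes that the --spec wildcard behaves like a case-sensitive shell-style pattern, which this guide does not state explicitly. The function name and the sample datastore names are hypothetical.

```python
from fnmatch import fnmatchcase


def matching_datastores(spec, datastore_names):
    """Return the names that match a shell-style wildcard pattern.

    Assumption: the --spec value is matched like a case-sensitive
    shell glob. This helper is illustrative, not part of Serengeti.
    """
    return [name for name in datastore_names if fnmatchcase(name, spec)]


if __name__ == "__main__":
    # Hypothetical vSphere datastore names.
    names = ["local-esx01", "local-esx02", "shared-nfs01", "mylocal"]
    print(matching_datastores("local*", names))
    # prints ['local-esx01', 'local-esx02']
```

Only names that begin with "local" are selected. A name such as mylocal does not match because the wildcard follows the prefix.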
What to do next
After you add a datastore to Big Data Extensions, do not rename the datastore in vSphere. If you rename it,
you cannot perform Serengeti operations on clusters that use that datastore.
Remove a Datastore with the Serengeti Command-Line Interface
You can remove any datastore from Serengeti that is not referenced by any Hadoop clusters. Removing a
datastore removes only the reference to the vCenter Server datastore. The datastore itself is not deleted.
You remove datastores if you do not need them or if you want to deploy the Hadoop clusters that you create
in the Serengeti Management Server under a different datastore.
Procedure
1. Access the Serengeti CLI.
2. Run the datastore delete command.
If the command fails because the datastore is referenced by a Hadoop cluster, you can use the datastore
list command to see which cluster is referencing the datastore.
This example deletes the myDS datastore.
datastore delete --name myDS
Add a Network with the Serengeti Command-Line Interface
You add networks to Serengeti to make their IP addresses available to Hadoop clusters. A network is a port
group, as well as a means of accessing the port group through an IP address.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add the
network.
Procedure
1. Access the Serengeti CLI.
2. Run the network add command.
This example adds a network named myNW to the 10PG vSphere port group. Virtual machines that use
this network use DHCP to obtain the IP addresses.
network add --name myNW --portGroup 10PG --dhcp
This example adds a network named myNW to the 10PG vSphere port group. Hadoop nodes use
addresses in the 192.168.1.2-100 IP address range, the DNS server IP address is 10.111.90.2, the gateway
address is 192.168.1.1, and the subnet mask is 255.255.255.0.
After you add a network to Big Data Extensions, do not rename it in vSphere. If you rename the network,
you cannot perform Serengeti operations on clusters that use that network.
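Before you add a static IP network, it can help to confirm that the planned address range and gateway fall inside the subnet. The following Python sketch uses the values from the static IP example in this section. The function name is hypothetical, and the parsing of the 192.168.1.2-100 notation (a start address followed by the last octet of the end address) is an assumption about how that shorthand is meant to be read.

```python
import ipaddress


def expand_ip_range(spec):
    """Expand 'a.b.c.d-N' (or a single address) into IPv4Address objects.

    Assumption: N replaces the last octet of the start address.
    """
    start_text, _, end_octet = spec.partition("-")
    start = ipaddress.IPv4Address(start_text)
    if not end_octet:
        return [start]
    prefix = start_text.rsplit(".", 1)[0]
    end = ipaddress.IPv4Address(f"{prefix}.{end_octet}")
    if end < start:
        raise ValueError("range end precedes range start")
    return [ipaddress.IPv4Address(i) for i in range(int(start), int(end) + 1)]


if __name__ == "__main__":
    addresses = expand_ip_range("192.168.1.2-100")
    subnet = ipaddress.IPv4Network("192.168.1.0/255.255.255.0")
    gateway = ipaddress.IPv4Address("192.168.1.1")

    print(len(addresses))                       # prints 99
    print(all(a in subnet for a in addresses))  # prints True
    print(gateway in subnet)                    # prints True
    print(gateway in addresses)                 # prints False
```

The last check confirms that the gateway address is not part of the range handed out to Hadoop nodes.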
Remove a Network with the Serengeti Command-Line Interface
You can remove networks from Serengeti that are not referenced by any Hadoop clusters. Removing an
unused network frees the IP addresses for reuse.
Procedure
1. Access the Serengeti CLI.
2. Run the network delete command.
   network delete --name network_name
If the command fails because the network is referenced by a Hadoop cluster, you can use the network
list --detail command to see which cluster is referencing the network.
Reconfigure a Static IP Network with the Serengeti Command-Line Interface
You can reconfigure a Serengeti static IP network by adding IP address segments to it. You might need to
add IP address segments so that there is enough capacity for a cluster that you want to create.
If the IP range that you specify includes IP addresses that are already in the network, Serengeti ignores the
duplicated addresses. The remaining addresses in the specified range are added to the network. If the
network is already used by a cluster, the cluster can use the new IP addresses after you add them to the
network. If only part of the IP range is used by a cluster, the unused IP address can be used when you create
a new cluster.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add the
network.
Procedure
1. Access the Serengeti CLI.
2. Run the network modify command.
This example adds IP addresses from 192.168.1.2 to 192.168.1.100 to a network named myNetwork.
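The way duplicated addresses are ignored when you add a segment can be modeled as a set union. The following Python sketch is a simplified model of the behavior described in this section, not Serengeti's implementation. The function name and the starting contents of the address pool are hypothetical.

```python
import ipaddress


def add_segment(existing, first, last):
    """Add the inclusive range first..last to a pool of addresses.

    Returns (updated_pool, newly_added). Addresses already in the pool
    are ignored, which mirrors the behavior described in this section.
    """
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    segment = {ipaddress.IPv4Address(i) for i in range(start, end + 1)}
    newly_added = segment - existing
    return existing | segment, newly_added


if __name__ == "__main__":
    # Hypothetical starting pool: 192.168.1.2 through 192.168.1.50.
    pool = {ipaddress.IPv4Address(f"192.168.1.{i}") for i in range(2, 51)}
    pool, added = add_segment(pool, "192.168.1.2", "192.168.1.100")
    print(len(added))  # prints 50
    print(len(pool))   # prints 99
```

In this model, the 49 addresses that were already in the network are skipped, and only 192.168.1.51 through 192.168.1.100 are new.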
3 Creating Hadoop and HBase Clusters
With Big Data Extensions, you can create and deploy Hadoop and HBase clusters. A big data cluster is a type of
computational cluster designed for storing and analyzing large amounts of unstructured data in a
distributed computing environment.
Restrictions
- When you create an HBase only cluster, you must use the default application manager because the
  other application managers do not support HBase only clusters.
- You cannot rename a cluster that was created with Cloudera Manager or Ambari application manager.
Requirements
The resource requirements are different for clusters created with the Serengeti Command-Line Interface and
the Big Data Extensions plug-in for the vSphere Web Client because the clusters use different default
templates. The default clusters created by using the Serengeti CLI are targeted for Project Serengeti users
and proof-of-concept applications, and are smaller than the Big Data Extensions plug-in templates, which
are targeted for larger deployments for commercial use.
Some deployment configurations require more resources than other configurations. For example, if you
create a Greenplum HD 1.2 cluster, you cannot use the small size virtual machine. If you create a default
MapR or Greenplum HD cluster by using the Serengeti CLI, at least 550 GB of storage and 55 GB of memory
are recommended. For other Hadoop distributions, at least 350 GB of storage and 35 GB of memory are
recommended.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the cluster's
virtual machine automatic migration. Although this prevents vSphere from automatically migrating the
virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using
the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters.
Performing such management functions outside of the Big Data Extensions environment can make it
impossible for you to perform some Big Data Extensions operations, such as disk failure recovery.
Passwords must be from 8 to 128 characters long and can include only
alphanumeric characters ([0-9, a-z, A-Z]) and the following special characters: _ @ # $ % ^ & *.
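The password requirements above can be expressed as a single pattern. The following Python sketch is illustrative only: the function name is hypothetical, and the pattern is an assumption derived from the rule as written in this guide. Big Data Extensions might enforce additional constraints that are not listed here.

```python
import re

# Assumed pattern: 8 to 128 characters drawn from ASCII letters, digits,
# and the special characters _ @ # $ % ^ & * listed in this guide.
_PASSWORD_PATTERN = re.compile(r"[0-9A-Za-z_@#$%^&*]{8,128}")


def is_valid_password(password: str) -> bool:
    """Return True if the password satisfies the stated requirements.

    Hypothetical helper; it is not part of Big Data Extensions.
    """
    return _PASSWORD_PATTERN.fullmatch(password) is not None


if __name__ == "__main__":
    for candidate in ("Passw0rd_", "short1", "bad-pass1"):
        print(candidate, is_valid_password(candidate))
```

The third sample fails because the hyphen is not in the list of permitted special characters, even though the length is acceptable.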
This chapter includes the following topics:
- “About Hadoop and HBase Cluster Deployment Types”
- “Serengeti’s Default Hadoop Cluster Configuration”
- “Default HBase Cluster Configuration for Serengeti”
- “About Cluster Topology”
- “About HBase Clusters”
- “About MapReduce Clusters”
- “About Data Compute Clusters”
- “About Customized Clusters”
About Hadoop and HBase Cluster Deployment Types
With Big Data Extensions, you can create and use several types of big data clusters.
You can create the following types of clusters.
Basic Hadoop Cluster
Simple Hadoop deployment for proof of concept projects and other small-scale data processing tasks.
The Basic Hadoop cluster contains HDFS and the MapReduce framework. The MapReduce framework
processes problems in parallel across huge datasets in the HDFS.
HBase Cluster
Runs on top of HDFS and provides a fault-tolerant way of storing large quantities of sparse data.
Data and Compute Separation Cluster
Separates the data and compute nodes, or contains compute nodes only. In this type of cluster, the
data node and compute node are not on the same virtual machine.
Compute Only Cluster
Contains only compute nodes, for example, JobTracker, TaskTracker, ResourceManager, and
NodeManager nodes, but not NameNodes and DataNodes. A compute only cluster is used to run
MapReduce jobs on an external HDFS cluster.
Compute Workers Only Cluster
Contains only compute worker nodes, for example, TaskTracker and NodeManager nodes, but not
NameNodes and DataNodes. A compute workers only cluster is used to add more compute worker
nodes to an existing Hadoop cluster.
HBase Only Cluster
Contains HBase Master, HBase RegionServer, and ZooKeeper nodes, but not NameNodes or DataNodes.
Multiple HBase only clusters can use the same external HDFS cluster.
Customized Cluster
Uses a cluster specification file to create clusters using the same configuration as your previously
created clusters. You can edit the cluster specification file to customize the cluster configuration.
Serengeti’s Default Hadoop Cluster Configuration
For basic Hadoop deployments, such as proof of concept projects, you can use Serengeti’s default Hadoop
cluster configuration for clusters that are created with the Command-Line Interface.
The resulting cluster deployment consists of the following nodes and virtual machines:
- One master node virtual machine with NameNode and JobTracker services.
- Three worker node virtual machines, each with DataNode and TaskTracker services.
- One client node virtual machine containing the Hadoop client environment: the Hadoop client shell,
  Pig, and Hive.
Hadoop Distributions Supporting MapReduce v1 and MapReduce v2 (YARN)
If you use either Cloudera CDH4 or CDH5 Hadoop distributions, which support both MapReduce v1 and
MapReduce v2 (YARN), the default Hadoop cluster configurations are different. The default Hadoop cluster
configuration for CDH4 is a MapReduce v1 cluster. The default Hadoop cluster configuration for CDH5 is a
MapReduce v2 cluster. All other distributions support either MapReduce v1 or MapReduce v2 (YARN), but
not both.
Default HBase Cluster Configuration for Serengeti
HBase is an open source distributed columnar database that uses MapReduce and HDFS to manage data.
You can use HBase to build big table applications.
To run HBase MapReduce jobs, configure the HBase cluster to include JobTracker nodes or TaskTracker
nodes. When you create an HBase cluster with the CLI, according to the default Serengeti HBase template,
the resulting cluster consists of the following nodes:
- One master node, which runs the NameNode and HBaseMaster services.
- Three ZooKeeper nodes, each running the ZooKeeper service.
- Three data nodes, each running the DataNode and HBase RegionServer services.
- One client node, from which you can run Hadoop or HBase jobs.
The default HBase cluster deployed by Serengeti does not contain Hadoop JobTracker or Hadoop
TaskTracker daemons. To run an HBase MapReduce job, deploy a customized, nondefault HBase cluster.
About Cluster Topology
You can improve workload balance across your cluster nodes, and improve performance and throughput,
by specifying how Hadoop virtual machines are placed using topology awareness. For example, you can
have separate data and compute nodes, and improve performance and throughput by placing the nodes on
the same set of physical hosts.
To get maximum performance out of your big data cluster, configure your cluster so that it has awareness of
the topology of your environment's host and network information. Hadoop performs better when it prefers
within-rack transfers, where more bandwidth is available, to off-rack transfers when assigning MapReduce
tasks to nodes. HDFS can place replicas more intelligently to trade off performance and resilience. For
example, if you have separate data and compute nodes, you can improve performance and throughput by
placing the nodes on the same set of physical hosts.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the virtual
machine automatic migration of the cluster. Although this prevents vSphere from migrating the virtual
machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the
vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing
such management functions outside of the Big Data Extensions environment might break the placement
policy of the cluster, such as the number of instances per host and the group associations. Even if you do not
specify a placement policy, using vCenter Server to migrate clusters can break the default ROUNDROBIN
placement policy constraints.
You can specify the following topology awareness configurations.
Hadoop Virtualization Extensions (HVE)
Enhanced cluster reliability and performance provided by refined Hadoop
replica placement, task scheduling, and balancer policies. Hadoop clusters
implemented on a virtualized infrastructure have full awareness of the
topology on which they are running when using HVE.
To use HVE, your Hadoop distribution must support HVE and you must
create and upload a topology rack-hosts mapping file.
RACK_AS_RACK
Standard topology for Apache Hadoop distributions. Only rack and host
information are exposed to Hadoop. To use RACK_AS_RACK, create and
upload a server topology file.
HOST_AS_RACK
Simplified topology for Apache Hadoop distributions. To avoid placing all
HDFS data block replicas on the same physical host, each physical host is
treated as a rack. Because all replicas of a data block are never placed on the same rack, this
avoids the worst case scenario of a single host failure causing the complete
loss of any data block.
Use HOST_AS_RACK if your cluster uses a single rack, or if you do not have
rack information with which to decide about topology configuration options.
None
No topology is specified.
Topology Rack-Hosts Mapping File
Rack-hosts mapping files are plain text files that associate logical racks with physical hosts. These files are
required to create clusters with HVE or RACK_AS_RACK topology.
The format for every line in a topology rack-hosts mapping file is:
rackname: hostname1, hostname2 ...
For example, to assign physical hosts a.b.foo.com and a.c.foo.com to rack1, and physical host c.a.foo.com to
rack2, include the following lines in your topology rack-hosts mapping file.
rack1: a.b.foo.com, a.c.foo.com
rack2: c.a.foo.com
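As an illustration of the mapping format, the following Python sketch parses rack-hosts lines into a dictionary so that a file can be checked before it is uploaded. It is not part of the Serengeti tooling; the function name parse_rack_hosts is hypothetical, and the sketch assumes one rack per line with comma-separated host names.

```python
def parse_rack_hosts(text):
    """Parse 'rackname: host1, host2' lines into a dict of rack -> list of hosts."""
    racks = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        rack, sep, hosts = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in line: {raw!r}")
        racks[rack.strip()] = [h.strip() for h in hosts.split(",") if h.strip()]
    return racks


# The two racks from the example in the text.
mapping_text = """\
rack1: a.b.foo.com, a.c.foo.com
rack2: c.a.foo.com
"""

racks = parse_rack_hosts(mapping_text)
print(racks)
```

A line without a colon raises ValueError, which catches the most likely formatting mistake early.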
The placementPolicies field in the cluster specification file controls how nodes are placed in the cluster.
If you specify values for both instancePerHost and groupRacks, there must be a sufficient number of
available hosts. To display the rack hosts information, use the topology list command.
The code shows an example placementPolicies field in a cluster specification file.
instancePerHost (Optional)
    Number of virtual machine nodes to place for each physical ESXi host. This
    constraint is aimed at balancing the workload.
groupRacks (Optional)
    Method of distributing virtual machine nodes among the cluster’s physical
    racks. Specify the following JSON strings:
    ■ type. Specify ROUNDROBIN, which selects candidates fairly and without priority.
    ■ racks. Which racks in the topology map to use.
groupAssociations (Optional)
    One or more target node groups with which this node group associates.
    Specify the following JSON strings:
    ■ reference. Target node group name.
    ■ type:
        ■ STRICT. Place the node group on the target group’s set or subset of
          ESXi hosts. If STRICT placement is not possible, the operation fails.
        ■ WEAK. Attempt to place the node group on the target group’s set or
          subset of ESXi hosts, but if that is not possible, use an extra ESXi host.
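The fields described above can be sketched as a Python dictionary that is serialized to JSON. This is a minimal illustration, not the guide's own sample: the exact nesting (a list of racks, a list of association objects), the node group name "data", and the helper check_placement are assumptions made for the example.

```python
import json

# Hypothetical policy for a compute node group; "data" is an assumed
# name of another node group in the same specification file.
placement_policies = {
    "instancePerHost": 2,
    "groupRacks": {
        "type": "ROUNDROBIN",
        "racks": ["rack1", "rack2", "rack3"],
    },
    "groupAssociations": [
        {"reference": "data", "type": "STRICT"},
    ],
}

ALLOWED_ASSOCIATION_TYPES = {"STRICT", "WEAK"}


def check_placement(policies):
    """Return a list of problems found in a placementPolicies dict."""
    problems = []
    per_host = policies.get("instancePerHost")
    if per_host is not None and (not isinstance(per_host, int) or per_host < 1):
        problems.append("instancePerHost must be a positive integer")
    racks = policies.get("groupRacks")
    if racks is not None and racks.get("type") != "ROUNDROBIN":
        problems.append("groupRacks.type must be ROUNDROBIN")
    for assoc in policies.get("groupAssociations", []):
        if assoc.get("type") not in ALLOWED_ASSOCIATION_TYPES:
            problems.append(f"unknown association type: {assoc.get('type')}")
    return problems


print(json.dumps({"placementPolicies": placement_policies}, indent=2))
print(check_placement(placement_policies))
```

All three fields are optional, so the check only reports a problem for a field that is present and carries a value the table does not describe.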
Create a Cluster with Topology Awareness with the Serengeti Command-Line
Interface
To achieve a balanced workload or to improve performance and throughput, you can control how Hadoop
virtual machines are placed by adding topology awareness to the Hadoop clusters. For example, you can
have separate data and compute nodes, and improve performance and throughput by placing the nodes on
the same set of physical hosts.
Prerequisites
■ Start the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 (Optional) Run the topology list command to view the list of available topologies.
topology list
3 (Optional) If you want the cluster to use the HVE or RACK_AS_RACK topology, create a topology
rack-hosts mapping file and upload the file to the Serengeti Management Server.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. If
the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation
process might fail or the cluster is created but does not function.
About HBase Clusters
HBase runs on top of HDFS and provides a fault-tolerant way of storing large quantities of sparse data.
Create a Default HBase Cluster with the Serengeti Command-Line Interface
You can use the Serengeti CLI to deploy HBase clusters on HDFS.
This task creates a default HBase cluster, which does not contain the MapReduce framework. To run HBase
MapReduce jobs, add JobTracker and TaskTracker or ResourceManager and NodeManager nodes to the
default HBase cluster sample specification file /opt/serengeti/samples/default_hbase_cluster.json, then
create a cluster using this specification file.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command, and specify the --type parameter’s value as hbase.
cluster create --name cluster_name --type hbase
What to do next
After you deploy the cluster, you can access an HBase database by using several methods. See the VMware
vSphere Big Data Extensions Administrator's and User's Guide.
Create an HBase Only Cluster in Big Data Extensions
With Big Data Extensions, you can create an HBase only cluster, which contains only HBase Master, HBase
RegionServer, and Zookeeper nodes, but not Namenodes and Datanodes. The advantage of having an
HBase only cluster is that multiple HBase clusters can use the same external HDFS.
Procedure
1 Prerequisites for Creating an HBase Only Cluster
Before you can create an HBase only cluster, you must verify that your system meets all of the
prerequisites.
2 Prepare the EMC Isilon OneFS as the External HDFS Cluster
If you use EMC Isilon OneFS as the external HDFS cluster to the HBase only cluster, you must create
and configure users and user groups, and prepare your Isilon OneFS environment.
3 Create an HBase Only Cluster by Using the Serengeti Command-Line Interface
You can use the Serengeti CLI to create an HBase only cluster.
Prerequisites for Creating an HBase Only Cluster
Before you can create an HBase only cluster, you must verify that your system meets all of the prerequisites.
Prerequisites
■ Verify that you started the Serengeti vApp.
■ Verify that you have more than one distribution if you want to use a distribution other than the default
  distribution.
■ Verify that you have an existing HDFS cluster to use as the external HDFS cluster.
  To avoid conflicts between the HBase only cluster and the external HDFS cluster, the clusters should
  use the same Hadoop distribution and version.
■ If the external HDFS cluster was not created using Big Data Extensions, verify that the HDFS
  directory /hadoop/hbase, the group hadoop, and the following users exist in the external HDFS cluster:
    ■ hdfs
    ■ hbase
    ■ serengeti
■ If you use the EMC Isilon OneFS as the external HDFS cluster, verify that your Isilon environment is
  prepared.
  For information about how to prepare your environment, see “Prepare the EMC Isilon OneFS as the
  External HDFS Cluster.”
Prepare the EMC Isilon OneFS as the External HDFS Cluster
If you use EMC Isilon OneFS as the external HDFS cluster to the HBase only cluster, you must create and
configure users and user groups, and prepare your Isilon OneFS environment.
Procedure
1 Log in to one of the Isilon HDFS nodes as user root.
2 Create the users.
■ hdfs
■ hbase
■ serengeti
■ mapred
The yarn and mapred users should have write, read, and execute permissions to the entire exported
HDFS directory.
3 Create the user group hadoop.
4 Create the directory tmp under the root HDFS directory.
5 Set the owner as hdfs:hadoop with the permissions set to 777.
6 Create the directory hadoop under the root HDFS directory.
7 Set the owner as hdfs:hadoop with the permissions set to 775.
8 Create the directory hbase under the directory hadoop.
9 Set the owner as hbase:hadoop with the permissions set to 775.
10 Set the owner of the root HDFS directory as hdfs:hadoop.
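The directory layout that the steps above produce can be summarized in a short Python sketch. It only restates the owners and octal modes from the procedure and decodes each mode into the symbolic form that a directory listing would show; it does not run any Isilon or HDFS command, and the layout list and describe helper are names chosen for this illustration.

```python
import stat

# Layout from the steps above; paths are relative to the root HDFS directory.
# A mode of None means the procedure does not state one.
layout = [
    ("/", "hdfs:hadoop", None),
    ("/tmp", "hdfs:hadoop", 0o777),
    ("/hadoop", "hdfs:hadoop", 0o775),
    ("/hadoop/hbase", "hbase:hadoop", 0o775),
]


def describe(mode):
    """Render an octal mode as the symbolic string used for a directory."""
    if mode is None:
        return "(not specified)"
    return stat.filemode(stat.S_IFDIR | mode)


for path, owner, mode in layout:
    print(f"{path:15} {owner:14} {describe(mode)}")
```

Mode 777 grants read, write, and execute to everyone, while 775 withholds write access from users outside the owning user and group.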
Example: Configuring the EMC Isilon OneFS Environment
The /opt/serengeti/samples/hbase_only_cluster.json file is a sample specification file for HBase only
clusters. It contains the zookeeper, hbase_master, and hbase_regionserver roles, but not the
hadoop_namenode or hadoop_datanode roles.
5 To verify that the cluster was created, run the cluster list command.
cluster list --name name
After the cluster is created, the system returns Cluster clustername created.
Create an HBase Cluster with vSphere HA Protection with the Serengeti
Command-Line Interface
You can create HBase clusters with separated Hadoop NameNode and HBase Master roles. You can
configure vSphere HA protection for the Master roles.
Prerequisites
■ Start the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the characteristics of the cluster, including the node group
roles and vSphere HA protection.
In this example, the cluster has JobTracker and TaskTracker nodes, which let you run HBase
MapReduce jobs. The Hadoop NameNode and HBase Master roles are separated, and both are
protected by vSphere HA.
      "instanceType" : "SMALL",
      "storage" : {
        "type" : "shared",
        "sizeGB" : 50
      },
      "cpuNum" : 1,
      "memCapacityMB" : 3748,
      "haFlag" : "off",
      "configuration" : {
      }
    }
  ],
  // we suggest running convert-hadoop-conf.rb to generate "configuration" section and paste the output here
  "configuration" : {
    "hadoop": {
      "core-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/stable/coredefault.html
        // note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a sample:
        // "io.file.buffer.size": "4096"
      },
      "hdfs-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/stable/hdfsdefault.html
      },
      "mapred-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/stable/mapreddefault.html
      },
      "hadoop-env.sh": {
        // "HADOOP_HEAPSIZE": "",
        // "HADOOP_NAMENODE_OPTS": "",
        // "HADOOP_DATANODE_OPTS": "",
        // "HADOOP_SECONDARYNAMENODE_OPTS": "",
        // "HADOOP_JOBTRACKER_OPTS": "",
        // "HADOOP_TASKTRACKER_OPTS": "",
        // "HADOOP_CLASSPATH": "",
        // "JAVA_HOME": "",
        // "PATH": ""
      },
      "log4j.properties": {
        // "hadoop.root.logger": "DEBUG,DRFA",
        // "hadoop.security.logger": "DEBUG,DRFA"
      },
      "fair-scheduler.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
        // "text": "the full content of fair-scheduler.xml in one line"
      },
      "capacity-scheduler.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
      },
Create an HBase Only Cluster with External Namenode HA HDFS Cluster
You can create an HBase only cluster with two namenodes in an active-passive HA configuration. The HA
namenode provides a hot standby namenode that, in the event of a failure, can perform the role of the
active namenode with no downtime.
■ Worker only clusters are not supported on Ambari and Cloudera Manager application managers.
■ MapReduce v1 worker only clusters and HBase only clusters created using the MapR distribution are
  not supported.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 To define the characteristics of the new cluster, make a copy of the following cluster specification
2 Replace hdfs://hostname-of-namenode:8020 in this spec file with the namenode uniform resource identifier
(URI) of the external namenode HA HDFS cluster. The namenode URI is the value of the fs.defaultFS
parameter in the core-site.xml of the external cluster.
3 Change the configuration section of the HBase only cluster specification file as shown in the following
example. All the values can be found in hdfs-site.xml of the external cluster.
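The procedure takes the namenode URI from the fs.defaultFS parameter in core-site.xml. As a rough illustration, the following Python sketch reads that value from the XML text. It is not part of the Serengeti tooling: the read_property helper is hypothetical, and the sample file content, including the name nameservice1, is invented for the example.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the <value> of the named <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Hypothetical core-site.xml content from an external cluster.
core_site = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice1</value>
  </property>
</configuration>
"""

namenode_uri = read_property(core_site, "fs.defaultFS")
print(namenode_uri)
```

The returned string is what would replace hdfs://hostname-of-namenode:8020 in the specification file.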
About MapReduce Clusters
MapReduce is a framework for processing problems in parallel across huge data sets. The MapReduce
framework distributes a number of operations on the data set to each node in the network.
Create a MapReduce v2 (YARN) Cluster by Using the Serengeti Command-Line
Interface
You can create MapReduce v2 (YARN) clusters if you want to create a cluster that separates the resource
management and processing components.
To create a MapReduce v2 (YARN) cluster, create a cluster specification file modeled after
the /opt/serengeti/samples/default_hadoop_yarn_cluster.json file, and specify the --specFile parameter
and your cluster specification file in the cluster create ... command.
Prerequisites
■ Start the Big Data Extensions vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create ... command.
This example creates a customized MapReduce v2 cluster using the CDH4 distribution according to the
sample cluster specification file default_hadoop_yarn_cluster.json.
Create a MapReduce v1 Worker Only Cluster with External Namenode HA HDFS
Cluster
You can create a MapReduce v1 worker only cluster with two namenodes in an active-passive HA
configuration. The HA namenode provides a hot standby namenode that, in the event of a failure, can
perform the role of the active namenode with no downtime.
The following restrictions apply to this task:
■ Worker only clusters are not supported on Ambari and Cloudera Manager application managers.
■ You cannot use a MapR distribution to create MapReduce v1 worker only clusters and HBase only
  clusters.
Prerequisites
■ Start the Big Data Extensions vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ Ensure that you have an external Namenode HA HDFS cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 To define the characteristics of the new cluster, open the following cluster specification file to
2 Replace hdfs://hostname-of-namenode:8020 in this spec file with the namenode uniform resource identifier
(URI) of the external namenode HA HDFS cluster. The namenode URI is the value of the fs.defaultFS
parameter in the core-site.xml of the external cluster.
3 Replace the hostname-of-jobtracker in the specification file with the FQDN or IP address of the JobTracker
in the external cluster.
4 Change the configuration section of the MapReduce Worker only cluster specification file as shown in
the following example. All the values can be found in hdfs-site.xml of the external cluster.
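The guide does not list here which hdfs-site.xml values to copy. As a sketch under that caveat, the following Python code collects the properties that a standard HDFS NameNode HA setup normally defines and nests them the way a specification file nests per-file Hadoop settings. The property-name prefixes come from general Hadoop HA configuration, not from this guide, and the host names and the nameservice ns1 are invented.

```python
import json
import xml.etree.ElementTree as ET

# Property-name prefixes that HDFS NameNode HA normally uses (an assumption
# about the external cluster, not a list taken from this guide).
HA_PREFIXES = (
    "dfs.nameservices",
    "dfs.ha.namenodes.",
    "dfs.namenode.rpc-address.",
    "dfs.client.failover.proxy.provider.",
)


def ha_properties(xml_text):
    """Collect the HA-related name/value pairs from an hdfs-site.xml document."""
    found = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name", default="")
        if name.startswith(HA_PREFIXES):
            found[name] = prop.findtext("value", default="")
    return found


# Hypothetical excerpt of the external cluster's hdfs-site.xml.
hdfs_site = """<configuration>
  <property><name>dfs.nameservices</name><value>ns1</value></property>
  <property><name>dfs.ha.namenodes.ns1</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.rpc-address.ns1.nn1</name><value>nn1.example.com:8020</value></property>
  <property><name>dfs.namenode.rpc-address.ns1.nn2</name><value>nn2.example.com:8020</value></property>
  <property><name>dfs.replication</name><value>3</value></property>
</configuration>
"""

ha = ha_properties(hdfs_site)
# Values stay strings, matching the rule that every value is double-quoted.
spec_fragment = {"configuration": {"hadoop": {"hdfs-site.xml": ha}}}
print(json.dumps(spec_fragment, indent=2))
```

Compare the printed fragment with the sample specification file before use, because the external cluster might define further HA settings that this prefix list does not cover.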
Create a MapReduce v2 Worker Only Cluster with External Namenode HA HDFS
Cluster
You can create a MapReduce v2 (YARN) worker only cluster with two namenodes in an active-passive HA
configuration. The HA namenode provides a hot standby namenode that, in the event of a failure, can
perform the role of the active namenode with no downtime.
The following restrictions apply to this task:
■ Worker only clusters are not supported on Ambari and Cloudera Manager application managers.
■ You cannot use a MapR distribution to deploy MapReduce v1 worker only clusters and HBase only
  clusters.
Prerequisites
■ Start the Big Data Extensions vApp.
■ Ensure that you have an external Namenode HA HDFS cluster.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 To define the characteristics of the new cluster, open the following cluster specification file to
2 Replace hdfs://hostname-of-namenode:8020 in this spec file with the namenode uniform resource identifier
(URI) of the external namenode HA HDFS cluster. The namenode URI is the value of the fs.defaultFS
parameter in the core-site.xml of the external cluster.
3 Replace the hostname-of-resourcemanager in the specification file with the FQDN or IP address of the
ResourceManager in the external cluster.
4 Change the configuration section of the YARN Worker only cluster specification file as shown in the
following example. All the values can be found in hdfs-site.xml of the external cluster.
About Data Compute Clusters
You can separate the data and compute nodes in a Hadoop cluster, and you can control how nodes are
placed on the vSphere ESXi hosts in your environment.
You can create a compute-only cluster to run MapReduce jobs. Compute-only clusters run only MapReduce
services that read data from external HDFS clusters and that do not need to store data.
Ambari and Cloudera Manager application managers do not support data-compute separation and
compute-only clusters.
Create a Data-Compute Separated Cluster with Topology Awareness and
Placement Constraints
You can create clusters with separate data and compute nodes, and define topology and placement policy
constraints to distribute the nodes among the physical racks and the virtual machines.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the virtual
machine automatic migration of the cluster. Although this prevents vSphere from migrating the virtual
machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the
vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing
such management functions outside of the Big Data Extensions environment might break the placement
policy of the cluster, such as the number of instances per host and the group associations. Even if you do not
specify a placement policy, using vCenter Server to migrate clusters can break the default ROUNDROBIN
placement policy constraints.
Prerequisites
■ Start the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
■ Create a rack-host mapping information file.
■ Upload the rack-host file to the Serengeti server with the topology upload command.
Procedure
1 Create a cluster specification file to define the cluster's characteristics, including the node groups,
topology, and placement constraints.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. If
the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation
process might fail or the cluster is created but does not function.
In this example, the cluster has groupAssociations and instancePerHost constraints for the compute
node group, and a groupRacks constraint for the data node group.
Four data nodes and eight compute nodes are placed on the same four ESXi hosts, which are fairly
selected from rack1, rack2, and rack3. Each ESXi host has one data node and two compute nodes. As
defined for the compute node group, compute nodes are placed only on ESXi hosts that have data
nodes.
This cluster definition requires that you configure datastores and resource pools for at least four hosts,
and that there is sufficient disk space for Serengeti to perform the necessary placements during
deployment.
Create a Data-Compute Separated Cluster with No Node Placement Constraints
You can create a cluster with separate data and compute nodes, without node placement constraints.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. If
the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation
process might fail or the cluster is created but does not function.
In this example, the cluster has separate data and compute nodes, without node placement constraints.
Four data nodes and eight compute nodes are created and put into individual virtual machines. The
number of nodes is configured by the instanceNum attribute.
Create a Data-Compute Separated Cluster with Placement Policy Constraints
You can create a cluster with separate data and compute nodes, and define placement policy constraints to
distribute the nodes among the virtual machines as you want.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the virtual
machine automatic migration of the cluster. Although this prevents vSphere from migrating the virtual
machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the
vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing
such management functions outside of the Big Data Extensions environment might break the placement
policy of the cluster, such as the number of instances per host and the group associations. Even if you do not
specify a placement policy, using vCenter Server to migrate clusters can break the default ROUNDROBIN
placement policy constraints.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics, including the node groups and
placement policy constraints.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. If
the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation
process might fail or the cluster is created but does not function.
In this example, the cluster has data-compute separated nodes, and each node group has a
placementPolicy constraint. After a successful provisioning, four data nodes and eight compute nodes
are created and put into individual virtual machines. With the instancePerHost=1 constraint, the four
data nodes are placed on four ESXi hosts. The eight compute nodes are put onto four ESXi hosts: two
nodes on each ESXi host.
This cluster specification requires that you configure datastores and resource pools for at least four
hosts, and that there is sufficient disk space for Serengeti to perform the necessary placements during
deployment.
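The host counts in this example follow from dividing each node group's instance count by its instancePerHost value. The arithmetic can be sketched in Python as follows; the hosts_needed function is a hypothetical helper, and the sketch assumes that the node count divides evenly by the per-host value, as it does in the example.

```python
import math


def hosts_needed(instance_num, instance_per_host):
    """Number of ESXi hosts a node group occupies under an instancePerHost constraint."""
    return math.ceil(instance_num / instance_per_host)


data_hosts = hosts_needed(4, 1)      # four data nodes, one per host
compute_hosts = hosts_needed(8, 2)   # eight compute nodes, two per host
print(data_hosts, compute_hosts)
```

Both groups need four hosts, which is why the specification requires datastores and resource pools for at least four hosts.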
Create a Compute Only Cluster with the Serengeti Command-Line Interface
You can create compute-only clusters to run MapReduce jobs on existing HDFS clusters, including storage
solutions that serve as an external HDFS.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file that is modeled on the Serengeti compute_only_cluster.json sample
cluster specification file found in the Serengeti cli/samples directory.
2 Add the following code to your new cluster specification file.
For HDFS clusters, set port_num to 8020.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. If
the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation
process might fail or the cluster is created but does not function.
In this example, the externalHDFS field points to an HDFS. Assign the hadoop_jobtracker role to the
master node group and the hadoop_tasktracker role to the worker node group.
The externalHDFS field conflicts with node groups that have hadoop_namenode and hadoop_datanode
roles. This conflict might cause the cluster creation to fail or, if successfully created, the cluster might
not work correctly. To avoid this problem, define only a single HDFS.
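The conflict described above can be checked before a specification file is submitted. The following Python sketch is an illustration only: the nodeGroups, name, and roles keys are assumptions about the specification layout, the function external_hdfs_conflicts is hypothetical, and the group names master and worker are placeholders.

```python
HDFS_ROLES = {"hadoop_namenode", "hadoop_datanode"}


def external_hdfs_conflicts(spec):
    """Return names of node groups whose roles clash with an externalHDFS setting."""
    if "externalHDFS" not in spec:
        return []
    return [
        group.get("name", "<unnamed>")
        for group in spec.get("nodeGroups", [])
        if HDFS_ROLES & set(group.get("roles", []))
    ]


# Compute-only layout: a master with the JobTracker role and TaskTracker workers.
spec = {
    "externalHDFS": "hdfs://hostname-of-namenode:8020",
    "nodeGroups": [
        {"name": "master", "roles": ["hadoop_jobtracker"]},
        {"name": "worker", "roles": ["hadoop_tasktracker"]},
    ],
}
print(external_hdfs_conflicts(spec))
```

An empty list means that no node group defines a second HDFS alongside the external one.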
Create a Compute Workers Only Cluster With Non-Namenode HA HDFS Cluster
If you already have a physical Hadoop cluster and want to do more CPU or memory intensive operations,
you can increase the compute capacity by provisioning a worker only cluster. The worker only cluster is a
part of the physical Hadoop cluster and can be scaled out elastically.
With the compute workers only clusters, you can "burst out to virtual." It is a temporary operation that
involves borrowing resources when you need them and then returning the resources when you no longer
need them. With "burst out to virtual," you spin up compute-only worker nodes and add them to either an
existing physical or virtual Hadoop cluster.
Restrictions
■ Worker only clusters are not supported on Ambari and Cloudera Manager application managers.
■ These options are not supported on compute workers only clusters.
    ■ --appmanager appmanager_name
    ■ --type cluster_type
    ■ --hdfsNetworkName hdfs_network_name
    ■ --mapredNetworkName mapred_network_name
Prerequisites
■ Start the Big Data Extensions vApp.
■ Ensure that you have an existing Hadoop cluster.
■ Verify that you have the IP addresses of the NameNode and ResourceManager node.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 To define the characteristics of the new cluster, make a copy of the following cluster specification
2 Replace hdfs://hostname-of-namenode:8020 in the specification file with the namenode uniform resource
identifier (URI) of the external HDFS cluster.
3 Replace the hostname-of-jobtracker in the specification file with the FQDN or IP address of the JobTracker
in the external cluster.
4 Change the configuration section of the MapReduce Worker only cluster specification file. All the
values can be found in hdfs-site.xml of the external cluster.
About Customized Clusters
You can use an existing cluster specification file to create clusters by using the same configuration as your
previously created clusters. You can also edit the cluster specification file to customize the cluster
configuration.
Create a Default Serengeti Hadoop Cluster with the Serengeti Command-Line
Interface
You can create as many clusters as you want in your Serengeti environment, but your environment must
meet all prerequisites.
Prerequisites
■ Start the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Deploy a default Serengeti Hadoop cluster on vSphere.
cluster create --name cluster_name
The only valid characters for cluster names are alphanumeric characters and underscores. When you choose
the cluster name, also consider the applicable vApp name. Together, the vApp and cluster names must be
fewer than 80 characters.
During the deployment process, real-time progress updates appear on the command line.
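The naming rules in this procedure can be checked before you run cluster create. The following Python sketch is one possible reading of them, not a Serengeti feature: it treats "alphanumeric" as ASCII letters and digits and interprets the length limit as the sum of the two name lengths, both of which are assumptions. The function check_cluster_name and the sample names are hypothetical.

```python
import re

NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")
MAX_COMBINED_LENGTH = 80  # vApp name plus cluster name must stay below this


def check_cluster_name(cluster_name, vapp_name):
    """Return a list of naming-rule violations (empty when the name is acceptable)."""
    problems = []
    if not NAME_PATTERN.fullmatch(cluster_name):
        problems.append("cluster name may contain only letters, digits, and underscores")
    if len(vapp_name) + len(cluster_name) >= MAX_COMBINED_LENGTH:
        problems.append("vApp and cluster names together must be fewer than 80 characters")
    return problems


print(check_cluster_name("hadoop_cluster_01", "Serengeti_vApp"))
print(check_cluster_name("my-cluster", "Serengeti_vApp"))
```

The second call reports a violation because a hyphen is not an allowed character.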
What to do next
After the deployment finishes, you can run Hadoop commands and view the IP addresses of the Hadoop
node virtual machines from the Serengeti CLI.
Create a Basic Cluster with the Serengeti Command-Line Interface
You can create a basic cluster in your Serengeti environment. A basic cluster is a group of virtual machines
provisioned and managed by Serengeti. Serengeti helps you to plan and provision the virtual machines to
your specifications. You can use the basic cluster's virtual machines to install Big Data applications.
The basic cluster does not install the Big Data application packages used when creating a Hadoop or HBase
cluster. Instead, you can install and manage Big Data applications with third party application management
tools such as Apache Ambari or Cloudera Manager within your Big Data Extensions environment, and
integrate it with your Hadoop software. The basic cluster does not deploy a Hadoop or HBase cluster. You
must deploy software into the basic cluster's virtual machines using an external third party application
management tool.
The Serengeti package includes an annotated sample cluster specification file that you can use as an example
when you create your basic cluster specification file. In the Serengeti Management Server, the sample
specification file is located at /opt/serengeti/samples/basic_cluster.json. You can modify the
configuration values in the sample cluster specification file to meet your requirements. The only value you
cannot change is the value assigned to the role for each node group, which must always be basic.
You can deploy a basic cluster with the Big Data Extensions plug-in using a customized cluster specification
file.
To deploy software within the basic cluster virtual machines, use the cluster list --detail command, or
run serengeti-ssh.sh cluster_name to obtain the IP address of the virtual machine. You can then use the IP
address with management applications such as Apache Ambari or Cloudera Manager to provision the
virtual machine with software of your choosing. When the management application needs a user name and
password to connect to the virtual machines, configure it to use the user name serengeti and the password
that you specified when you created the basic cluster within Big Data Extensions.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the cluster, as well as the Big Data software
  you intend to deploy.
Procedure
1 Create a specification file to define the basic cluster's characteristics.
You must use the basic role for each node group you define for the basic cluster.
NOTE When creating a basic cluster, you do not need to specify a Hadoop distribution type using the
--distro option. The reason for this is that there is no Hadoop distribution being installed within the
basic cluster to be managed by Serengeti.
Create a Cluster with an Application Manager by Using the Serengeti
Command-Line Interface
You can use the Serengeti CLI to add a cluster with an application manager other than the default
application manager. Then you can manage your cluster with the new application manager.
NOTE If you want to create a local yum repository, you must create the repository before you create the
cluster.
Prerequisites
■ Connect to an application manager.
■ Ensure that you have adequate resources allocated to run the cluster. For information about resource
  requirements, see the documentation for your application manager.
■ Verify that you have more than one distribution if you want to use a distribution other than the default
  distribution. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
If you do not use the appManager parameter, the default application manager is used.
Create a Compute Workers Only Cluster by Using the Web Client
If you already have a physical Hadoop cluster and want to run more CPU-intensive or memory-intensive
operations, you can increase the compute capacity by provisioning a workers only cluster. The workers only
cluster is a part of the physical Hadoop cluster and can be scaled out elastically.
With compute workers only clusters, you can "burst out to virtual." This is a temporary operation that
involves borrowing resources when you need them and then returning the resources when you no longer
need them. With "burst out to virtual," you spin up compute only worker nodes and add them to either an
existing physical or virtual Hadoop cluster.
Workers only clusters are not supported on Ambari and Cloudera Manager application managers.
Prerequisites
■ Ensure that you have an existing Hadoop cluster.
■ Verify that you have the IP addresses of the NameNode and ResourceManager node.
Procedure
1. Click Create Big Data Cluster on the objects pane.
2. In the Create Big Data Cluster wizard, choose the same distribution as the Hadoop cluster.
3. Set the DataMaster URL HDFS:namenode ip or fqdn:8020.
4. Set the ComputeMaster URL nodeManager ip or fqdn.
5. Follow the steps in the wizard and add the other resources.
There will be three node managers in the cluster. The three new node managers are registered to the
resource manager.
Create a Cluster with a Custom Administrator Password with the Serengeti
Command-Line Interface
When you create a cluster, you can assign a custom administrator password to all the nodes in the cluster.
Custom administrator passwords let you directly log in to the cluster's nodes instead of having to first log in
to the Serengeti Management server.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster create command and include the --password parameter.
cluster create --name cluster_name --password
3. Enter your custom password, and enter it again.
Passwords are from 8 to 128 characters, and include only alphanumeric characters ([0-9, a-z, A-Z]) and
the following special characters: _ @ # $ % ^ & *.
Your custom password is assigned to all the nodes in the cluster.
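The password rule in step 3 can be checked before you run the command. The following is a minimal Python sketch, assuming the rule is exactly as stated above (8 to 128 characters, alphanumerics plus _ @ # $ % ^ & *). The function name is hypothetical and is not part of the Serengeti CLI.

```python
import re

# Rule taken from the text: 8 to 128 characters, letters, digits,
# and the special characters _ @ # $ % ^ & *
PASSWORD_PATTERN = re.compile(r"[0-9a-zA-Z_@#$%^&*]{8,128}")


def is_valid_cluster_password(password: str) -> bool:
    """Return True if the password satisfies the documented length and character rules."""
    return PASSWORD_PATTERN.fullmatch(password) is not None


if __name__ == "__main__":
    for candidate in ("Passw0rd_1", "short", "has space 123"):
        print(candidate, is_valid_cluster_password(candidate))
```

A check like this only mirrors the documented rule. The Serengeti CLI remains the authority on which passwords it accepts.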
Create a Cluster with an Available Distribution with the Serengeti Command-Line Interface
You can choose which Hadoop distribution to use when you deploy a cluster. If you do not specify a
Hadoop distribution, the resulting cluster includes the default distribution, Apache Hadoop.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster create command, and include the --distro parameter.
The --distro parameter’s value must match a distribution name displayed by the distro list
command.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. If
the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation
process might fail or the cluster is created but does not function.
This example deploys a cluster with the Cloudera CDH distribution:
cluster create --name clusterName --distro cdh
This example creates a customized cluster named mycdh that uses the CDH4 Hadoop distribution, and is
configured according to
the /opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json sample cluster
specification file. In this sample file, nameservice0 and nameservice1 are federated. That is,
nameservice0 and nameservice1 are independent and do not require coordination with each other. The
NameNode nodes in the nameservice0 node group are HDFS2 HA enabled. In Serengeti, name node
group names are used as service names for HDFS2.
Create a Cluster with Multiple Networks with the Serengeti Command-Line
Interface
When you create a cluster, you can distribute the management, HDFS, and MapReduce traffic to separate
networks. You might want to use separate networks to improve performance or to isolate traffic for security
reasons.
For optimal performance, use the same network for HDFS and MapReduce traffic in Hadoop and
Hadoop+HBase clusters. HBase clusters use the HDFS network for traffic related to the HBase Master and
HBase RegionServer services.
IMPORTANT You cannot configure multiple networks for clusters that use the MapR Hadoop distribution.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster create command and include the --networkName, --hdfsNetworkName, and --
If you omit an optional network parameter, the traffic associated with that network parameter is routed
on the management network that you specify by the --networkName parameter.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. If
the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation
process might fail or the cluster is created but does not function.
The cluster's management, HDFS, and MapReduce traffic is distributed among the specified networks.
Create a Cluster with Assigned Resources with the Serengeti Command-Line
Interface
By default, when you use Serengeti to deploy a Hadoop cluster, the cluster might contain any or all
available resources: vCenter Server resource pool for the virtual machine's CPU and memory, datastores for
the virtual machine's storage, and a network. You can assign which resources the cluster uses by specifying
specific resource pools, datastores, and/or a network when you create the Hadoop cluster.
Prerequisites
■ Start the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster create command, and specify any or all of the command’s resource parameters.
This example deploys a cluster named myHadoop on the myDS datastore, under the myRP resource pool,
and uses the myNW network for virtual machine communications.
Create a Cluster with Any Number of Master, Worker, and Client Nodes
You can create a Hadoop cluster with any number of master, worker, and client nodes.
Prerequisites
■ Start the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Create a cluster specification file to define the cluster's characteristics, including the node groups.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. If
the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation
process might fail or the cluster is created but does not function.
In this example, the cluster has one master MEDIUM size virtual machine, five worker SMALL size
virtual machines, and one client SMALL size virtual machine. The instanceNum attribute configures the
number of virtual machines in a node group.
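The layout described above (one MEDIUM master, five SMALL workers, one SMALL client) can be sketched by generating a specification file from Python. This is an illustrative sketch only: instanceNum comes from the text, but the nodeGroups, roles, and instanceType key names and the role strings are assumptions here and should be checked against the annotated sample files in /opt/serengeti/samples.

```python
import json


def node_group(name, roles, instance_num, instance_type):
    """Build one node group entry; instanceNum is the number of virtual machines in the group."""
    return {
        "name": name,
        "roles": roles,                 # role strings below are assumed, not taken from this guide
        "instanceNum": instance_num,
        "instanceType": instance_type,  # assumed key name for the MEDIUM/SMALL size
    }


cluster_spec = {
    "nodeGroups": [
        node_group("master", ["hadoop_namenode", "hadoop_jobtracker"], 1, "MEDIUM"),
        node_group("worker", ["hadoop_datanode", "hadoop_tasktracker"], 5, "SMALL"),
        node_group("client", ["hadoop_client"], 1, "SMALL"),
    ]
}

if __name__ == "__main__":
    # Print the specification as JSON so it can be saved to a file.
    print(json.dumps(cluster_spec, indent=2))
```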
Create a Customized Hadoop or HBase Cluster with the Serengeti Command-Line Interface
You can create clusters that are customized for your requirements, including the number of nodes, virtual
machine RAM and disk size, the number of CPUs, and so on.
The Serengeti package includes several annotated sample cluster specification files that you can use as
models when you create your custom specification files.
■ In the Serengeti Management Server, the sample cluster specification files are
  in /opt/serengeti/samples.
■ If you use the Serengeti Remote CLI client, the sample specification files are in the client directory.
Changing a node group role might cause the cluster creation process to fail. For example, workable clusters
require a NameNode, so if there are no NameNode nodes after you change node group roles, you cannot
create a cluster.
Prerequisites
■ Deploy the Serengeti vApp.
■ Ensure that you have adequate resources allocated to run the Hadoop cluster.
■ To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Create a cluster specification file to define the cluster's characteristics such as the node groups.
2. Access the Serengeti CLI.
3. Run the cluster create command, and specify the cluster specification file.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. If
the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation
process might fail or the cluster is created but does not function.
Chapter 4 Managing Hadoop and HBase Clusters
You can use the vSphere Web Client to start and stop your big data cluster and modify the cluster
configuration. You can also manage a cluster using the Serengeti Command-Line Interface.
CAUTION Do not use vSphere management functions such as migrating cluster nodes to other hosts for
clusters that you create with Big Data Extensions. Performing such management functions outside of the
Big Data Extensions environment can make it impossible for you to perform some Big Data Extensions
operations, such as disk failure recovery.
This chapter includes the following topics:
■ Stop and Start a Hadoop or HBase Cluster with the Serengeti Command-Line Interface
■ Scale Out a Hadoop or HBase Cluster with the Serengeti Command-Line Interface
■ Scale CPU and RAM with the Serengeti Command-Line Interface
■ Reconfigure a Big Data Cluster with the Serengeti Command-Line Interface
■ About Resource Usage and Elastic Scaling
■ Delete a Cluster by Using the Serengeti Command-Line Interface
■ About vSphere High Availability and vSphere Fault Tolerance
■ Reconfigure a Node Group with the Serengeti Command-Line Interface
■ Recover from Disk Failure with the Serengeti Command-Line Interface Client
Stop and Start a Hadoop or HBase Cluster with the Serengeti
Command-Line Interface
You can stop a currently running cluster and start a stopped cluster from the Serengeti CLI. When you start
or stop a cluster through Cloudera Manager or Ambari, only the services are started or stopped. However,
when you start or stop a cluster through Big Data Extensions, both the services and the virtual
machines are started or stopped.
Prerequisites
■ Verify that the cluster is provisioned.
■ Verify that enough resources, especially CPU and memory, are available to start the virtual machines in
  the Hadoop cluster.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster stop command.
cluster stop --name name_of_cluster_to_stop
3. Run the cluster start command.
cluster start --name name_of_cluster_to_start
Scale Out a Hadoop or HBase Cluster with the Serengeti Command-Line Interface
You specify the number of nodes in the cluster when you create Hadoop and HBase clusters. You can later
scale out the cluster by increasing the number of worker nodes and client nodes.
IMPORTANT Even if you changed the user password on the nodes of a cluster, the changed password is not
used for the new nodes that are created when you scale out a cluster. If you set the initial administrator
password for the cluster when you created the cluster, that initial administrator password is used for the
new nodes. If you did not set the initial administrator password for the cluster when you created the cluster,
new random passwords are used for the new nodes.
Prerequisites
Ensure that the cluster is started.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster resize command.
For node_type, specify worker or client. For the instanceNum parameter’s num_nodes value, use any
number that is larger than the current number of node_type instances.
Scale CPU and RAM with the Serengeti Command-Line Interface
You can increase or decrease a Hadoop or HBase cluster’s compute capacity and RAM to prevent memory
resource contention of running jobs.
Serengeti lets you adjust compute and memory resources without increasing the workload on the master
node. If increasing or decreasing the cluster's CPU is unsuccessful for a node, which is commonly due to
insufficient resources being available, the node is returned to its original CPU setting. If increasing or
decreasing the cluster's RAM is unsuccessful for a node, which is commonly due to insufficient resources,
the swap disk retains its new setting anyway. The disk is not returned to its original memory setting.
Although all node types support CPU and RAM scaling, do not scale a cluster's master node because
Serengeti powers down the virtual machine during the scaling process.
The maximum CPU and RAM settings depend on the virtual machine's version.
Table 4‑1. Maximum CPU and RAM Settings
Virtual Machine Version   Maximum Number of CPUs   Maximum RAM, in GB
7                         8                        255
8                         32                       1011
9                         64                       1011
10                        64                       1011
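The limits in Table 4‑1 can be encoded as a lookup so that a requested setting is checked before you attempt a resize. The sketch below is illustrative: the function name is hypothetical, and the conversion of 1 GB to 1024 MB for a per-node memory value expressed in MB is an assumption.

```python
# (maximum CPUs, maximum RAM in GB) per virtual machine version, copied from Table 4-1
VM_LIMITS = {
    7: (8, 255),
    8: (32, 1011),
    9: (64, 1011),
    10: (64, 1011),
}


def check_resize(vm_version, cpu_per_node=None, mem_mb_per_node=None):
    """Return a list of problems; an empty list means the request is within the table limits."""
    max_cpu, max_ram_gb = VM_LIMITS[vm_version]
    problems = []
    if cpu_per_node is not None and cpu_per_node > max_cpu:
        problems.append(f"{cpu_per_node} CPUs exceeds the maximum of {max_cpu}")
    # Assumption: 1 GB = 1024 MB when comparing a value given in MB against the table.
    if mem_mb_per_node is not None and mem_mb_per_node > max_ram_gb * 1024:
        problems.append(f"{mem_mb_per_node} MB exceeds the maximum of {max_ram_gb} GB")
    return problems


if __name__ == "__main__":
    print(check_resize(7, cpu_per_node=16))
    print(check_resize(10, cpu_per_node=64, mem_mb_per_node=65536))
```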
Prerequisites
Start the cluster if it is not running.
Procedure
1. Access the Serengeti Command-Line Interface.
2. Run the cluster resize command to change the number of CPUs or the amount of RAM of a cluster.
■ Node types are either worker or client.
■ Specify one or both scaling parameters: --cpuNumPerNode or --memCapacityMbPerNode.
Reconfigure a Big Data Cluster with the Serengeti Command-Line
Interface
You can reconfigure any big data cluster that you create with Big Data Extensions.
The cluster configuration is specified by attributes in Hadoop distribution configuration files such as
core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, yarn-env.sh, yarn-site.xml, and
hadoop-metrics.properties.
NOTE Always use the cluster config command to change the parameters specified by the configuration
files. If you manually modify these files, your changes are erased if the virtual machine is rebooted, or
if you use the cluster config, cluster start, cluster stop, or cluster resize commands.
Procedure
1. Use the cluster export command to export the cluster specification file for the cluster that you want to
reconfigure.
Name of the cluster that you want to reconfigure. Passwords are from 8 to
128 characters, and include only alphanumeric characters ([0-9, a-z, A-Z])
and the following special characters: _ @ # $ % ^ & * .
The file system path to which to export the specification file.
The name with which to label the exported cluster specification file.
2. Edit the configuration information located near the end of the exported cluster specification file.
If you are modeling your configuration file on existing Hadoop XML configuration files, use the
convert-hadoop-conf.rb conversion tool to convert Hadoop XML configuration files to the required
JSON format.
…
"configuration": {
  "hadoop": {
    "core-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
      // note: any value (int, float, boolean, string) must be enclosed in double quotes
      // and here is a sample:
      // "io.file.buffer.size": "4096"
    },
    "hdfs-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
    },
    "mapred-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
    },
    "hadoop-env.sh": {
      // "HADOOP_HEAPSIZE": "",
      // "HADOOP_NAMENODE_OPTS": "",
      // "HADOOP_DATANODE_OPTS": "",
      // "HADOOP_SECONDARYNAMENODE_OPTS": "",
      // "HADOOP_JOBTRACKER_OPTS": "",
      // "HADOOP_TASKTRACKER_OPTS": "",
      // "HADOOP_CLASSPATH": "",
      // "JAVA_HOME": "",
      // "PATH": "",
    },
    "log4j.properties": {
      // "hadoop.root.logger": "DEBUG, DRFA ",
      // "hadoop.security.logger": "DEBUG, DRFA ",
    },
    "fair-scheduler.xml": {
      // check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
      // "text": "the full content of fair-scheduler.xml in one line"
    },
    "capacity-scheduler.xml": {
      // check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
    }
  }
}
…
3. (Optional) If the JAR files of your Hadoop distribution are not in the $HADOOP_HOME/lib directory, add
the full path of the JAR file in $HADOOP_CLASSPATH to the cluster specification file.
This action lets the Hadoop daemons locate the distribution JAR files.
For example, the Cloudera CDH3 Hadoop Fair Scheduler JAR files are
in /usr/lib/hadoop/contrib/fairscheduler/. Add the following to the cluster specification file to
enable Hadoop to use the JAR files.
6. (Optional) Reset an existing configuration attribute to its default value.
a. Remove the attribute from the configuration section of the cluster configuration file or comment
out the attribute using double forward slashes (//).
b. Re-run the cluster config command.
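To illustrate the XML-to-JSON mapping that step 2 relies on, the following Python sketch reads Hadoop-style property elements and produces the nested "configuration" structure with every value kept as a string. It is not the convert-hadoop-conf.rb tool and makes no claim about how that tool works; the function name and the sample XML are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def hadoop_xml_to_dict(xml_text: str) -> dict:
    """Map each <property><name>/<value> pair to a string key and a string value."""
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            # Values stay strings, matching the note that every value must be double quoted.
            settings[name.strip()] = value.strip()
    return settings


SAMPLE_CORE_SITE = """<configuration>
  <property><name>io.file.buffer.size</name><value>4096</value></property>
</configuration>"""

fragment = {
    "configuration": {
        "hadoop": {"core-site.xml": hadoop_xml_to_dict(SAMPLE_CORE_SITE)}
    }
}

if __name__ == "__main__":
    print(json.dumps(fragment, indent=2))
```

For real configuration files, use the conversion tool named in step 2. The sketch only shows the shape of the result.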
About Resource Usage and Elastic Scaling
Scaling lets you adjust the compute capacity of Hadoop data-compute separated clusters. When you enable
elastic scaling for a Hadoop cluster, the Serengeti Management Server can stop and start compute nodes to
match resource requirements to available resources. You can use manual scaling for more explicit cluster
control.
Manual scaling is appropriate for static environments where capacity planning can predict resource
availability for workloads. Elastic scaling is best suited for mixed workload environments where resource
requirements and availability fluctuate.
When you select manual scaling, Big Data Extensions disables elastic scaling. You can configure the target
number of compute nodes for manual scaling. If you do not configure the target number of compute nodes,
Big Data Extensions sets the number of active compute nodes to the current number of active compute
nodes. If nodes become unresponsive, they remain in the cluster and the cluster operates with fewer
functional nodes. In contrast, when you enable elastic scaling, Big Data Extensions manages the number of
active TaskTracker nodes according to the range that you specify, replacing unresponsive or faulty nodes
with live, responsive nodes.
For both manual and elastic scaling, Big Data Extensions, not vCenter Server, controls the number of active
nodes. However, vCenter Server applies the usual reservations, shares, and limits to the resource pool of a
cluster according to the vSphere configuration of the cluster. vSphere DRS operates as usual, allocating
resources between competing workloads, which in turn influences how Big Data Extensions dynamically
adjusts the number of active nodes in competing Hadoop clusters while elastic scaling is in effect.
Big Data Extensions also lets you adjust the access priority for the datastores of cluster nodes by using the
vSphere Storage I/O Control feature. Clusters configured for HIGH I/O shares receive higher priority access
than clusters with NORMAL priority. Clusters configured for NORMAL I/O shares receive higher priority
access than clusters with LOW priority. In general, higher priority provides better disk I/O performance.
Scaling Modes
To change between manual and elastic scaling, you change the scaling mode.
■ MANUAL. Big Data Extensions disables elastic scaling. When you change to manual scaling, you can
  configure the target number of compute nodes. If you do not configure the target number of compute
  nodes, Big Data Extensions sets the number of active compute nodes to the current number of active
  compute nodes.
■ AUTO. Enables elastic scaling. Big Data Extensions manages the number of active compute nodes,
  maintaining the number of compute nodes in the range from the configured minimum to the
  configured maximum number of compute nodes in the cluster. If the minimum number of compute
  nodes is undefined, the lower limit is 0. If the maximum number of compute nodes is undefined, the
  upper limit is the number of available compute nodes.
Elastic scaling operates on a per-host basis, at a node-level granularity. That is, the more compute nodes
a Hadoop cluster has on a host, the finer the control that Big Data Extensions elasticity can exercise. The
tradeoff is that the more compute nodes you have, the higher the overhead in terms of runtime resource
cost, disk footprint, I/O requirements, and so on.
When resources are overcommitted, elastic scaling reduces the number of powered on compute nodes.
Conversely, if the cluster receives all the resources it requested from vSphere, and Big Data Extensions
determines that the cluster can make use of additional capacity, elastic scaling powers on additional
compute nodes.
Resources can become overcommitted for many reasons, such as:
■ The compute nodes have lower resource entitlements than a competing workload, according to
  how vCenter Server applies the usual reservations, shares, and limits as configured for the cluster.
■ Physical resources are configured to be available, but another workload is consuming those
  resources.
In elastic scaling, Big Data Extensions has two different behaviors for deciding how many active
compute nodes to maintain. In both behaviors, Big Data Extensions replaces unresponsive or faulty
nodes with live, responsive nodes.
■ Variable. The number of active, healthy TaskTracker compute nodes is maintained from the
  configured minimum number of compute nodes to the configured maximum number of compute
  nodes. The number of active compute nodes varies as resource availability fluctuates.
■ Fixed. The number of active, healthy TaskTracker compute nodes is maintained at a fixed number
  when the same value is configured for the minimum and maximum number of compute nodes.
Default Cluster Scaling Parameter Values
When you create a cluster, its scaling configuration is as follows.
■ The cluster's scaling mode is MANUAL, for manual scaling.
■ The cluster's minimum number of compute nodes is -1. It appears as "Unset" in the Serengeti CLI
  displays. Big Data Extensions elastic scaling treats a minComputeNodeNum value of -1 as if it were zero (0).
■ The cluster's maximum number of compute nodes is -1. It appears as "Unset" in the Serengeti CLI
  displays. Big Data Extensions elastic scaling treats a maxComputeNodeNum value of -1 as if it were
  unlimited.
■ The cluster's target number of nodes is not applicable. Its value is -1. Big Data Extensions manual
  scaling operations treat a targetComputeNodeNum value of -1 as if it were unspecified upon a change to
  manual scaling.
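The way the default value -1 is interpreted can be sketched in a few lines of Python. This is a reading of the rules stated above, not Big Data Extensions code: the function names are hypothetical, and capping a configured maximum at the number of available compute nodes is an assumption.

```python
UNSET = -1  # shown as "Unset" in the Serengeti CLI displays


def effective_compute_range(min_num, max_num, available):
    """Return the (lower, upper) bounds that elastic scaling would work within."""
    lower = 0 if min_num == UNSET else min_num
    # An unset maximum means "unlimited", so the bound becomes the available node count.
    # Assumption: a configured maximum larger than the available count is capped as well.
    upper = available if max_num == UNSET else min(max_num, available)
    return lower, upper


def scaling_behavior(min_num, max_num):
    """Classify the elastic scaling behavior as 'fixed' or 'variable'."""
    if min_num != UNSET and min_num == max_num:
        return "fixed"
    return "variable"


if __name__ == "__main__":
    print(effective_compute_range(UNSET, UNSET, available=10))
    print(scaling_behavior(4, 4))
```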
Interactions Between Scaling and Other Cluster Operations
Some cluster operations cannot be performed while Big Data Extensions is actively scaling a cluster.
If you try to perform the following operations while Big Data Extensions is scaling a cluster in MANUAL
mode, Big Data Extensions warns you that in the cluster's current state, the operation cannot be performed.
■ Concurrent attempt at manual scaling
■ Switch to AUTO mode while manual scaling operations are in progress
If a cluster is in AUTO mode for elastic scaling when you perform the following cluster operations on it, Big
Data Extensions changes the scaling mode to MANUAL and changes the cluster to manual scaling. You can
re-enable the AUTO mode for elastic scaling after the cluster operation finishes, except if you delete the
cluster.
■ Delete the cluster
■ Repair the cluster
■ Stop the cluster
If a cluster is in AUTO mode for elastic scaling when you perform the following cluster operations on it, Big
Data Extensions temporarily switches the cluster to MANUAL mode. When the cluster operation finishes,
Big Data Extensions returns the scaling mode to AUTO, which re-enables elastic scaling.
■ Resize the cluster
■ Reconfigure the cluster
If Big Data Extensions is scaling a cluster when you perform an operation that changes the scaling mode to
MANUAL, your requested operation waits until the scaling finishes, and then the requested operation
begins.
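The mode changes described above can be summarized as a small lookup. The Python sketch below is only a restatement of those rules; the operation labels and the function name are hypothetical, and returning MANUAL after a delete is a simplification because the cluster no longer exists at that point.

```python
# Operations that leave the cluster in MANUAL mode until you re-enable AUTO yourself
STAYS_MANUAL = {"delete", "repair", "stop"}
# Operations that switch to MANUAL only temporarily and then return to AUTO
RETURNS_TO_AUTO = {"resize", "reconfigure"}


def mode_after_operation(current_mode, operation):
    """Return the scaling mode in effect after the given cluster operation finishes."""
    if current_mode != "AUTO":
        # The rules above only describe clusters that start in AUTO mode.
        return current_mode
    if operation in STAYS_MANUAL:
        return "MANUAL"
    if operation in RETURNS_TO_AUTO:
        return "AUTO"
    raise ValueError(f"operation not covered by the rules above: {operation}")


if __name__ == "__main__":
    for op in ("stop", "resize"):
        print(op, mode_after_operation("AUTO", op))
```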
Enable Elastic Scaling for a Hadoop Cluster with the Serengeti Command-Line
Interface
When you enable elastic scaling for a data-compute separated Hadoop cluster, Big Data Extensions
optimizes cluster performance and utilization of TaskTracker compute nodes.
To enable elastic scaling, set a data-compute separated Hadoop cluster's scaling mode to AUTO and
configure the minimum and maximum number of compute nodes. If you do not configure the minimum or
maximum number of compute nodes, the previous minimum or maximum setting, respectively, is retained.
To ensure that elastic scaling keeps a cluster operating under contention with more than the cluster’s
initial default setting of zero compute nodes, set the minComputeNodeNum parameter to a nonzero
number. To limit the maximum number of compute nodes that can be used in a Hadoop cluster, set
the maxComputeNodeNum parameter to less than the total number of available compute nodes.
In elastic scaling, Big Data Extensions has two different behaviors for deciding how many active compute
nodes to maintain. In both behaviors, Big Data Extensions replaces unresponsive or faulty nodes with live,
responsive nodes.
■ Variable. The number of active, healthy TaskTracker compute nodes is maintained from the configured
  minimum number of compute nodes to the configured maximum number of compute nodes. The
  number of active compute nodes varies as resource availability fluctuates.
■ Fixed. The number of active, healthy TaskTracker compute nodes is maintained at a fixed number when
  the same value is configured for the minimum and maximum number of compute nodes.
Prerequisites
■ Understand how elastic scaling and resource usage work. See “About Resource Usage and Elastic
  Scaling.”
■ Verify that the cluster you want to optimize is data-compute separated. See “About Hadoop and HBase
  Cluster Deployment Types.”
Procedure
1. Access the Serengeti CLI.
2. Run the cluster setParam command, and set the --elasticityMode parameter value to AUTO.
cluster setParam --name cluster_name --elasticityMode AUTO [--minComputeNodeNum minNum]
[--maxComputeNodeNum maxNum]
Enable Manual Scaling for a Hadoop Cluster with the Serengeti Command-Line
Interface
When you enable manual scaling for a cluster, Big Data Extensions disables elastic scaling. When you enable
manual scaling, you can configure the target number of compute nodes. If you do not configure the target
number of compute nodes, Big Data Extensions sets the number of active compute nodes to the current
number of active compute nodes.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster setParam command, and set the --elasticityMode parameter value to MANUAL.
Configure Scaling Parameters with the Serengeti Command-Line Interface
You can configure scaling parameters, such as the target number of nodes, with or without changing the
scaling mode.
Procedure
1. Access the Serengeti CLI.
2. To display a cluster's scaling settings, run the cluster list command.
cluster list --detail --name cluster_name
3. To configure one or more scaling parameters, run the cluster setParam command.
The --name parameter is required, and you can include as few or as many of the other parameters as
you want. You can repeatedly run the command to configure or reconfigure additional scaling
parameters.
--name                    Name of the cluster. Specify this parameter every time you run the
                          cluster setParam command.
--elasticityMode          MANUAL or AUTO.
--targetComputeNodeNum    Number of nodes. This parameter is applicable only for MANUAL scaling
                          mode.
--minComputeNodeNum       Lower limit of the range of active compute nodes to maintain in the
                          cluster. This parameter is applicable only for AUTO scaling mode.
--maxComputeNodeNum       Upper limit of the range of active compute nodes to maintain in the
                          cluster. This parameter is applicable only for AUTO scaling mode.
--ioShares                LOW, NORMAL, or HIGH.
4. To reset one or more scaling parameters to their default values, run the cluster resetParam command.
The --name parameter is required, and you can include as few or as many of the other parameters as
you want. You can repeatedly run the command to reset additional scaling parameters.
For data-compute separated nodes, you can reset all the scaling parameters to their defaults by using
the --all parameter.
--name                    Name of the cluster. Specify this parameter every time you run the
                          cluster resetParam command.
--all                     Reset all scaling parameters to their defaults.
--elasticityMode          Sets the scaling mode to MANUAL.
--targetComputeNodeNum    Reset targetComputeNodeNum to -1. Big Data Extensions manual scaling
                          operations treat a targetComputeNodeNum value of -1 as if it were
                          unspecified upon a change to manual scaling.
--minComputeNodeNum       Reset minComputeNodeNum to 0. It appears as "Unset" in the Serengeti CLI
                          displays. Big Data Extensions elastic scaling treats a minComputeNodeNum
                          value of -1 as if it were zero (0).
--maxComputeNodeNum       Reset maxComputeNodeNum to unlimited. It appears as "Unset" in the
                          Serengeti CLI displays. Big Data Extensions elastic scaling treats a
                          maxComputeNodeNum value of -1 as if it were unlimited.
--ioShares                Reset ioShares to NORMAL.
Schedule Fixed Elastic Scaling for a Hadoop Cluster
You can enable fixed, elastic scaling according to a preconfigured schedule. Scheduled fixed, elastic scaling
provides more control than variable, elastic scaling while still improving efficiency, allowing explicit
changes in the number of active compute nodes during periods of predictable usage.
For example, in an office with typical workday hours, there is likely a reduced load on a VMware View
resource pool after the office staff goes home. You could configure scheduled fixed, elastic scaling to specify
a greater number of compute nodes from 8 PM to 4 AM, when you know that the workload would
otherwise be very light.
Prerequisites
From the Serengeti Command-Line Interface, enable the cluster for elastic scaling, and set the
minComputeNodeNum and maxComputeNodeNum parameters to the same value: the number of active TaskTracker
nodes that you want during the period of scheduled fixed elasticity.
Procedure
1 Open a command shell, such as Bash or PuTTY, and log in to the Serengeti Management Server as user
serengeti.
2 Use any scheduling mechanism that you want to call
the /opt/serengeti/sbin/set_compute_node_num.sh script to set the number of active TaskTracker
compute nodes that you want.
After the scheduling mechanism calls the set_compute_node_num.sh script, fixed, elastic scaling remains
in effect with the configured number of active TaskTracker compute nodes until the next scheduling
mechanism change or until a user changes the scaling mode or parameters in either the vSphere Web
Client or the Serengeti Command-Line Interface.
This example shows how to use a crontab file on the Serengeti Management Server to schedule specific
numbers of active TaskTracker compute nodes.
# cluster_A: use 20 active TaskTracker compute nodes from 11:00 to 16:00, and 30 compute nodes the rest of the day
00 11 * * * /opt/serengeti/sbin/set_compute_node_num.sh --name cluster_A
# cluster_B: reset the number of active TaskTracker compute nodes every 6 hours to 15
0 */6 * * * /opt/serengeti/sbin/set_compute_node_num.sh --name cluster_B
Delete a Cluster by Using the Serengeti Command-Line Interface
You can delete a Hadoop cluster that you no longer need, regardless of whether it is running. When a
Hadoop cluster is deleted, all its virtual machines and resource pools are destroyed.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster delete command.
cluster delete --name cluster_name
About vSphere High Availability and vSphere Fault Tolerance
The Serengeti Management Server leverages vSphere HA to protect the Hadoop master node virtual
machine, which can be monitored by vSphere.
When a Hadoop NameNode or JobTracker service stops unexpectedly, vSphere restarts the Hadoop virtual
machine on another host, reducing unplanned downtime. If vSphere Fault Tolerance is configured and the
master node virtual machine stops unexpectedly because of host failover or loss of network connectivity, the
secondary node is used, without downtime.
Reconfigure a Node Group with the Serengeti Command-Line Interface
You can reconfigure node groups by modifying node group configuration data in the associated cluster
specification file. When you configure a node group, its configuration overrides any cluster level
configuration of the same name.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster export command to export the cluster’s cluster specification file.
Recover from Disk Failure with the Serengeti Command-Line Interface Client
If there is a disk failure in a Hadoop cluster, and the disk does not perform management roles such as
NameNode, JobTracker, ResourceManager, HMaster, or ZooKeeper, you can recover by running the
Serengeti cluster fix command.
Big Data Extensions uses a large number of inexpensive disk drives for data storage (configured as Just a
Bunch of Disks). If several disks fail, the Hadoop data node might shut down. Big Data Extensions lets you
recover from disk failures.
Serengeti supports recovery from swap and data disk failure on all supported Hadoop distributions. Disks
are recovered and started in sequence to avoid the temporary loss of multiple nodes at once. A new disk
matches the corresponding failed disk’s storage type and placement policies.
The MapR distribution does not support recovery from disk failure by using the cluster fix command.
IMPORTANT Even if you changed the user password on the cluster's nodes, the changed password is not
used for the new nodes that are created by the disk recovery operation. If you set the cluster's initial
administrator password when you created the cluster, that initial administrator password is used for the
new nodes. If you did not set the cluster's initial administrator password when you created the cluster, new
random passwords are used for the new nodes.
Chapter 5 Monitoring the Big Data Extensions Environment
You can monitor the status of Serengeti-deployed clusters, including their datastores, networks, and
resource pools through the Serengeti Command-Line Interface. You can also view a list of available Hadoop
distributions. Monitoring capabilities are also available in the vSphere Web Client.
This chapter includes the following topics:
- View List of Application Managers by Using the Serengeti Command-Line Interface
- View Available Hadoop Distributions with the Serengeti Command-Line Interface
- View Supported Distributions for All Application Managers by Using the Serengeti Command-Line Interface
- View Configurations or Roles for Application Manager and Distribution by Using the Serengeti Command-Line Interface
- View Provisioned Hadoop and HBase Clusters with the Serengeti Command-Line Interface
- View Datastores with the Serengeti Command-Line Interface
- View Networks with the Serengeti Command-Line Interface
- View Resource Pools with the Serengeti Command-Line Interface
View List of Application Managers by Using the Serengeti Command-Line Interface
You can use the appmanager list command to list the application managers that are installed in the
Big Data Extensions environment.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list
The command returns a list of all application managers that are installed on the Big Data Extensions
environment.
View Available Hadoop Distributions with the Serengeti Command-Line Interface
You use the distro list command to view a list of Hadoop distributions that are available in your
Serengeti deployment. When you create clusters, you can use any available Hadoop distribution.
Procedure
1 Access the Serengeti CLI.
2 Run the distro list command.
The available Hadoop distributions are listed, along with their packages.
What to do next
Before you use a distribution, verify that it includes the services that you want to deploy. If services are
missing, add the appropriate packages to the distribution.
View Supported Distributions for All Application Managers by Using
the Serengeti Command-Line Interface
You can list the Hadoop distributions that are supported on the application managers in the
Big Data Extensions environment to determine whether a particular distribution is available on an application
manager in your Big Data Extensions environment.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list --name application_manager_name [--distros]
If you do not include the --name parameter, the command returns a list of all the Hadoop distributions
that are supported on each of the application managers in the Big Data Extensions environment.
The command returns a list of all distributions that are supported for the application manager of the name
that you specify.
View Configurations or Roles for Application Manager and
Distribution by Using the Serengeti Command-Line Interface
You can use the appmanager list command to list the Hadoop configurations or roles for a specific
application manager and distribution.
The configuration list includes those configurations that you can use to configure the cluster in the cluster
specifications.
The role list contains the roles that you can use to create a cluster. You should not use unsupported roles to
create clusters in the application manager.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list --name application_manager_name [--distro distro_name (--configurations | --roles)]
The command returns a list of the Hadoop configurations or roles for a specific application manager and
distribution.
View Provisioned Hadoop and HBase Clusters with the Serengeti Command-Line Interface
From the Serengeti Command-Line Interface, you can list the provisioned Hadoop and HBase clusters that
are in the Serengeti deployment.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster list command.
cluster list
This example displays a specific cluster by including the --name parameter.
cluster list --name cluster_name
This example displays detailed information about a specific cluster by including the --name and --detail parameters.
cluster list --name cluster_name --detail
View Datastores with the Serengeti Command-Line Interface
From the Serengeti CLI, you can see the datastores that are in the Serengeti deployment.
Procedure
1 Access the Serengeti CLI.
2 Run the datastore list command.
This example displays detailed information by including the --detail parameter.
datastore list --detail
This example displays detailed information about a specific datastore by including the --name and --detail parameters.
datastore list --name datastore_name --detail
View Networks with the Serengeti Command-Line Interface
From the Serengeti CLI, you can see the networks that are in the Serengeti deployment.
Procedure
1 Access the Serengeti CLI.
2 Run the network list command.
This example displays detailed information by including the --detail parameter.
network list --detail
This example displays detailed information about a specific network by including the --name and --detail parameters.
network list --name network_name --detail
View Resource Pools with the Serengeti Command-Line Interface
From the Serengeti CLI, you can see the resource pools that are in the Serengeti deployment.
Procedure
1 Access the Serengeti CLI.
2 Run the resourcepool list command.
This example displays detailed information by including the --detail parameter.
resourcepool list --detail
This example displays detailed information about a specific resource pool by including the --name and --detail parameters.
resourcepool list --name resourcepool_name --detail
Chapter 6 Cluster Specification Reference
This information describes the Serengeti cluster specification file’s attributes and their mapping to Hadoop
attributes, and how to convert a Hadoop XML configuration file to a Serengeti configuration file.
This chapter includes the following topics:
- Cluster Specification File Requirements
- Cluster Definition Requirements
- Annotated Cluster Specification File
- Cluster Specification Attribute Definitions
- White Listed and Black Listed Hadoop Attributes
- Convert Hadoop XML Files to Serengeti JSON Files
Cluster Specification File Requirements
A cluster specification file is a text file with the configuration attributes provided in a JSON-like formatted
structure. Cluster specification files must adhere to requirements concerning syntax, quotation mark usage,
and comments.
- To parse cluster specification files, Serengeti uses the Jackson JSON Processor. For syntax requirements, such as the truncation policy for float types, see the Jackson JSON Processor Wiki.
- Always enclose numeric values in quotation marks. For example:
  "mapred.tasktracker.reduce.tasks.maximum" : "2"
  The quotation marks ensure that integers are correctly interpreted instead of being converted to double-precision floating point, which can cause unintended consequences.
- Do not include any comments.
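The quoting requirement can be applied mechanically before a configuration fragment is pasted into a specification file. The following Python sketch is one possible way to do that, not a Serengeti tool. The function name quote_numeric_values and the sample fragment are hypothetical; only the attribute name comes from the example above.

```python
import json


def quote_numeric_values(config):
    """Return a copy of a configuration structure with every number turned into a string."""
    if isinstance(config, dict):
        return {key: quote_numeric_values(value) for key, value in config.items()}
    if isinstance(config, list):
        return [quote_numeric_values(item) for item in config]
    # bool is a subclass of int in Python, so leave True and False untouched.
    if isinstance(config, (int, float)) and not isinstance(config, bool):
        return str(config)
    return config


# Hypothetical fragment with an unquoted integer value.
fragment = {"mapred.tasktracker.reduce.tasks.maximum": 2}
print(json.dumps(quote_numeric_values(fragment)))
```

The helper leaves strings and booleans as they are, so running it on an already quoted fragment changes nothing.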
Cluster Definition Requirements
Cluster specification files contain configuration definitions for clusters, such as their roles and node groups.
Cluster definitions must adhere to requirements concerning node group roles, cluster roles, and instance
numbers.
A cluster definition has the following requirements:
- Node group roles cannot be empty. You can determine the valid role names for your Hadoop distribution by using the distro list command.
- The hadoop_namenode and hadoop_jobtracker roles must be configured in a single node group.
  - In Hadoop 2.0 clusters, such as CDH4 or Pivotal HD, the instance number can be greater than 1 to create an HDFS HA or Federation cluster.
  - Otherwise, the total instance number must be 1.
- Node group instance numbers must be positive numbers.
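These requirements can be checked before a specification file is handed to the CLI. The Python sketch below shows one reading of the rules and is not the validation that Serengeti itself performs. It assumes a specification that was already loaded into a dictionary with a nodeGroups list whose entries use the name, roles, and instanceNum keys. The function name check_cluster_definition and the hadoop2 flag are hypothetical.

```python
MASTER_ROLES = {"hadoop_namenode", "hadoop_jobtracker"}


def check_cluster_definition(spec, hadoop2=False):
    """Return a list of human-readable problems; an empty list means no rule was violated."""
    problems = []
    master_groups = []
    for group in spec.get("nodeGroups", []):
        name = group.get("name", "<unnamed>")
        roles = group.get("roles", [])
        instances = group.get("instanceNum")
        if not roles:
            problems.append(f"{name}: roles cannot be empty")
        if not isinstance(instances, int) or instances <= 0:
            problems.append(f"{name}: instanceNum must be a positive number")
        if MASTER_ROLES & set(roles):
            master_groups.append(group)
    if len(master_groups) > 1:
        problems.append("hadoop_namenode and hadoop_jobtracker must be in a single node group")
    elif master_groups and not hadoop2 and master_groups[0].get("instanceNum") != 1:
        # Only Hadoop 2.0 clusters may run more than one master instance.
        problems.append("master node group must have exactly 1 instance")
    return problems


# Hypothetical cluster definition used only to exercise the checks.
spec = {
    "nodeGroups": [
        {"name": "master", "roles": ["hadoop_namenode", "hadoop_jobtracker"], "instanceNum": 1},
        {"name": "worker", "roles": ["hadoop_datanode", "hadoop_tasktracker"], "instanceNum": 3},
    ]
}
print(check_cluster_definition(spec))
```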
Annotated Cluster Specification File
The Serengeti cluster specification file defines the nodes, resources, and so on for a cluster. You can use this
annotated cluster specification file, and the sample files in /opt/serengeti/samples, as models when you
create your clusters.
The following code is a typical cluster specification file. For code annotations, see Table 6-1.
The cluster definition elements are defined in the table.
Table 6-1. Example Cluster Specification Annotation
Line(s) | Attribute | Example Value | Description
4 | name | master | Node group name.
5-8 | role | hadoop_namenode, hadoop_jobtracker | Node group role. hadoop_namenode and hadoop_jobtracker are deployed to the node group's virtual machine.
9 | instanceNum | 1 | Number of instances in the node group. Only one virtual machine is created for the group. You can have multiple instances for hadoop_tasktracker, hadoop_datanode, hadoop_client, pig, and hive. For HDFS1 clusters, you can have only one instance of hadoop_namenode and hadoop_jobtracker. For HDFS2 clusters, you can have two hadoop_namenode instances. With a MapR distribution, you can configure multiple instances of hadoop_jobtracker.
10 | instanceType | LARGE | Node group instance type. Instance types are predefined virtual machine specifications, which are combinations of the number of CPUs, RAM sizes, and storage size. The predefined numbers can be overridden by the cpuNum, memCapacityMB, and storage attributes in the Serengeti server specification file.
11 | cpuNum | 2 | Number of CPUs per virtual machine. This attribute overrides the number of vCPUs in the predefined virtual machine specification.
12 | memCapacityMB | 4096 | RAM size, in MB, per virtual machine. This attribute overrides the RAM size in the predefined virtual machine specification.
13-16 | storage | See lines 14-15 for one group's storage attributes | Node group storage requirements.
14 | type | SHARED | Storage type. The node group is deployed using only shared storage.
15 | sizeGB | 20 | Storage size. Each node in the node group is deployed with 20 GB available disk space.
17 | haFlag | on | HA protection for the node group. The node group is deployed with vSphere HA protection.
18-20 | rpNames | rp1 | Resource pools under which the node group virtual machines are deployed. These pools can be an array of values.
 |  |  | See lines 3-21, which define the same attributes for the master node. In lines 34-35, data disks are placed on dsNames4Data datastores, and system disks are placed on dsNames4System datastores.
 |  |  | Placement policy constraints. You need at least three ESXi hosts because there are three instances and a requirement that each instance be on its own host. This group is provisioned on hosts on rack1, rack2, and rack3 by using a ROUNDROBIN algorithm.
 |  |  | See lines 4-16, which define the same attributes for the master node.
 |  |  | Placement policy constraints. You need at least three ESXi hosts to meet the instance requirements. The compute node group references a data node group through STRICT typing. The two compute instances use a data instance on the ESXi host. The STRICT association provides better performance.
66-82 | Node group definition for the client node |  | See previous node group definitions.
83-86 | configuration | Empty in the code sample | Hadoop configuration customization.
Cluster Specification Attribute Definitions
Cluster definitions include attributes for the cluster itself and for each of the cluster's node groups.
Cluster Specification Outer Attributes
Cluster specification outer attributes apply to the cluster as a whole.
Table 6-2. Cluster Specification Outer Attributes
Attribute | Type | Mandatory/Optional | Description
nodeGroups | object | Mandatory | One or more group specifications. See Table 6-3.
externalHDFS | string | Optional | Valid only for compute-only clusters. URI of external HDFS.
Cluster Specification Node Group Objects and Attributes
Node group objects and attributes apply to one node group in a cluster.
Table 6-3. Cluster Specification’s Node Group Objects and Attributes
Attribute | Type | Mandatory/Optional | Description
name | string | Mandatory | User defined node group name.
roles | list of string | Mandatory | List of software packages or services to install on the node group’s virtual machine. Values must match the roles displayed by the distro list command.
instanceNum | integer | Mandatory | Number of virtual machines in the node group. Must be a positive integer. Generally, you can have multiple instances for hadoop_tasktracker, hadoop_datanode, hadoop_client, pig, and hive. For HDFS1 clusters, you can have only one instance of hadoop_namenode and hadoop_jobtracker. For HDFS2 clusters, you can have two hadoop_namenode instances. With a MapR distribution, you can configure multiple instances of hadoop_jobtracker.
instanceType | string | Optional | Size of virtual machines in the node group, expressed as the name of a predefined virtual machine template: SMALL, MEDIUM, LARGE, or EXTRA_LARGE. See Table 6-4. If you specify cpuNum, memCapacityMB, or sizeGB attributes, they override the corresponding value of your selected virtual machine template for the applicable node group.
cpuNum | integer | Optional | Number of CPUs per virtual machine. If the haFlag value is FT, the cpuNum value must be 1.
memCapacityMB | integer | Optional | RAM size, in MB, per virtual machine. NOTE When using MapR 3.1, you must specify a minimum of 5120 MB of memory capacity for the zookeeper, worker, and client nodes.
storage | object | Optional | Storage settings.
type | string | Optional | Storage type: LOCAL for local storage, or SHARED for shared storage.
sizeGB | integer | Optional | Data storage size. Must be a positive integer.
dsNames | list of string | Optional | Array of datastores the node group can use.
dsNames4Data | list of string | Optional | Array of datastores the data node group can use.
dsNames4System | list of string | Optional | Array of datastores the system can use.
rpNames | list of string | Optional | Array of resource pools the node group can use.
haFlag | string | Optional | By default, NameNode and JobTracker nodes are protected by vSphere HA. on: Protect the node with vSphere HA. ft: Protect the node with vSphere FT. off: Do not use vSphere HA or vSphere FT.
placementPolicies | object | Optional | Up to three optional constraints: instancePerHost, groupRacks, and groupAssociations.
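Several of the value constraints in the node group table lend themselves to a quick pre-flight check. The following Python sketch is illustrative only and does not reproduce Serengeti's own validation. The function name check_node_group is hypothetical, the comparison of haFlag values is assumed to be case-insensitive, and the FT rule is checked only when cpuNum is given explicitly.

```python
VALID_INSTANCE_TYPES = {"SMALL", "MEDIUM", "LARGE", "EXTRA_LARGE"}
VALID_STORAGE_TYPES = {"LOCAL", "SHARED"}
VALID_HA_FLAGS = {"on", "ft", "off"}


def check_node_group(group):
    """Return a list of problems found in one node group dictionary."""
    problems = []
    instance_type = group.get("instanceType")
    if instance_type is not None and instance_type not in VALID_INSTANCE_TYPES:
        problems.append(f"unknown instanceType: {instance_type}")
    storage = group.get("storage", {})
    storage_type = storage.get("type")
    if storage_type is not None and storage_type not in VALID_STORAGE_TYPES:
        problems.append(f"unknown storage type: {storage_type}")
    size_gb = storage.get("sizeGB")
    if size_gb is not None and (not isinstance(size_gb, int) or size_gb <= 0):
        problems.append("sizeGB must be a positive integer")
    ha_flag = group.get("haFlag")
    if ha_flag is not None:
        ha_flag = str(ha_flag).lower()  # assumption: values are compared case-insensitively
        if ha_flag not in VALID_HA_FLAGS:
            problems.append(f"unknown haFlag: {ha_flag}")
        elif ha_flag == "ft" and group.get("cpuNum", 1) != 1:
            # An absent cpuNum is not checked here because it comes from the instance type.
            problems.append("cpuNum must be 1 when haFlag is ft")
    return problems


# Hypothetical node group that violates the FT rule.
print(check_node_group({"name": "master", "haFlag": "ft", "cpuNum": 2}))
```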
Serengeti Predefined Virtual Machine Sizes
Serengeti provides predefined virtual machine sizes to use for defining the size of virtual machines in a
cluster node group.
White Listed and Black Listed Hadoop Attributes
White listed attributes are Apache Hadoop attributes that you can configure from Serengeti with the
cluster config command. The majority of Apache Hadoop attributes are white listed. However, there are a
few black listed Apache Hadoop attributes, which you cannot configure from Serengeti.
If you use an attribute in the cluster specification file that is neither a white listed nor a black listed attribute,
and then run the cluster config command, a warning appears and you must answer yes to continue or no
to cancel.
If your cluster includes a NameNode or JobTracker, Serengeti configures the fs.default.name and
dfs.http.address attributes. You can override these attributes by defining them in your cluster
specification.
Table 6-5. Configuration Attribute White List
File | Attributes
core-site.xml | All core-default configuration attributes listed on the Apache Hadoop 2.x documentation Web page. For example, http://hadoop.apache.org/docs/branch_name/core-default.html. Exclude the attributes defined in the black list.
hdfs-site.xml | All hdfs-default configuration attributes listed on the Apache Hadoop 2.x documentation Web page. For example, http://hadoop.apache.org/docs/branch_name/hdfs-default.html. Exclude the attributes defined in the black list.
mapred-site.xml | All mapred-default configuration attributes listed on the Apache Hadoop 2.x documentation Web page. For example, http://hadoop.apache.org/docs/branch_name/mapred-default.html. Exclude the attributes defined in the black list.
hadoop-env.sh | JAVA_HOME, PATH, HADOOP_CLASSPATH, HADOOP_HEAPSIZE, HADOOP_NAMENODE_OPTS, HADOOP_DATANODE_OPTS, HADOOP_SECONDARYNAMENODE_OPTS, HADOOP_JOBTRACKER_OPTS, HADOOP_TASKTRACKER_OPTS, HADOOP_LOG_DIR
log4j.properties | hadoop.root.logger, hadoop.security.logger, log4j.appender.DRFA.MaxBackupIndex, log4j.appender.RFA.MaxBackupIndex, log4j.appender.RFA.MaxFileSize
fair-scheduler.xml | text. All fair_scheduler configuration attributes listed on the Apache Hadoop 2.x documentation Web page that can be used inside the text field. For example, http://hadoop.apache.org/docs/branch_name/fair_scheduler.html. Exclude the attributes defined in the black list.
capacity-scheduler.xml | All capacity_scheduler configuration attributes listed on the Apache Hadoop 2.x documentation Web page. For example, http://hadoop.apache.org/docs/branch_name/capacity_scheduler.html. Exclude the attributes defined in the black list.
mapred-queue-acls.xml | All mapred-queue-acls configuration attributes listed on the Apache Hadoop 2.x Web page. For example,
Convert Hadoop XML Files to Serengeti JSON Files
If you defined many attributes in your Hadoop configuration files, you can convert that configuration
information into the JSON format that Serengeti can use.
Procedure
1 Copy the directory $HADOOP_HOME/conf/ from your Hadoop cluster to the Serengeti Management Server.
2 Open a command shell, such as Bash or PuTTY, log in to the Serengeti Management Server, and run the
convert-hadoop-conf.rb Ruby conversion script.
convert-hadoop-conf.rb path_to_hadoop_conf
The converted Hadoop configuration attributes, in JSON format, appear.
3 Open the cluster specification file for editing.
4 Replace the cluster level configuration or group level configuration items with the output that was
generated by the convert-hadoop-conf.rb Ruby conversion script.
What to do next
Access the Serengeti CLI, and use the new specification file.
- To apply the new configuration to a cluster, run the cluster config command. Include the --specFile parameter and its value: the new specification file.
- To create a cluster with the new configuration, run the cluster create command. Include the --specFile parameter and its value: the new specification file.
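To make the XML-to-JSON conversion concrete, the following Python sketch reads the property elements of a Hadoop *-site.xml document and prints them as JSON. It is an independent illustration and not the convert-hadoop-conf.rb script, whose exact output layout is not reproduced here. The function name hadoop_xml_to_dict, the sample XML, and the choice to key the result by file name are assumptions.

```python
import json
import xml.etree.ElementTree as ET


def hadoop_xml_to_dict(xml_text):
    """Collect <property> name/value pairs from a Hadoop configuration document."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            # Values stay strings, which matches the rule to quote numeric values.
            properties[name.strip()] = value.strip()
    return properties


# Hypothetical content of a mapred-site.xml file.
sample = """<configuration>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>"""

print(json.dumps({"mapred-site.xml": hadoop_xml_to_dict(sample)}, indent=2))
```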
Chapter 7 Serengeti CLI Command Reference
This section provides descriptions and syntax requirements for every Serengeti CLI command.
This chapter includes the following topics:
- appmanager Commands
- cluster Commands
- connect Command
- datastore Commands
- disconnect Command
- distro list Command
- fs Commands
- hive script Command
- mr Commands
- network Commands
- pig script Command
- resourcepool Commands
- topology Commands
appmanager Commands
The appmanager {*} commands let you add, delete, and manage your application managers.
appmanager add Command
The appmanager add command lets you add an application manager other than the default to your
environment. You can specify either Cloudera Manager or Ambari application managers. The appmanager
add command reads the user name and password in interactive mode. If https is specified, the command
prompts for the file path of the certificate.
Parameter | Mandatory/Optional | Description
--name application_manager_name | Mandatory | Application manager name
--description description | Optional |
--type [ClouderaManager/Ambari] | Mandatory | Name of the type of application manager to use, either Cloudera Manager or Apache Ambari
--url <http[s]://server:port> | Mandatory | Application manager service URL, formatted as http[s]://application_manager_server_ip_or_hostname:port. Prompts for a login, username, and password.
appmanager delete Command
You can use the Serengeti CLI to delete an application manager when you no longer need it.
The application manager to delete must not contain clusters or the process fails.
appmanager delete --name application_manager_name
Parameter | Mandatory or Optional | Description
--name application_manager_name | Mandatory | Application manager name
appmanager modify Command
With the appmanager modify command, you can modify the information for an application manager. For
example, you can change the manager server IP address if it is not a static IP, or you can upgrade the
administrator account.
IMPORTANT Making an error when you modify an application manager can have serious consequences. For
example, you change a Cloudera Manager URL to the URL for a new application manager. If you create
Big Data Extensions clusters with the old Cloudera Manager instance, the previous Cloudera Manager
cluster cannot be managed again. In addition, the Cloudera Manager cluster is not available to the new
application manager instance.
appmanager modify --name application_manager_name
Parameter | Mandatory or Optional | Description
--name application_manager_name | Mandatory | Application manager name
--url http[s]://server:port | Optional | Application manager service URL, formatted as http[s]://application_manager_server_ip_or_hostname:port. Prompts for a login, username, and password. You can use either http or https.
--changeAccount | Optional | Changes the login account and password for the application manager.
--changeCertificate | Optional | Changes the SSL certificate of the application manager. This parameter only applies to application managers with a URL that starts with https.
appmanager list Command
The appmanager list command returns a list of all available application managers including the default
application manager.
Parameter | Mandatory/Optional | Description
--name application_manager_name | Optional | The application manager name.
--distro distribution_name | Optional | The name of a specific distribution. If you do not include the distribution_name variable, the command returns all Hadoop distributions that are supported by the application manager.
--configurations|--roles | Optional | The Hadoop configurations or roles for a specific application manager and distribution. You should not use unsupported roles to create a cluster.
cluster Commands
The cluster {*} commands let you connect to Hadoop and HBase clusters, create and delete clusters, stop
and start clusters, and perform cluster management operations.
cluster config Command
The cluster config command lets you modify the configuration of an existing Hadoop or HBase cluster,
whether the cluster is configured according to the Serengeti defaults or you have customized the cluster.
NOTE The cluster config command can only be used with clusters that were created with the default
application manager. For those clusters that were created with either Ambari or Cloudera Manager, any
cluster configuration changes should be made from the application manager. Also, new services and
configurations changed in the external application manager cannot be synced from Big Data Extensions.
You can use the cluster config command with the cluster export command to return cluster services and
the original Hadoop configuration to normal in the following situations:
A service such as NameNode, JobTracker, DataNode, or TaskTracker goes down.
n
You manually changed the Hadoop configuration of one or more of the nodes in a cluster.
n
Run the cluster export command, and then run the cluster config command. Include the new cluster
specification file that you just exported.
If the external HDFS cluster was created by Big Data Extensions, use the cluster config
command to add the HBase cluster topology to the HDFS cluster.
The following example depicts the specification file to add the topology:
--yes | Optional | Answer Y to Y/N confirmation. If not specified, manually type y or n.
--skipConfigValidation | Optional | Skip cluster configuration validation.
cluster create Command
You use the cluster create command to create a Hadoop or HBase cluster.
If the cluster specification does not include the required nodes, for example a master node, Serengeti creates
the cluster according to the default cluster configuration that Serengeti deploys.
Parameter    Mandatory/Optional    Description
--name cluster_name_in_Serengeti    Mandatory    Cluster name.
--appmanager appmanager_name    Optional    Name of an application manager other than the default to manage your clusters.
--type cluster_type    Optional    Cluster type: Hadoop (default) or HBase.
--password    Optional. Do not use if you use the --resume parameter.    Custom password for all the nodes in the cluster. Passwords are from 8 to 128 characters, and include only alphanumeric characters ([0-9, a-z, A-Z]) and the following special characters: _ @ # $ % ^ & *.
--specFile spec_file_path    Optional    Cluster specification filename.
--distro Hadoop_distro_name    Optional    Hadoop distribution for the cluster.
--dsNames datastore_names    Optional    Datastore to use to deploy Hadoop cluster in Serengeti. Multiple datastores can be used, separated by comma. By default, all available datastores are used. When you specify the --dsNames parameter, the cluster can use only those datastores that you provide in this command.
--networkName management_network_name    Mandatory    Network to use for management traffic in Hadoop clusters. If you omit any of the optional network parameters, the traffic associated with that parameter is routed on the management network that you specify with the --networkName parameter.
--hdfsNetworkName hdfs_network_name    Optional    Network to use for HDFS traffic in Hadoop clusters.
--mapredNetworkName mapred_network_name    Optional    Network to use for MapReduce traffic in Hadoop clusters.
--rpNames resource_pool_name    Optional    Resource pool to use for Hadoop clusters. Multiple resource pools can be used, separated by comma.
--resume    Optional. Do not use if you use the --password parameter.    Recover from a failed deployment process.
--topology topology_type    Optional    Topology type for rack awareness: HVE, RACK_AS_RACK, or HOST_AS_RACK.
--yes    Optional    Confirmation whether to proceed following an error message. If the responses are not specified, you can type y or n. If you specify y, the cluster creation continues. If you do not specify y, the CLI presents the following prompt after displaying the warning message: Are you sure you want to continue (Y/N)?
--skipConfigValidation    Optional    Skip validation of the cluster configuration.
--localRepoURL    Optional    Option to create a local yum repository.
--externalMapReduce FQDN_of_Jobtracker/ResourceManager:port    Optional    The port number is optional.
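The password rule given for the --password parameter can be checked before you run the command. The following is a minimal sketch in Python and is not part of the Serengeti CLI. The function name is_valid_cluster_password is hypothetical, and the pattern assumes that the characters listed in the table are the complete allowed set.

```python
import re

# Assumption: the allowed set is exactly the digits, the letters a-z and A-Z,
# and the special characters _ @ # $ % ^ & * listed for --password.
_PASSWORD_PATTERN = re.compile(r"[0-9a-zA-Z_@#$%^&*]{8,128}")


def is_valid_cluster_password(password: str) -> bool:
    """Return True if the password is 8 to 128 allowed characters long."""
    return _PASSWORD_PATTERN.fullmatch(password) is not None


if __name__ == "__main__":
    for candidate in ("Passw0rd_@#", "short1", "has space 123"):
        print(candidate, is_valid_cluster_password(candidate))
```

A check of this kind only mirrors the documented rule. The Serengeti server remains the authority on whether a password is accepted.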
cluster delete Command
The cluster delete command lets you delete a cluster in Serengeti. When a cluster is deleted, all its virtual
machines and resource pools are destroyed.
Parameter    Mandatory/Optional    Description
--name cluster_name    Mandatory    Name of cluster to delete.
cluster export Command
The cluster export command lets you export cluster data. Depending on the options and parameters that
you specify, you can export the cluster data to a specific location, format the delimiter of the export file,
specify the type of data to export and indicate the value for the topology.
You can use either of two commands to export the cluster specification file.
You can use the cluster export command to print the IP-to-rack mapping table. Each line of the table has the
format ip rack. An external HDFS cluster can use this table to implement data locality for HBase and
MapReduce clusters.
You can use the cluster export command to print the IP for the management network of all nodes in a
cluster.
Parameter    Mandatory/Optional    Description
--name cluster_name    Mandatory    Name of cluster to export.
--type SPEC|RACK|IP    Optional    The type of data to export. The value can be SPEC (export the spec file), RACK (export the rack topology of all nodes), or IP (export the IP of all nodes). The default value is SPEC.
--output path_to_output_file    Optional    The output file in which to save the exported data.
--specfile path_to_spec_file    Optional    The output file in which to save the cluster specification.
--topology [HOST_AS_RACK|RACK_AS_RACK|HVE|NONE]    Optional    The value for the topology. The default value is the topology that you specified when the cluster was created.
--delimiter    Optional    The string to separate each line in the result. The default value is "\n" (that is, line by line).
cluster fix Command
The cluster fix command lets you recover from a failed disk.
IMPORTANT Even if you changed the user password on the cluster's nodes, the changed password is not
used for the new nodes that are created by the disk recovery operation. If you set the cluster's initial
administrator password when you created the cluster, that initial administrator password is used for the
new nodes. If you did not set the cluster's initial administrator password when you created the cluster, new
random passwords are used for the new nodes.
Table 7‑1.
Parameter    Mandatory/Optional    Description
--name cluster_name    Mandatory    Name of cluster that has a failed disk.
--disk    Mandatory    Recover node disks.
--nodeGroup nodegroup_name    Optional    Perform scan and recovery only on the specified node group, not on all the management nodes in the cluster.
cluster list Command
The cluster list command lets you view a list of provisioned clusters in Serengeti. You can see the
following information: name, distribution, status, and each node group's information. The node group
information consists of the instance count, CPU, memory, type, and size.
The application managers monitor the services and functions of your Big Data Extensions environment.
Big Data Extensions syncs up the status from the application managers periodically. You can use the
cluster list command to get the latest status of your environment. If there are any warnings displayed,
you can check the details from the application manager console.
Table 7‑2.
Parameter    Mandatory/Optional    Description
--name cluster_name_in_Serengeti    Optional    Name of cluster to list.
--detail    Optional    List cluster details, including name in Serengeti, distribution, deploy status, and each node’s information in different roles. If you specify this option, Serengeti queries the vCenter Server to get the latest node status.
cluster resetParam Command
The cluster resetParam command lets you reset a cluster’s scaling parameters and ioShares level to default
values. You must specify at least one optional parameter.
Table 7‑3.
Parameter    Mandatory/Optional    Description
--name cluster_name    Mandatory    Name of cluster for which to reset scaling parameters.
--all    Optional    Reset all scaling parameters to their defaults.
--elasticityMode    Optional    Reset auto to false.
--targetComputeNodeNum    Optional    Reset to -1. Big Data Extensions manual scaling operations treat a targetComputeNodeNum value of -1 as if it were unspecified upon a change to manual scaling.
--minComputeNodeNum    Optional    Reset to -1. It appears as "Unset" in the Serengeti CLI displays. Big Data Extensions elastic scaling treats a minComputeNodeNum value of -1 as if it were zero (0).
--maxComputeNodeNum    Optional    Reset to -1. It appears as "Unset" in the Serengeti CLI displays. Big Data Extensions elastic scaling treats a maxComputeNodeNum value of -1 as if it were unlimited.
--ioShares    Optional    Reset to NORMAL.
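The way the reset value of -1 is interpreted for the minimum and maximum compute node numbers can be sketched as follows. This Python fragment only illustrates the rule stated in the table and does not call Big Data Extensions. The names UNSET and effective_compute_node_range are hypothetical.

```python
import math

# Value that the table above documents as the reset state ("Unset" in CLI displays).
UNSET = -1


def effective_compute_node_range(min_num: int = UNSET, max_num: int = UNSET):
    """Return the (lower, upper) limits that elastic scaling effectively uses.

    A minimum of -1 is treated as zero and a maximum of -1 as unlimited,
    which is represented here by math.inf.
    """
    lower = 0 if min_num == UNSET else min_num
    upper = math.inf if max_num == UNSET else max_num
    return lower, upper


if __name__ == "__main__":
    print(effective_compute_node_range())      # both values unset
    print(effective_compute_node_range(2, 5))  # both values configured
```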
cluster resize Command
The cluster resize command lets you change the number of nodes in a node group or scale up/down
cluster CPU or RAM. The cluster resize function creates new nodes with the same services and
configurations as the original nodes. You must specify at least one optional parameter.
If you specify the --instanceNum parameter, you cannot specify either the --cpuNumPerNode parameter or the
--memCapacityMbPerNode parameter.
You can specify the --cpuNumPerNode and the --memCapacityMbPerNode parameters at the same time to scale
the CPU and RAM with a single command.
IMPORTANT Even if you changed the user password on the cluster's nodes, the changed password is not
used for the new nodes that are created when you scale out a cluster. If you set the cluster's initial
administrator password when you created the cluster, that initial administrator password is used for the
new nodes. If you did not set the cluster's initial administrator password when you created the cluster, new
random passwords are used for the new nodes.
Table 7‑4.
Parameter    Mandatory/Optional    Description
--name cluster_name_in_Serengeti    Mandatory    Target Hadoop cluster in Serengeti.
--nodeGroup name_of_the_node_group    Mandatory    Target role to scale out in the cluster deployed by Serengeti.
--instanceNum instance_number    Optional    Target count to which to scale out. Must be greater than the original count.
--cpuNumPerNode num_of_vCPUs    Optional    Number of vCPUs to use for a node group.
--memCapacityMbPerNode size_in_MB    Optional    Total memory, in MB, to use for the node group.
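The parameter rules for cluster resize can be expressed as a small pre-check. The following Python sketch is illustrative only. The function check_resize_options and its argument names are hypothetical, and it assumes that the caller already knows the current instance count of the node group.

```python
from typing import List, Optional


def check_resize_options(
    instance_num: Optional[int] = None,
    cpu_num_per_node: Optional[int] = None,
    mem_capacity_mb_per_node: Optional[int] = None,
    current_instance_num: Optional[int] = None,
) -> List[str]:
    """Return a list of rule violations; an empty list means the options are consistent."""
    problems = []
    scaling_up = cpu_num_per_node is not None or mem_capacity_mb_per_node is not None
    if instance_num is None and not scaling_up:
        problems.append("Specify at least one optional parameter.")
    if instance_num is not None and scaling_up:
        problems.append(
            "--instanceNum cannot be combined with --cpuNumPerNode or --memCapacityMbPerNode."
        )
    if (
        instance_num is not None
        and current_instance_num is not None
        and instance_num <= current_instance_num
    ):
        problems.append("--instanceNum must be greater than the original count.")
    return problems


if __name__ == "__main__":
    print(check_resize_options(instance_num=5, current_instance_num=3))
    print(check_resize_options(instance_num=5, cpu_num_per_node=4))
```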
cluster setParam Command
The cluster setParam command lets you set scaling parameters and the ioShares priority for a Hadoop
cluster in Serengeti. You must specify at least one optional parameter.
In elastic scaling, Big Data Extensions has two different behaviors for deciding how many active compute
nodes to maintain. In both behaviors, Big Data Extensions replaces unresponsive or faulty nodes with live,
responsive nodes.
- Variable. The number of active, healthy TaskTracker compute nodes is maintained from the configured
  minimum number of compute nodes to the configured maximum number of compute nodes. The
  number of active compute nodes varies as resource availability fluctuates.
- Fixed. The number of active, healthy TaskTracker compute nodes is maintained at a fixed number when
  the same value is configured for the minimum and maximum number of compute nodes.
Table 7‑5.
Parameter    Mandatory/Optional    Description
--name cluster_name    Mandatory    Name of cluster for which to set elasticity parameters.
--elasticityMode mode    Optional    MANUAL or AUTO.
--targetComputeNodeNum numTargetNodes    Optional. This parameter is applicable only for the MANUAL scaling mode. If the cluster is in AUTO mode or you are changing it to AUTO mode, this parameter is ignored.    Number of compute nodes for the specified Hadoop cluster or node group within that cluster. Must be an integer >= 0. If zero (0), all the nodes in the specific Hadoop cluster, or if nodeGroup is specified, in the node group, are decommissioned and powered off. For integers from one to the maximum number of nodes in the Hadoop cluster, the specified number of nodes remain commissioned and powered on, and the remaining nodes are decommissioned. For integers greater than the maximum number of nodes in the Hadoop cluster, all the nodes in the specified Hadoop cluster or node group are re-commissioned and powered on.
--minComputeNodeNum minNum    Optional. This parameter is applicable only for the AUTO scaling mode. If the cluster is in MANUAL mode or you are changing it to MANUAL mode, this parameter is ignored.    Lower limit of the range of active compute nodes to maintain in the cluster.
--maxComputeNodeNum maxNum    Optional. This parameter is applicable only for the AUTO scaling mode. If the cluster is in MANUAL mode or you are changing it to MANUAL mode, this parameter is ignored.    Upper limit of the range of active compute nodes to maintain in the cluster.
--ioShares level    Optional    Priority access level: LOW, NORMAL, or HIGH.
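The effect of the --targetComputeNodeNum value in MANUAL mode, and the distinction between the variable and fixed behaviors in AUTO mode, can be summarized in a short Python sketch. It models only the rules described in this section and is not how Big Data Extensions is implemented. The function names are hypothetical.

```python
def active_nodes_after_manual_scaling(target: int, total_nodes: int) -> int:
    """Return how many compute nodes stay commissioned and powered on.

    Zero powers off every node, a value up to the total keeps that many nodes,
    and a larger value keeps all nodes powered on.
    """
    if target < 0:
        raise ValueError("targetComputeNodeNum must be an integer >= 0")
    return min(target, total_nodes)


def elastic_behavior(min_num: int, max_num: int) -> str:
    """Return 'Fixed' when the minimum equals the maximum, otherwise 'Variable'."""
    return "Fixed" if min_num == max_num else "Variable"


if __name__ == "__main__":
    for target in (0, 3, 10):
        print(target, active_nodes_after_manual_scaling(target, total_nodes=5))
    print(elastic_behavior(2, 6), elastic_behavior(4, 4))
```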
cluster start Command
The cluster start command lets you start a cluster in Serengeti.
Table 7‑6.
Parameter    Mandatory/Optional    Description
--name cluster_name    Mandatory    Name of cluster to start.
cluster stop Command
The cluster stop command lets you stop a Hadoop cluster in Serengeti.
Table 7‑7.
Parameter    Mandatory/Optional    Description
--name cluster_name    Mandatory    Name of cluster to stop.
cluster target Command
The cluster target command lets you connect to a cluster to run fs, mr, pig, and hive commands with the
Serengeti CLI. You must rerun the cluster target command if it has been more than 30 minutes since you ran
it in your current Serengeti Command-Line Interface session.
Table 7‑8.
Parameter    Mandatory/Optional    Description
--name cluster_name    Optional    Name of the cluster to which to connect. If unspecified, the first cluster listed by the cluster list command is connected. The --name and --info parameters are mutually exclusive. You can use either parameter, but not both.
--info    Optional    Show targeted cluster information, such as the HDFS URL, Job Tracker URL and Hive server URL. The --info and --name parameters are mutually exclusive. You can use either parameter, but not both.
connect Command
The connect command lets you connect and log in to a remote Serengeti server.
The connect command reads the user name and password in interactive mode. You must run the connect
command every time you begin a Serengeti Command-Line Interface session, and again after the 30 minute
session timeout. If you do not run this command, you cannot run any other commands.
Table 7‑9.
Parameter    Mandatory/Optional    Description
--host    Mandatory    Serengeti Web service URL, formatted as serengeti_management_server_ip_or_host:port. By default, the Serengeti web service is started at port 8443.
datastore Commands
The datastore {*} commands let you add and delete datastores, and view the list of datastores in a
Serengeti deployment.
datastore add Command
The datastore add command lets you add a datastore to Serengeti.
Table 7‑10.
Parameter    Mandatory/Optional    Description
--name datastore_name_in_Serengeti    Mandatory    Datastore name in Serengeti.
--spec datastore_name_in_vCenter_Server    Mandatory    Datastore name in vSphere. You can use a wildcard to specify multiple vmfs stores. Supported wildcards are * and ?.
--type {LOCAL|SHARED}    Optional    (Default=SHARED) Type of datastore: LOCAL or SHARED.
datastore delete Command
The datastore delete command lets you delete a datastore from Serengeti.
Table 7‑11.
Parameter    Mandatory/Optional    Description
--name datastore_name_in_Serengeti    Mandatory    Name of datastore to delete.
datastore list Command
The datastore list command lets you view a list of datastores in Serengeti. If you do not specify a
datastore name, all datastores are listed.
Table 7‑12.
Parameter    Mandatory/Optional    Description
--name Name_of_datastore_name_in_Serengeti    Optional    Name of datastore to list.
--detail    Optional    List the datastore details, including the datastore path in vSphere.
disconnect Command
The disconnect command lets you disconnect and log out from a remote Serengeti server. After you
disconnect from the server, you cannot run any Serengeti commands until you reconnect with the connect
command.
There are no command parameters.
distro list Command
The distro list command lets you view the list of roles in a Hadoop distribution.
Table 7‑13.
Parameter    Mandatory/Optional    Description
--name distro_name    Optional    Name of distribution to show.
fs Commands
The fs {*} FileSystem (FS) shell commands let you manage files on the HDFS and local systems. Before you
can run an FS command in a Command-Line Interface session, or after the 30 minute session timeout, you
must run the cluster target command.
fs cat Command
The fs cat command lets you copy source paths to stdout.
Table 7‑14.
Parameter    Mandatory/Optional    Description
file_name    Mandatory    File to be shown in the console. Multiple files must use quotes. For example: "/path/file1 /path/file2".
fs chgrp Command
The fs chgrp command lets you change group associations of one or more files.
Table 7‑15.
Parameter    Mandatory/Optional    Description
--group group_name    Mandatory    Group name of the file.
--recursive {true|false}    Optional    Perform the operation recursively.
file_name    Mandatory    Name of file to change. Multiple files must use quotes. For example: "/path/file1 /path/file2".
fs chmod Command
The fs chmod command lets you change permissions of one or more files.
Table 7‑16.
Parameter    Mandatory/Optional    Description
--mode <permission mode>    Mandatory    File permission mode. For example: 755.
--recursive {true|false}    Optional    Perform the operation recursively.
file_name    Mandatory    Name of file to change. Multiple files must use quotes. For example: "/path/file1 /path/file2".
fs chown Command
The fs chown command lets you change the owner of one or more files.
Table 7‑17.
Parameter    Mandatory/Optional    Description
--owner permission_mode    Mandatory    Name of file owner.
--recursive {true|false}    Optional    Change the owner recursively through the directory structure.
file_name    Mandatory    Name of file to change. Multiple files must use quotes. For example: "/path/file1 /path/file2".
fs copyFromLocal Command
The fs copyFromLocal command lets you copy one or more source files from the local file system to the
destination file system. The result of this command is the same as the put command.
Table 7‑18.
Parameter    Mandatory/Optional    Description
--from local_file_path    Mandatory    Local file path. Multiple files must use quotes. For example: "/path/file1 /path/file2".
--to HDFS_file_path    Mandatory    File path in HDFS. If the from parameter value lists multiple files, the to parameter value is the directory name.
fs copyToLocal Command
The fs copyToLocal command lets you copy one or more files to the local file system. The result of this
command is the same as the get command.
Table 7‑19.
Parameter    Mandatory/Optional    Description
--from HDFS_file_path    Mandatory    File path in HDFS. Multiple files must use quotes. For example: "/path/file1 /path/file2".
--to local_file_path    Mandatory    File path in the local file system. If the from parameter value lists multiple files, the to parameter value is the directory name.
fs copyMergeToLocal Command
The fs copyMergeToLocal command lets you concatenate one or more HDFS files to a local file.
Table 7‑20.
Parameter    Mandatory/Optional    Description
--from HDFS_file_path    Mandatory    File path in HDFS. Multiple files must use quotes. For example: "/path/file1 /path/file2".
--to local_file_path    Mandatory    File path in the local file system.
--endline {true|false}    Optional    Add end of line (EOL) character.
fs count Command
The fs count command lets you count the number of directories, files, bytes, quota, and remaining quota.
Table 7‑21.
Parameter    Mandatory/Optional    Description
--path HDFS_path    Mandatory    Path to be counted.
--quota {true|false}    Optional    Include quota information.
fs cp Command
The fs cp command lets you copy one or more files from source to destination. This command allows
multiple sources, in which case the destination must be a directory.
Table 7‑22.
Parameter    Mandatory/Optional    Description
--from HDFS_source_file_path    Mandatory    Source file path in HDFS. Multiple files must use quotes. For example: "/path/file1 /path/file2".
--to HDFS_destination_file_path    Mandatory    Destination file path in HDFS. If the from parameter value lists multiple files, the to parameter value is the directory name.
fs du Command
The fs du command lets you display the size of files and directories that are in the given directory, or if just
a file is specified, the file size.
Table 7‑23.
Parameter    Mandatory/Optional    Description
file_name    Mandatory    File to display. Multiple files must use quotes. For example: "/path/file1 /path/file2".
fs expunge Command
The fs expunge command lets you empty the HDFS trash bin. There are no command parameters.
fs get Command
The fs get command lets you copy one or more HDFS files to the local file system.
Table 7‑24.
Parameter    Mandatory/Optional    Description
--from HDFS_file_path    Mandatory    File path in HDFS. Multiple files must use quotes. For example: "/path/file1 /path/file2".
--to local_file_path    Mandatory    File path in the local file system. If the from parameter value lists multiple files, the to parameter value is the directory name.
fs ls Command
The fs ls command lets you view a list of a directory’s files.
Table 7‑25.
Parameter    Mandatory/Optional    Description
path_name    Mandatory    Path for which to view the list of files. Multiple paths must use quotes. For example: "/path/file1 /path/file2".
--recursive {true|false}    Optional    Perform the operation recursively.
fs mkdir Command
The fs mkdir command lets you create a directory.
Table 7‑26.
Parameter    Mandatory/Optional    Description
dir_name    Mandatory    Name of directory to create.
fs moveFromLocal Command
The fs moveFromLocal command copies files similarly to the put command, except that the local source file
is deleted after it is copied.
Table 7‑27.
Parameter    Mandatory/Optional    Description
--from local_file_path    Mandatory    File path in the local file system. Multiple files must use quotes. For example: "/path/file1 /path/file2".
--to HDFS_file_path    Mandatory    File path in HDFS. If the from parameter value lists multiple files, the to parameter value is the directory name.
fs mv Command
The fs mv command lets you move one or more local source files to an HDFS destination.
Table 7‑28.
Parameter    Mandatory/Optional    Description
--from source_file_path    Mandatory    Local source path and file. Multiple files must use quotes. For example: "/path/file1 /path/file2".
--to HDFS_file_path    Mandatory    HDFS destination path and file. If the from parameter value lists multiple files, the to parameter value is the directory name.
fs put Command
The fs put command lets you copy one or more files from the local file system to HDFS.
Table 7‑29.
Parameter    Mandatory/Optional    Description
--from local_file_path    Mandatory    Local source path and file. Multiple files must use quotes. For example: "/path/file1 /path/file2".
--to HDFS_file_path    Mandatory    HDFS destination path and file. If the from parameter value lists multiple files, the to parameter value is the directory name.
fs rm Command
The fs rm command lets you remove files from the HDFS.
Table 7‑30.
Parameter    Mandatory/Optional    Description
file_path    Mandatory    File to remove.
--recursive {true|false}    Optional    Perform the operation recursively.
--skipTrash {true|false}    Optional    Bypass trash.
fs setrep Command
The fs setrep command lets you change a file’s replication factor.
Table 7‑31.
Parameter    Mandatory/Optional    Description
--path file_path    Mandatory    Path and file for which to change the replication factor.
--replica replica_number    Mandatory    Number of replicas.
--recursive {true|false}    Optional    Perform the operation recursively.
--waiting {true|false}    Optional    Wait for the replica number to equal the specified number.
fs tail Command
The fs tail command lets you display a file’s last kilobyte of content to stdout.
Table 7‑32.
Parameter    Mandatory/Optional    Description
file_path    Mandatory    File path to display.
--file {true|false}    Optional    Show content while the file grows.
fs text Command
The fs text command lets you output a source file in text format.
Table 7‑33.
Parameter    Mandatory/Optional    Description
file_path    Mandatory    File path to display.
fs touchz Command
The fs touchz command lets you create a zero length file.
Table 7‑34.
Parameter    Mandatory/Optional    Description
file_path    Mandatory    Name of file to create.
hive script Command
The hive script command lets you run a Hive or Hive Query Language (HQL) script.
Before you can run the hive script command in a Command-Line Interface session, or after the 30 minute
session timeout, you must run the cluster target command.
Table 7‑35.
Parameter    Mandatory/Optional    Description
--location script_path    Mandatory    Hive or HQL script file name.
mr Commands
The mr {*} commands let you manage MapReduce jobs.
Before you can run an mr {*} command in a Command-Line Interface session, or after the 30 minute session
timeout, you must run the cluster target command.
mr jar Command
The mr jar command lets you run a MapReduce job located inside the provided jar.
Table 7‑36.
Parameter    Mandatory/Optional    Description
--jarfile jar_file_path    Mandatory    JAR file path.
--mainclass main_class_name    Mandatory    Class that contains the main() method.
--args arg    Optional    Arguments to send to the main class. To send multiple arguments, double quote them.
mr job counter Command
The mr job counter command lets you print the counter value of the MapReduce job.
Table 7‑37.
Parameter    Mandatory/Optional    Description
--jobid job_id    Mandatory    MR job id.
--groupname group_name    Mandatory    Counter’s group name.
Specify either the --dhcp parameter for dynamic addresses or the combination of parameters that are
required for static addresses, but not parameters for both dynamic and static addresses.
Table 7‑47.
Parameter    Mandatory/Optional    Description
--name network_name_in_Serengeti    Mandatory    Name of network resource to add.
--portGroup port_group_name_in_vSphere    Mandatory    Name of port group in vSphere to add.
--dhcp    Mandatory for dynamic addresses. Do not use for static addresses.    Assign dynamic DHCP IP addresses.
--ip IP_range, --dns dns_server_ip_addr, --secondDNS dns_server_ip_addr, --gateway gateway_IP_addr, --mask network_IP_addr_mask    Mandatory for static addresses. Do not use for dynamic addresses.    Assign static IP addresses. Express the IP_range in the format xx.xx.xx.xx-xx[,xx]*. Express IP addresses in the format xx.xx.xx.xx.
network delete Command
The network delete command lets you delete a network from Serengeti. Deleting an unused network frees
the network's IP addresses for use by other services.
Table 7‑48.
Parameter    Mandatory/Optional    Description
--name network_name_in_Serengeti    Mandatory    Delete the specified network in Serengeti.
network list Command
The network list command lets you view a list of available networks in Serengeti. The name, port group in
vSphere, IP address assignment type, assigned IP address, and so on appear.
Table 7‑49.
Parameter    Mandatory/Optional    Description
--name network_name_in_Serengeti    Optional    Name of network to display.
--detail    Optional    List network details.
network modify Command
The network modify command lets you reconfigure a Serengeti static IP network by adding IP address
segments to it. You might need to add IP address segments so that there is enough capacity for a cluster that
you want to create.
NOTE If your network uses static IP addresses, be sure that the addresses are not occupied before you add
the network.
Table 7‑50.
Parameter    Mandatory/Optional    Description
--name network_name_in_Serengeti    Mandatory    Modify the specified static IP network in Serengeti.
--addIP IP_range    Mandatory    IP address segments, in the format xx.xx.xx.xx-xx[,xx]*.
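To judge whether a static IP network has enough capacity for a cluster, it can help to count the addresses that an IP_range value covers. The following Python sketch is illustrative and is not part of the Serengeti CLI. It rests on an assumption about the xx.xx.xx.xx-xx[,xx]* format: each comma-separated segment is read as either a single IPv4 address or a starting address followed by a hyphen and the last octet of the final address. The function name expand_ip_range is hypothetical.

```python
import ipaddress
from typing import List


def expand_ip_range(ip_range: str) -> List[str]:
    """Expand a range string such as '192.168.1.10-12' into individual addresses.

    Assumption: segments are comma separated, and a segment is either one
    IPv4 address or 'start_address-end_octet'.
    """
    addresses = []
    for segment in ip_range.split(","):
        start_text, _, end_text = segment.strip().partition("-")
        start = ipaddress.IPv4Address(start_text)
        if end_text:
            start_octet = int(start) & 0xFF
            end_octet = int(end_text)
            if not start_octet <= end_octet <= 255:
                raise ValueError(f"invalid end of range in segment {segment!r}")
            count = end_octet - start_octet + 1
        else:
            count = 1
        addresses.extend(str(start + offset) for offset in range(count))
    return addresses


if __name__ == "__main__":
    expanded = expand_ip_range("192.168.1.10-12,192.168.1.20")
    print(len(expanded), expanded)
```

Comparing the length of the returned list with the number of nodes you plan to deploy gives a rough capacity check before you run network modify.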
pig script Command
The pig script command lets you run a Pig or PigLatin script.
Before you can run the pig script command in a Command-Line Interface session, or after the 30 minute
session timeout, you must run the cluster target command.
Table 7‑51.
Parameter    Mandatory/Optional    Description
--location script_path    Mandatory    Name of the script to execute.
resourcepool Commands
The resourcepool {*} commands let you manage resource pools.
resourcepool add Command
The resourcepool add command lets you add a vSphere resource pool to Serengeti.
When you add a resource pool in Serengeti, it represents the actual vSphere resource pool as recognized by
vCenter Server. This symbolic representation enables you to use the Serengeti resource pool name, instead
of the full path of the resource pool in vCenter Server, in cluster specification files.
Table 7‑52.
Parameter    Mandatory/Optional    Description
--name resource_pool_name_in_Serengeti    Mandatory    Name of resource pool to add.
--vccluster vSphere_cluster_of_the_resource_pool    Optional    Name of vSphere cluster that contains the resource pool.
--vcrp vSphere_resource_pool_name    Mandatory    vSphere resource pool.
resourcepool delete Command
The resourcepool delete command lets you remove a resource pool from Serengeti.
Table 7‑53.
Parameter    Mandatory/Optional    Description
--name resource_pool_name_in_Serengeti    Mandatory    Resource pool to remove.
resourcepool list Command
The resourcepool list command lets you view a list of Serengeti resource pools. If you do not specify a
name, all Serengeti resource pools are listed.
Table 7‑54.
Parameter    Mandatory/Optional    Description
--name resource_pool_name_in_Serengeti    Optional    Name and path of resource pool to list.
--detail    Optional    Include resource pool details.
topology Commands
The topology {*} commands let you manage cluster topology.
topology list Command
The topology list command lets you list the RACK-HOSTS mapping topology stored in Serengeti.
There are no command parameters.
topology upload Command
The topology upload command lets you upload a rack-hosts mapping topology file to Serengeti. The
uploaded file overwrites any previous file.
The file format for every line is: rackname: hostname1, hostname2…
Table 7‑55.
Parameter    Mandatory/Optional    Description
--fileName topology_file_name    Mandatory    Name of topology file.
--yes    Optional    Answer Y to Y/N confirmation. If not specified, manually type y or n.
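Before you upload a topology file, you might want to confirm that every line follows the rackname: hostname1, hostname2… format. The following Python sketch parses text in that format into a rack-to-hosts mapping. It is illustrative only; the function name parse_topology and the sample rack and host names are hypothetical, and skipping blank lines is an assumption.

```python
from typing import Dict, List


def parse_topology(text: str) -> Dict[str, List[str]]:
    """Parse lines of the form 'rackname: hostname1, hostname2' into a dict."""
    mapping: Dict[str, List[str]] = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        rack, separator, hosts = line.partition(":")
        if not separator or not rack.strip():
            raise ValueError(f"line {line_number}: expected 'rackname: host1, host2'")
        mapping[rack.strip()] = [h.strip() for h in hosts.split(",") if h.strip()]
    return mapping


if __name__ == "__main__":
    sample = "rack1: host1.example.com, host2.example.com\nrack2: host3.example.com\n"
    print(parse_topology(sample))
```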
Index
A
accessing, Command-Line Interface 7
adding
datastores 15, 86
networks 16
resource pools 14
topology 23
adding clusters, with an application manager 45
adding a software management server, with the
nodes 48
topology-aware 23, 35
vSphere HA-protected 27
with an application manager 45
with assigned networks 47
with assigned resources 48
with available distributions 46
creating HBase only clusters, with the CLI 26
customized clusters, creating 50