This document supports the version of each product listed and
supports all subsequent versions until the document is
replaced by a new edition. To check for more recent editions
of this document, see http://www.vmware.com/support/pubs.
EN-001513-00
VMware vSphere Big Data Extensions Command-Line Interface Guide
You can find the most up-to-date technical documentation on the VMware Web site at:
http://www.vmware.com/support/
The VMware Web site also provides the latest product updates.
If you have comments about this documentation, submit your feedback to:
Create a Default HBase Cluster with the Serengeti Command-Line Interface 36
Create an HBase Cluster with vSphere HA Protection with the Serengeti Command-Line Interface 37

4 Managing Hadoop and HBase Clusters 41
Stop and Start a Hadoop or HBase Cluster with the Serengeti Command-Line Interface 42
Scale Out a Hadoop or HBase Cluster with the Serengeti Command-Line Interface 42
Scale CPU and RAM with the Serengeti Command-Line Interface 43
Reconfigure a Hadoop or HBase Cluster with the Serengeti Command-Line Interface 43
About Resource Usage and Elastic Scaling 45
Delete a Hadoop or HBase Cluster with the Serengeti Command-Line Interface 51
About vSphere High Availability and vSphere Fault Tolerance 51
Reconfigure a Node Group with the Serengeti Command-Line Interface 51
Recover from Disk Failure with the Serengeti Command-Line Interface Client 51

5 Monitoring the Big Data Extensions Environment 53
View Available Hadoop Distributions with the Serengeti Command-Line Interface 53
View Provisioned Hadoop and HBase Clusters with the Serengeti Command-Line Interface 54
View Datastores with the Serengeti Command-Line Interface 54
View Networks with the Serengeti Command-Line Interface 54
View Resource Pools with the Serengeti Command-Line Interface 55

6 Using Hadoop Clusters from the Serengeti Command-Line Interface 57
Run HDFS Commands with the Serengeti Command-Line Interface 57
Run MapReduce Jobs with the Serengeti Command-Line Interface 58
Run Pig and PigLatin Scripts with the Serengeti Command-Line Interface 58
Run Hive and Hive Query Language Scripts with the Serengeti Command-Line Interface 59

7 Cluster Specification Reference 61
Cluster Specification File Requirements 61
Cluster Definition Requirements 62
Annotated Cluster Specification File 62
Cluster Specification Attribute Definitions 66
White Listed and Black Listed Hadoop Attributes 68
Convert Hadoop XML Files to Serengeti JSON Files 70

8 Serengeti CLI Command Reference 71
cfg Commands 72
cluster Commands 74
connect Command 80
datastore Commands 81
disconnect Command 82
distro list Command 82
fs Commands 82
hive script Command 88
mr Commands 89
network Commands 92
pig script Command 94
resourcepool Commands 94
topology Commands 95

Index 97
About This Book
VMware vSphere Big Data Extensions Command-Line Interface Guide describes how to use the Serengeti
Command-Line Interface (CLI) to manage the vSphere resources that you use to create Hadoop and HBase
clusters, and how to create, manage, and monitor Hadoop and HBase clusters with the Serengeti CLI.
VMware vSphere Big Data Extensions Command-Line Interface Guide also describes how to perform Hadoop and
HBase operations with the Serengeti CLI, and provides cluster specification and Serengeti CLI command
references.
Intended Audience
This guide is for system administrators and developers who want to use Serengeti to deploy and manage
Hadoop clusters. To successfully work with Serengeti, you should be familiar with Hadoop and VMware
vSphere®.
VMware Technical Publications Glossary
VMware Technical Publications provides a glossary of terms that might be unfamiliar to you. For definitions
of terms as they are used in VMware technical documentation, go to
http://www.vmware.com/support/pubs.
1 Using the Serengeti Remote Command-Line Interface Client
The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management Server to
deploy, manage, and use Hadoop.
Access the Serengeti CLI By Using the Remote Command-Line Interface Client
You can access the Serengeti Command-Line Interface (CLI) to perform Serengeti administrative tasks with
the Serengeti Remote CLI Client.
IMPORTANT You can run Hadoop commands from the Serengeti CLI only on a cluster running the Apache
Hadoop 1.2.1 distribution. To run Hadoop administrative commands such as cfg, fs, mr, pig, and hive for
clusters running other Hadoop distributions, use a Hadoop client node to run these commands.
Prerequisites
- Use the vSphere Web Client to log in to the vCenter Server on which you deployed the Serengeti vApp.
- Verify that the Serengeti vApp deployment was successful and that the Management Server is running.
- Verify that you have the correct password to log in to the Serengeti CLI. See the VMware vSphere Big Data
  Extensions Administrator's and User's Guide. The Serengeti CLI uses its vCenter Server credentials.
- Verify that the Java Runtime Environment (JRE) is installed in your environment and that its location is
  in your PATH environment variable.
Procedure
1. Open a Web browser to connect to the Serengeti Management Server cli directory.
   http://ip_address/cli
2. Download the ZIP file for your version and build.
   The filename is in the format VMware-Serengeti-cli-version_number-build_number.ZIP.
3. Unzip the download.
   The download includes the following components.
   - The serengeti-cli-version_number JAR file, which includes the Serengeti Remote CLI Client.
   - The samples directory, which includes sample cluster configurations.
   - Libraries in the lib directory.
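The steps above can be sketched as a short shell session. The Management Server address, version, and build numbers below are placeholders, and the assumption that the archive unpacks into a cli directory is inferred from step 5 of this procedure.

```shell
# Hypothetical walk-through of steps 1-3 (placeholder address and version/build).
wget http://10.1.1.10/cli/VMware-Serengeti-cli-2.0.0-12345.zip
unzip VMware-Serengeti-cli-2.0.0-12345.zip
# The unpacked cli directory holds the JAR, the samples directory, and lib.
ls cli
```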
4. Open a command shell, and change to the directory where you unzipped the package.
5. Change to the cli directory, and run the following command to enter the Serengeti CLI.
   - For any language other than French or German, run the following command.
     java -jar serengeti-cli-version_number.jar
   - For French or German, which use code page 850 (CP 850) language encoding when
     running the Serengeti CLI from a Windows command console, run the following command.
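The command for the French and German case is absent from this copy of the document. A likely form, assuming the standard Java file.encoding system property is used to force CP 850 output (the exact property value is an assumption, not confirmed by this text):

```shell
# Assumed invocation; the -Dfile.encoding value is inferred from the
# CP 850 code page mentioned above and may differ in the original guide.
java -Dfile.encoding=cp850 -jar serengeti-cli-version_number.jar
```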
You must run the connect host command every time you begin a CLI session, and again after the 30-minute
session timeout. If you do not run this command, you cannot run any other commands.
a. Run the connect command.
   connect --host xx.xx.xx.xx:8443
b. At the prompt, type your user name, which might be different from your login credentials for the
   Serengeti Management Server.
   NOTE If you do not create a user name and password for the Serengeti Command-Line Interface
   Client, you can use the default vCenter Server administrator credentials. The Serengeti Command-Line
   Interface Client uses the vCenter Server login credentials with read permissions on the
   Serengeti Management Server.
c. At the prompt, type your password.
A command shell opens, and the Serengeti CLI prompt appears. You can use the help command to get help
with Serengeti commands and command syntax.
- To display a list of available commands, type help.
- To get help for a specific command, append the name of the command to the help command.
  help cluster create
- Press Tab to complete a command.
2 Managing vSphere Resources for Hadoop and HBase Clusters
Big Data Extensions lets you manage the resource pools, datastores, and networks that you use in the
Hadoop and HBase clusters that you create.
- Add a Resource Pool with the Serengeti Command-Line Interface on page 10
  You add resource pools to make them available for use by Hadoop clusters. Resource pools must be
  located at the top level of a cluster. Nested resource pools are not supported.
- Remove a Resource Pool with the Serengeti Command-Line Interface on page 10
  You can remove resource pools from Serengeti that are not in use by a Hadoop cluster. You remove
  resource pools when you do not need them or if you want the Hadoop clusters you create in the
  Serengeti Management Server to be deployed under a different resource pool. Removing a resource
  pool removes its reference in vSphere. The resource pool is not deleted.
- Add a Datastore with the Serengeti Command-Line Interface on page 10
  You can add shared and local datastores to the Serengeti server to make them available to Hadoop
  clusters.
- Remove a Datastore with the Serengeti Command-Line Interface on page 11
  You can remove any datastore from Serengeti that is not referenced by any Hadoop clusters.
  Removing a datastore removes only the reference to the vCenter Server datastore. The datastore itself
  is not deleted.
- Add a Network with the Serengeti Command-Line Interface on page 11
  You add networks to Serengeti to make their IP addresses available to Hadoop clusters. A network is a
  port group, as well as a means of accessing the port group through an IP address.
- Reconfigure a Static IP Network with the Serengeti Command-Line Interface on page 12
  You can reconfigure a Serengeti static IP network by adding IP address segments to it. You might need
  to add IP address segments so that there is enough capacity for a cluster that you want to create.
- Remove a Network with the Serengeti Command-Line Interface on page 12
  You can remove networks from Serengeti that are not referenced by any Hadoop clusters. Removing
  an unused network frees the IP addresses for reuse.
Add a Resource Pool with the Serengeti Command-Line Interface
You add resource pools to make them available for use by Hadoop clusters. Resource pools must be located
at the top level of a cluster. Nested resource pools are not supported.
When you add a resource pool to Big Data Extensions it symbolically represents the actual vSphere resource
pool as recognized by vCenter Server. This symbolic representation lets you use the Big Data Extensions
resource pool name, instead of the full path of the resource pool in vCenter Server, in cluster specification
files.
Prerequisites
Deploy Big Data Extensions.
Procedure
1. Access the Serengeti Command-Line Interface client.
2. Run the resourcepool add command.
   The --vcrp parameter is optional.
   This example adds a Serengeti resource pool named myRP to the vSphere rp1 resource pool that is
   contained by the cluster1 vSphere cluster.
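The example command itself is missing from this copy of the document. A plausible reconstruction, based on the names in the description (the --vccluster flag name is an assumption; --vcrp is the optional parameter mentioned above):

```shell
# Hypothetical reconstruction of the missing example: adds Serengeti resource
# pool myRP, mapped to vSphere resource pool rp1 in vSphere cluster cluster1.
resourcepool add --name myRP --vcrp rp1 --vccluster cluster1
```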
After you add a resource pool to Big Data Extensions, do not rename the resource pool in vSphere. If you
rename it, you cannot perform Serengeti operations on clusters that use that resource pool.
Remove a Resource Pool with the Serengeti Command-Line Interface
You can remove resource pools from Serengeti that are not in use by a Hadoop cluster. You remove resource
pools when you do not need them or if you want the Hadoop clusters you create in the Serengeti
Management Server to be deployed under a different resource pool. Removing a resource pool removes its
reference in vSphere. The resource pool is not deleted.
Procedure
1. Access the Serengeti Command-Line Interface client.
2. Run the resourcepool delete command.
If the command fails because the resource pool is referenced by a Hadoop cluster, you can use the
resourcepool list command to see which cluster is referencing the resource pool.
This example deletes the resource pool named myRP.
resourcepool delete --name myRP
Add a Datastore with the Serengeti Command-Line Interface
You can add shared and local datastores to the Serengeti server to make them available to Hadoop clusters.
Procedure
1. Access the Serengeti CLI.
2. Run the datastore add command.
This example adds a new, local storage datastore named myLocalDS. The --spec parameter’s value,
local*, is a wildcard specifying a set of vSphere datastores. All vSphere datastores whose names begin
with “local” are added and managed as a whole by Serengeti.
datastore add --name myLocalDS --spec local* --type LOCAL
What to do next
After you add a datastore to Big Data Extensions, do not rename the datastore in vSphere. If you rename it,
you cannot perform Serengeti operations on clusters that use that datastore.
Remove a Datastore with the Serengeti Command-Line Interface
You can remove any datastore from Serengeti that is not referenced by any Hadoop clusters. Removing a
datastore removes only the reference to the vCenter Server datastore. The datastore itself is not deleted.
You remove datastores if you do not need them or if you want to deploy the Hadoop clusters that you create
in the Serengeti Management Server under a different datastore.
Procedure
1. Access the Serengeti CLI.
2. Run the datastore delete command.
If the command fails because the datastore is referenced by a Hadoop cluster, you can use the datastore
list command to see which cluster is referencing the datastore.
This example deletes the myDS datastore.
datastore delete --name myDS
Add a Network with the Serengeti Command-Line Interface
You add networks to Serengeti to make their IP addresses available to Hadoop clusters. A network is a port
group, as well as a means of accessing the port group through an IP address.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add the
network.
Procedure
1. Access the Serengeti CLI.
2. Run the network add command.
This example adds a network named myNW to the 10PG vSphere port group. Virtual machines that use
this network use DHCP to obtain the IP addresses.
network add --name myNW --portGroup 10PG --dhcp
This example adds a network named myNW to the 10PG vSphere port group. Hadoop nodes use
addresses in the 192.168.1.2-100 IP address range, the DNS server IP address is 10.111.90.2, the gateway
address is 192.168.1.1, and the subnet mask is 255.255.255.0.
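The command for this static IP example is missing from this copy. A plausible form, assuming flag names parallel to the --dhcp example above (--ip, --dns, --gateway, and --mask are assumptions inferred from the values in the description):

```shell
# Hypothetical reconstruction of the missing static IP example.
network add --name myNW --portGroup 10PG --ip 192.168.1.2-100 \
  --dns 10.111.90.2 --gateway 192.168.1.1 --mask 255.255.255.0
```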
After you add a network to Big Data Extensions, do not rename it in vSphere. If you rename the network,
you cannot perform Serengeti operations on clusters that use that network.
Reconfigure a Static IP Network with the Serengeti Command-Line Interface
You can reconfigure a Serengeti static IP network by adding IP address segments to it. You might need to
add IP address segments so that there is enough capacity for a cluster that you want to create.
If the IP range that you specify includes IP addresses that are already in the network, Serengeti ignores the
duplicated addresses. The remaining addresses in the specified range are added to the network. If the
network is already used by a cluster, the cluster can use the new IP addresses after you add them to the
network. If only part of the IP range is used by a cluster, the unused IP address can be used when you create
a new cluster.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add the
network.
Procedure
1. Access the Serengeti CLI.
2. Run the network modify command.
This example adds IP addresses from 192.168.1.2 to 192.168.1.100 to a network named myNetwork.
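The example command itself is missing here. A plausible sketch, assuming the network modify command accepts an --addIP parameter (the flag name is an assumption, not confirmed by this text):

```shell
# Hypothetical reconstruction: add the IP address range to the existing network.
network modify --name myNetwork --addIP 192.168.1.2-100
```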
Remove a Network with the Serengeti Command-Line Interface
You can remove networks from Serengeti that are not referenced by any Hadoop clusters. Removing an
unused network frees the IP addresses for reuse.
Procedure
1. Access the Serengeti CLI.
2. Run the network delete command.
network delete --name network_name
If the command fails because the network is referenced by a Hadoop cluster, you can use the network
list --detail command to see which cluster is referencing the network.
3 Creating Hadoop and HBase Clusters
Big Data Extensions lets you create and deploy Hadoop and HBase clusters. A Hadoop or HBase cluster is a
special type of computational cluster designed specifically for storing and analyzing large amounts of
unstructured data in a distributed computing environment.
The resource requirements are different for clusters created with the Serengeti Command-Line Interface and
the Big Data Extensions plug-in for the vSphere Web Client because the clusters use different default
templates. The default clusters created through the Serengeti Command-Line Interface are targeted for
Project Serengeti users and proof-of-concept applications, and are smaller than the Big Data Extensions
plug-in templates, which are targeted for larger deployments for commercial use.
Additionally, some deployment configurations require more resources than other configurations. For
example, if you create a Greenplum HD 1.2 cluster, you cannot use the SMALL size virtual machine. If you
create a default MapR or Greenplum HD cluster through the Serengeti Command-Line Interface, at least
550GB of storage and 55GB of memory are recommended. For other Hadoop distributions, at least 350GB of
storage and 35GB of memory are recommended.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the cluster's
virtual machine automatic migration. Although this prevents vSphere from automatically migrating the
virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using
the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters.
Performing such management functions outside of the Big Data Extensions environment can make it
impossible for you to perform some Big Data Extensions operations, such as disk failure recovery.
- About Hadoop and HBase Cluster Deployment Types on page 17
  Big Data Extensions lets you deploy several types of Hadoop and HBase clusters. You need to know
  about the types of clusters that you can create.
- Serengeti's Default Hadoop Cluster Configuration on page 18
  For basic Hadoop deployments, such as proof of concept projects, you can use Serengeti's default
  Hadoop cluster configuration for clusters that are created with the Command-Line Interface.
- Create a Default Serengeti Hadoop Cluster with the Serengeti Command-Line Interface on page 18
  You can create as many clusters as you want in your Serengeti environment, but your environment
  must meet all prerequisites.
- Create a Cluster with a Custom Administrator Password with the Serengeti Command-Line Interface on page 19
  When you create a cluster, you can assign a custom administrator password to all the nodes in the
  cluster. Custom administrator passwords let you directly log in to the cluster's nodes instead of having
  to first log in to the Serengeti Management server.
- Create a Cluster with an Available Distribution with the Serengeti Command-Line Interface on page 19
  You can choose which Hadoop distribution to use when you deploy a cluster. If you do not specify a
  Hadoop distribution, the resulting cluster includes the default distribution, Apache Hadoop.
- Create a Hadoop Cluster with Assigned Resources with the Serengeti Command-Line Interface on page 20
  By default, when you use Serengeti to deploy a Hadoop cluster, the cluster might contain any or all
  available resources: vCenter Server resource pool for the virtual machine's CPU and memory,
  datastores for the virtual machine's storage, and a network. You can assign which resources the cluster
  uses by specifying specific resource pools, datastores, and a network when you create the Hadoop
  cluster.
- Create a Cluster with Multiple Networks with the Serengeti Command-Line Interface on page 21
  When you create a cluster, you can distribute the management, HDFS, and MapReduce traffic to
  separate networks. You might want to use separate networks to improve performance or to isolate
  traffic for security reasons.
- Create a MapReduce v2 (YARN) Cluster with the Serengeti Command-Line Interface on page 21
  You can create a MapReduce v2 (YARN) cluster with the Serengeti Command-Line Interface.
- Create a Customized Hadoop or HBase Cluster with the Serengeti Command-Line Interface on page 22
  You can create clusters that are customized for your requirements, including the number of nodes,
  virtual machine RAM and disk size, the number of CPUs, and so on.
- Create a Hadoop Cluster with Any Number of Master, Worker, and Client Nodes on page 23
  You can create a Hadoop cluster with any number of master, worker, and client nodes.
- Create a Data-Compute Separated Cluster with No Node Placement Constraints on page 24
  You can create a cluster with separate data and compute nodes, without node placement constraints.
- Create a Data-Compute Separated Cluster with Placement Policy Constraints on page 25
  You can create a cluster with separate data and compute nodes, and define placement policy
  constraints to distribute the nodes among the virtual machines as you want.
- Create a Compute-Only Cluster with the Serengeti Command-Line Interface on page 27
  You can create compute-only clusters to run MapReduce jobs on existing HDFS clusters, including
  storage solutions that serve as an external HDFS.
- Create a Basic Cluster with the Serengeti Command-Line Interface on page 29
  You can create a basic cluster in your Serengeti environment. A basic cluster is a group of virtual
  machines provisioned and managed by Serengeti. Serengeti helps you to plan and provision the
  virtual machines to your specifications. You can use the basic cluster's virtual machines to install Big
  Data applications.
- About Cluster Topology on page 31
  You can improve workload balance across your cluster nodes, and improve performance and
  throughput, by specifying how Hadoop virtual machines are placed using topology awareness. For
  example, you can have separate data and compute nodes, and improve performance and throughput
  by placing the nodes on the same set of physical hosts.
- Create a Cluster with Topology Awareness with the Serengeti Command-Line Interface on page 33
  To achieve a balanced workload or to improve performance and throughput, you can control how
  Hadoop virtual machines are placed by adding topology awareness to the Hadoop clusters. For
  example, you can have separate data and compute nodes, and improve performance and throughput
  by placing the nodes on the same set of physical hosts.
- Create a Data-Compute Separated Cluster with Topology Awareness and Placement Constraints on page 34
  You can create clusters with separate data and compute nodes, and define topology and placement
  policy constraints to distribute the nodes among the physical racks and the virtual machines.
- Serengeti's Default HBase Cluster Configuration on page 36
  HBase clusters are required for you to build big table applications. To run HBase MapReduce jobs,
  configure the HBase cluster to include JobTracker nodes or TaskTracker nodes.
- Create a Default HBase Cluster with the Serengeti Command-Line Interface on page 36
  Serengeti supports deploying HBase clusters on HDFS.
- Create an HBase Cluster with vSphere HA Protection with the Serengeti Command-Line Interface on page 37
  You can create HBase clusters with separated Hadoop NameNode and HBase Master roles, and
  configure vSphere HA protection for the Masters.
About Hadoop and HBase Cluster Deployment Types
Big Data Extensions lets you deploy several types of Hadoop and HBase clusters. You need to know about
the types of clusters that you can create.
You can create the following types of clusters.
Basic Hadoop Cluster: You can create a simple Hadoop deployment for proof of concept projects
and other small scale data processing tasks using the basic Hadoop cluster.

HBase Cluster: You can create an HBase cluster. To run HBase MapReduce jobs, configure
the HBase cluster to include JobTracker or TaskTracker nodes.

Data-Compute Separated Hadoop Cluster: You can separate the data and compute nodes in a Hadoop cluster, and you
can control how nodes are placed on your environment's vSphere ESXi hosts.

Compute-Only Hadoop Cluster: You can create a compute-only cluster to run MapReduce jobs. Compute-only
clusters run only MapReduce services that read data from external
HDFS clusters and that do not need to store data.

Customized Cluster: You can use an existing cluster specification file to create clusters using the
same configuration as your previously created clusters. You can also edit the
file to customize the cluster configuration.
Hadoop Distributions Supporting MapReduce v1 and MapReduce v2 (YARN)
If the Hadoop distribution you use supports both MapReduce v1 and MapReduce v2 (YARN), the default
Hadoop cluster configuration creates a MapReduce v2 cluster.
In addition, if you are using two different versions of a vendor's Hadoop distribution, and both versions
support MapReduce v1 and MapReduce v2, the cluster you create using the latest version with the default
Hadoop cluster will use MapReduce v2. Clusters you create with the earlier Hadoop version will use
MapReduce v1. For example, if you have both Cloudera CDH 5 and CDH 4 installed within Big Data
Extensions, clusters you create with CDH 5 will use MapReduce v2, and clusters you create with CDH 4 will
use MapReduce v1.
Serengeti’s Default Hadoop Cluster Configuration
For basic Hadoop deployments, such as proof of concept projects, you can use Serengeti’s default Hadoop
cluster configuration for clusters that are created with the Command-Line Interface.
The resulting cluster deployment consists of the following nodes and virtual machines:
- One master node virtual machine with NameNode and JobTracker services.
- Three worker node virtual machines, each with DataNode and TaskTracker services.
- One client node virtual machine containing the Hadoop client environment: the Hadoop client shell,
  Pig, and Hive.
Hadoop Distributions Supporting MapReduce v1 and MapReduce v2 (YARN)
If the Hadoop distribution you use supports both MapReduce v1 and MapReduce v2 (YARN), the default
Hadoop cluster configuration creates a MapReduce v2 cluster.
In addition, if you are using two different versions of a vendor's Hadoop distribution, and both versions
support MapReduce v1 and MapReduce v2, the cluster you create using the latest version with the default
Hadoop cluster will use MapReduce v2. Clusters you create with the earlier Hadoop version will use
MapReduce v1. For example, if you have both Cloudera CDH 5 and CDH 4 installed within Big Data
Extensions, clusters you create with CDH 5 will use MapReduce v2, and clusters you create with CDH 4 will
use MapReduce v1.
Create a Default Serengeti Hadoop Cluster with the Serengeti Command-Line Interface
You can create as many clusters as you want in your Serengeti environment, but your environment must
meet all prerequisites.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Access the Serengeti CLI.
2. Deploy a default Serengeti Hadoop cluster on vSphere.
   cluster create --name cluster_name
   The only valid characters for cluster names are alphanumeric and underscores. When you choose the
   cluster name, also consider the applicable vApp name. Together, the vApp and cluster names must be
   < 80 characters.
During the deployment process, real-time progress updates appear on the command-line.
What to do next
After the deployment finishes, you can run Hadoop commands and view the IP addresses of the Hadoop
node virtual machines from the Serengeti CLI.
Create a Cluster with a Custom Administrator Password with the Serengeti Command-Line Interface
When you create a cluster, you can assign a custom administrator password to all the nodes in the cluster.
Custom administrator passwords let you directly log in to the cluster's nodes instead of having to first log in
to the Serengeti Management server.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster create command and include the --password parameter.
   cluster create --name cluster_name --password
3. Enter your custom password, and enter it again.
   Passwords are from 8 to 128 characters, and include only alphanumeric characters ([0-9, a-z, A-Z]) and
   the following special characters: _ @ # $ % ^ & *
Your custom password is assigned to all the nodes in the cluster.
Create a Cluster with an Available Distribution with the Serengeti Command-Line Interface
You can choose which Hadoop distribution to use when you deploy a cluster. If you do not specify a
Hadoop distribution, the resulting cluster includes the default distribution, Apache Hadoop.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster create command, and include the --distro parameter.
The --distro parameter’s value must match a distribution name displayed by the distro list
command.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must
configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS
and FQDN settings, the cluster creation process might fail or the cluster is created but does not
function.
This example deploys a cluster with the Cloudera CDH distribution:
cluster create --name clusterName --distro cdh
This example creates a customized cluster named mycdh that uses the CDH4 Hadoop distribution, and is
configured according to
the /opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json sample cluster
specification file. In this sample file, nameservice0 and nameservice1 are federated. That is,
nameservice0 and nameservice1 are independent and do not require coordination with each other. The
NameNode nodes in the nameservice0 node group are HDFS2 HA enabled. In Serengeti, name node
group names are used as service names for HDFS2.
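The command for the mycdh example is missing from this copy of the document. A plausible form, assuming a --specFile parameter (consistent with the YARN section later in this guide) and a cdh4 distribution name (an assumption; confirm the name with the distro list command):

```shell
# Hypothetical reconstruction of the missing example.
cluster create --name mycdh --distro cdh4 \
  --specFile /opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json
```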
Create a Hadoop Cluster with Assigned Resources with the Serengeti Command-Line Interface
By default, when you use Serengeti to deploy a Hadoop cluster, the cluster might contain any or all
available resources: vCenter Server resource pool for the virtual machine's CPU and memory, datastores for
the virtual machine's storage, and a network. You can assign which resources the cluster uses by specifying
specific resource pools, datastores, and/or a network when you create the Hadoop cluster.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop
  distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1. Access the Serengeti CLI.
2. Run the cluster create command, and specify any or all of the command's resource parameters.
This example deploys a cluster named myHadoop on the myDS datastore, under the myRP resource pool,
and uses the myNW network for virtual machine communications.
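The example command itself is missing from this copy. A plausible reconstruction from the description (the --rpNames and --dsNames flag names are assumptions; --networkName appears elsewhere in this guide):

```shell
# Hypothetical reconstruction of the missing example: cluster myHadoop on
# datastore myDS, under resource pool myRP, using network myNW.
cluster create --name myHadoop --rpNames myRP --dsNames myDS --networkName myNW
```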
Create a Cluster with Multiple Networks with the Serengeti Command-Line Interface
When you create a cluster, you can distribute the management, HDFS, and MapReduce traffic to separate
networks. You might want to use separate networks to improve performance or to isolate traffic for security
reasons.
For optimal performance, use the same network for HDFS and MapReduce traffic in Hadoop and
Hadoop+HBase clusters. HBase clusters use the HDFS network for traffic related to the HBase Master and
HBase RegionServer services.
IMPORTANT You cannot configure multiple networks for clusters that use the MapR Hadoop distribution.
Prerequisites
n Deploy the Serengeti vApp.
n Ensure that you have adequate resources allocated to run the Hadoop cluster.
n To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command and include the --networkName, --hdfsNetworkName, and --mapredNetworkName parameters, as applicable.
If you omit an optional network parameter, the traffic associated with that parameter is routed
on the management network that you specify with the --networkName parameter.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must
configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS
and FQDN settings, the cluster creation process might fail or the cluster is created but does not
function.
The cluster's management, HDFS, and MapReduce traffic is distributed among the specified networks.
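A sketch of the step 2 command, using hypothetical network names, might look like this; traffic types whose parameter you omit are carried on the --networkName network:

```
cluster create --name myHadoop --networkName mgmtNW --hdfsNetworkName hdfsNW --mapredNetworkName mapredNW
```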
Create a MapReduce v2 (YARN) Cluster with the Serengeti Command-Line Interface
You can create a MapReduce v2 (YARN) cluster with the Serengeti Command-Line Interface.
When you create a Hadoop cluster with the Serengeti Command-Line Interface, by default you create a
MapReduce v1 cluster. To create a MapReduce v2 (YARN) cluster, create a cluster specification file modeled
after the /opt/serengeti/samples/default_hadoop_yarn_cluster.json file, and specify the --specFile
parameter and your cluster specification file in the cluster create ... command.
Prerequisites
n Deploy the Serengeti vApp.
n Ensure that you have adequate resources allocated to run the Hadoop cluster.
n To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
VMware, Inc. 21
VMware vSphere Big Data Extensions Command-Line Interface Guide
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create ... command.
This example creates a customized MapReduce v2 (YARN) cluster according to the /opt/serengeti/samples/default_hadoop_yarn_cluster.json sample cluster specification file.
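The command for this example might look like the following sketch; the cluster name myYarn is a placeholder:

```
cluster create --name myYarn --specFile /opt/serengeti/samples/default_hadoop_yarn_cluster.json
```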
Create a Customized Hadoop or HBase Cluster with the Serengeti
Command-Line Interface
You can create clusters that are customized for your requirements, including the number of nodes, virtual
machine RAM and disk size, the number of CPUs, and so on.
The Serengeti package includes several annotated sample cluster specification files that you can use as
models when you create your custom specification files.
n In the Serengeti Management Server, the sample cluster specification files are in /opt/serengeti/samples.
n If you use the Serengeti Remote CLI client, the sample specification files are in the client directory.
Changing a node group role might cause the cluster creation process to fail. For example, workable clusters
require a NameNode, so if there are no NameNode nodes after you change node group roles, you cannot
create a cluster.
Prerequisites
n Deploy the Serengeti vApp.
n Ensure that you have adequate resources allocated to run the Hadoop cluster.
n To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics such as the node groups.
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the cluster specification file.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must
configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS
and FQDN settings, the cluster creation process might fail or the cluster is created but does not
function.
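Step 3 might look like the following sketch, where the cluster name and the specification file path are placeholders:

```
cluster create --name myCustomCluster --specFile /path/to/my_cluster_spec.json
```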
Chapter 3 Creating Hadoop and HBase Clusters
Create a Hadoop Cluster with Any Number of Master, Worker, and
Client Nodes
You can create a Hadoop cluster with any number of master, worker, and client nodes.
Prerequisites
n Deploy the Serengeti vApp.
n Ensure that you have adequate resources allocated to run the Hadoop cluster.
n To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics, including the node groups.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must
configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS
and FQDN settings, the cluster creation process might fail or the cluster is created but does not
function.
In this example, the cluster has one MEDIUM size master virtual machine, five SMALL size worker
virtual machines, and one SMALL size client virtual machine. The instanceNum attribute configures the
number of virtual machines in a node group.
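A nodeGroups fragment matching this example might look like the following sketch; the role lists are illustrative and depend on your Hadoop distribution:

```json
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "instanceType": "MEDIUM"
    },
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instanceNum": 5,
      "instanceType": "SMALL"
    },
    {
      "name": "client",
      "roles": ["hadoop_client"],
      "instanceNum": 1,
      "instanceType": "SMALL"
    }
  ]
}
```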
Create a Data-Compute Separated Cluster with No Node Placement
Constraints
You can create a cluster with separate data and compute nodes, without node placement constraints.
Prerequisites
n Deploy the Serengeti vApp.
n Ensure that you have adequate resources allocated to run the Hadoop cluster.
n To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must
configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS
and FQDN settings, the cluster creation process might fail or the cluster is created but does not
function.
In this example, the cluster has separate data and compute nodes, without node placement constraints.
Four data nodes and eight compute nodes are created and put into individual virtual machines. The
number of nodes is configured by the instanceNum attribute.
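The data and compute node groups for this example might be sketched as follows; the roles shown are illustrative:

```json
{
  "nodeGroups": [
    {
      "name": "data",
      "roles": ["hadoop_datanode"],
      "instanceNum": 4
    },
    {
      "name": "compute",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 8
    }
  ]
}
```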
Create a Data-Compute Separated Cluster with Placement Policy
Constraints
You can create a cluster with separate data and compute nodes, and define placement policy constraints to
distribute the nodes among the virtual machines as you want.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the cluster's
virtual machine automatic migration. Although this prevents vSphere from migrating the virtual machines,
it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the vCenter
Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing such
management functions outside of the Big Data Extensions environment might break the cluster's placement
policy, such as the number of instances per host and the group associations. Even if you do not specify a
placement policy, using vCenter Server to migrate clusters can break the default ROUNDROBIN placement
policy constraints.
Prerequisites
n Deploy the Serengeti vApp.
n Ensure that you have adequate resources allocated to run the Hadoop cluster.
n To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics, including the node groups and placement policy constraints.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must
configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS
and FQDN settings, the cluster creation process might fail or the cluster is created but does not
function.
In this example, the cluster has data-compute separated nodes, and each node group has a
placementPolicy constraint. After a successful provisioning, four data nodes and eight compute nodes
are created and put into individual virtual machines. With the instancePerHost=1 constraint, the four
data nodes are placed on four ESXi hosts. The eight compute nodes are put onto four ESXi hosts: two
nodes on each ESXi host.
This cluster specification requires that you configure datastores and resource pools for at least four
hosts, and that there is sufficient disk space for Serengeti to perform the necessary placements during
deployment.
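The constraints described above might be expressed as the following sketch, modeled on the Serengeti sample specification files; the roles shown are illustrative:

```json
{
  "nodeGroups": [
    {
      "name": "data",
      "roles": ["hadoop_datanode"],
      "instanceNum": 4,
      "placementPolicies": {
        "instancePerHost": 1
      }
    },
    {
      "name": "compute",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 8,
      "placementPolicies": {
        "instancePerHost": 2,
        "groupAssociations": [
          { "reference": "data", "type": "STRICT" }
        ]
      }
    }
  ]
}
```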
Create a Compute-Only Cluster with the Serengeti Command-Line
Interface
You can create compute-only clusters to run MapReduce jobs on existing HDFS clusters, including storage
solutions that serve as an external HDFS.
Prerequisites
n Deploy the Serengeti vApp.
n Ensure that you have adequate resources allocated to run the Hadoop cluster.
n To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file that is modeled on the Serengeti compute_only_cluster.json sample cluster specification file found in the Serengeti cli/samples directory.
2 Add the following code to your new cluster specification file.
For HDFS clusters, set port_num to 8020. For Hadoop 2.0 clusters, such as CDH4 and Pivotal HD
distributions, set port_num to 9000.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must
configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS
and FQDN settings, the cluster creation process might fail or the cluster is created but does not
function.
In this example, the externalHDFS field points to an HDFS. Assign the hadoop_jobtracker role to the
master node group and the hadoop_tasktracker role to the worker node group.
The externalHDFS field conflicts with node groups that have hadoop_namenode and hadoop_datanode
roles. This conflict might cause the cluster creation to fail or, if successfully created, the cluster might
not work correctly. To avoid this problem, define only a single HDFS.
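Putting this example together, a compute-only specification might look like the following sketch; the NameNode host name and instance counts are placeholders, and the port follows the guidance in step 2:

```json
{
  "externalHDFS": "hdfs://namenode-hostname:8020",
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_jobtracker"],
      "instanceNum": 1
    },
    {
      "name": "worker",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 4
    }
  ]
}
```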
Create a Basic Cluster with the Serengeti Command-Line Interface
You can create a basic cluster in your Serengeti environment. A basic cluster is a group of virtual machines
provisioned and managed by Serengeti. Serengeti helps you to plan and provision the virtual machines to
your specifications. You can use the basic cluster's virtual machines to install Big Data applications.
The basic cluster does not install the Big Data application packages used when creating a Hadoop or HBase
cluster. Instead, you can install and manage Big Data applications with third-party application management
tools such as Apache Ambari or Cloudera Manager within your Big Data Extensions environment, and
integrate them with your Hadoop software. The basic cluster does not deploy a Hadoop or HBase cluster; you
must deploy software into the basic cluster's virtual machines using an external third-party application
management tool.
The Serengeti package includes an annotated sample cluster specification file that you can use as an example
when you create your basic cluster specification file. In the Serengeti Management Server, the sample
specification file is located at /opt/serengeti/samples/basic_cluster.json. You can modify the
configuration values in the sample cluster specification file to meet your requirements. The only value you
cannot change is the value assigned to the role for each node group, which must always be basic.
You can deploy a basic cluster with the Big Data Extensions plug-in using a customized cluster specification
file.
To deploy software within the basic cluster virtual machines, use the cluster list --detail command, or
run serengeti-ssh.sh cluster_name, to obtain the IP addresses of the virtual machines. You can then use the IP
addresses with management applications such as Apache Ambari or Cloudera Manager to provision the
virtual machines with software of your choosing. When the management tool needs a user name and
password to connect to the virtual machines, configure it to use the user name serengeti and the password
you specified when you created the basic cluster within Big Data Extensions.
Prerequisites
n Deploy the Serengeti vApp.
n Ensure that you have adequate resources allocated to run the cluster, as well as the Big Data software you intend to deploy.
Procedure
1 Create a specification file to define the basic cluster's characteristics.
You must use the basic role for each node group you define for the basic cluster.
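A basic cluster node group might look like the following sketch; the group name and instance count are placeholders, and the role must always be basic as described above:

```json
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["basic"],
      "instanceNum": 1,
      "instanceType": "SMALL"
    }
  ]
}
```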