VMware vSphere Big Data Extensions - 2.3 User’s Manual

VMware vSphere Big Data Extensions
Command-Line Interface Guide
vSphere Big Data Extensions 2.3
This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions of this document, see http://www.vmware.com/support/pubs.
EN-001702-00
You can find the most up-to-date technical documentation on the VMware Web site at:
http://www.vmware.com/support/
The VMware Web site also provides the latest product updates.
If you have comments about this documentation, submit your feedback to:
docfeedback@vmware.com
Copyright © 2013 – 2015 VMware, Inc. All rights reserved. Copyright and trademark information. This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 United States License
(http://creativecommons.org/licenses/by-nd/3.0/us/legalcode).
VMware, Inc.
3401 Hillview Ave. Palo Alto, CA 94304 www.vmware.com

Contents

About This Book
1 Using the Serengeti Remote Command-Line Interface Client
  Access the Serengeti CLI By Using the Remote CLI Client
  Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client
2 Managing Application Managers
  About Application Managers
  Add an Application Manager by Using the Serengeti Command-Line Interface
  View List of Application Managers by using the Serengeti Command-Line Interface
  Modify an Application Manager by Using the Serengeti Command-Line Interface
  View Supported Distributions for All Application Managers by Using the Serengeti Command-Line Interface
  View Configurations or Roles for Application Manager and Distribution by Using the Serengeti Command-Line Interface
  Delete an Application Manager by Using the Serengeti Command-Line Interface
3 Managing the Big Data Extensions Environment by Using the Serengeti Command-Line Interface
  About Application Managers
  Add a Resource Pool with the Serengeti Command-Line Interface
  Remove a Resource Pool with the Serengeti Command-Line Interface
  Add a Datastore with the Serengeti Command-Line Interface
  Remove a Datastore with the Serengeti Command-Line Interface
  Add a Network with the Serengeti Command-Line Interface
  Remove a Network with the Serengeti Command-Line Interface
  Reconfigure a Static IP Network with the Serengeti Command-Line Interface
  Reconfigure the DNS Type with the Serengeti Command-Line Interface
  Increase Cloning Performance and Resource Usage of Virtual Machines
4 Managing Users and User Accounts
  Create an LDAP Service Configuration File Using the Serengeti Command-Line Interface
  Activate Centralized User Management Using the Serengeti Command-Line Interface
  Create a Cluster With LDAP User Authentication Using the Serengeti Command-Line Interface
  Change User Management Modes Using the Serengeti Command-Line Interface
  Modify LDAP Configuration Using the Serengeti Command-Line Interface
5 Creating Hadoop and HBase Clusters
  About Hadoop and HBase Cluster Deployment Types
  Default Hadoop Cluster Configuration for Serengeti
  Default HBase Cluster Configuration for Serengeti
  About Cluster Topology
  About HBase Clusters
  About MapReduce Clusters
  About Data Compute Clusters
  About Customized Clusters
6 Managing Hadoop and HBase Clusters
  Stop and Start a Cluster with the Serengeti Command-Line Interface
  Scale Out a Cluster with the Serengeti Command-Line Interface
  Scale CPU and RAM with the Serengeti Command-Line Interface
  Reconfigure a Cluster with the Serengeti Command-Line Interface
  Delete a Cluster by Using the Serengeti Command-Line Interface
  About vSphere High Availability and vSphere Fault Tolerance
  Reconfigure a Node Group with the Serengeti Command-Line Interface
  Expanding a Cluster with the Command-Line Interface
  Recover from Disk Failure with the Serengeti Command-Line Interface Client
  Recover a Cluster Node Virtual Machine
  Enter Maintenance Mode to Perform Backup and Restore with the Serengeti Command-Line Interface Client
7 Monitoring the Big Data Extensions Environment
  View List of Application Managers by using the Serengeti Command-Line Interface
  View Available Hadoop Distributions with the Serengeti Command-Line Interface
  View Supported Distributions for All Application Managers by Using the Serengeti Command-Line Interface
  View Configurations or Roles for Application Manager and Distribution by Using the Serengeti Command-Line Interface
  View Provisioned Clusters with the Serengeti Command-Line Interface
  View Datastores with the Serengeti Command-Line Interface
  View Networks with the Serengeti Command-Line Interface
  View Resource Pools with the Serengeti Command-Line Interface
8 Cluster Specification Reference
  Cluster Specification File Requirements
  Cluster Definition Requirements
  Annotated Cluster Specification File
  Cluster Specification Attribute Definitions
  White Listed and Black Listed Hadoop Attributes
  Convert Hadoop XML Files to Serengeti JSON Files
9 Serengeti CLI Command Reference
  appmanager Commands
  cluster Commands
  connect Command
  datastore Commands
  disconnect Command
  distro list Command
  mgmtvmcfg Commands
  network Commands
  resourcepool Commands
  template Commands
  topology Commands
  usermgmt Commands
Index

About This Book

VMware vSphere Big Data Extensions Command-Line Interface Guide describes how to use the Serengeti Command-Line Interface (CLI) to manage the vSphere resources that you use to create Hadoop and HBase clusters, and how to create, manage, and monitor Hadoop and HBase clusters with the VMware Serengeti™ CLI.
VMware vSphere Big Data Extensions Command-Line Interface Guide also describes how to perform Hadoop and HBase operations with the Serengeti CLI, and provides cluster specification and Serengeti CLI command references.
Intended Audience
This guide is for system administrators and developers who want to use Serengeti to deploy and manage Hadoop clusters. To successfully work with Serengeti, you should be familiar with Hadoop and VMware vSphere®.
VMware Technical Publications Glossary
VMware Technical Publications provides a glossary of terms that might be unfamiliar to you. For definitions of terms as they are used in VMware technical documentation, go to
http://www.vmware.com/support/pubs.

Using the Serengeti Remote Command-Line Interface Client 1

The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management Server to deploy, manage, and use Hadoop.
This chapter includes the following topics:
• “Access the Serengeti CLI By Using the Remote CLI Client”
• “Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client”

Access the Serengeti CLI By Using the Remote CLI Client

You can access the Serengeti Command-Line Interface (CLI) to perform Serengeti administrative tasks with the Serengeti Remote CLI Client.
Prerequisites
• Use the VMware vSphere Web Client to log in to the VMware vCenter Server® on which you deployed the Serengeti vApp.
• Verify that the Serengeti vApp deployment was successful and that the Management Server is running.
• Verify that you have the correct password to log in to Serengeti CLI. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
  The Serengeti CLI uses its vCenter Server credentials.
• Verify that the Java Runtime Environment (JRE) is installed in your environment and that its location is in your path environment variable.
Procedure
1 Download the Serengeti CLI package from the Serengeti Management Server.
Open a Web browser and navigate to the following URL: https://server_ip_address/cli/VMware-Serengeti-CLI.zip
2 Download the ZIP file.
The filename is in the format VMware-Serengeti-cli-version_number-build_number.ZIP.
3 Unzip the download.
The download includes the following components.
• The serengeti-cli-version_number JAR file, which includes the Serengeti Remote CLI Client.
• The samples directory, which includes sample cluster configurations.
• Libraries in the lib directory.
4 Open a command shell, and change to the directory where you unzipped the package.
5 Change to the cli directory, and run the following command to enter the Serengeti CLI.
• For any language other than French or German, run the following command.
java -jar serengeti-cli-version_number.jar
• For French or German languages, which use code page 850 (CP 850) language encoding when running the Serengeti CLI from a Windows command console, run the following command.
java -Dfile.encoding=cp850 -jar serengeti-cli-version_number.jar
6 Connect to the Serengeti service.
You must run the connect --host command every time you begin a CLI session, and again after the 30-minute session timeout. If you do not run this command, you cannot run any other commands.
a Run the connect command.
connect --host xx.xx.xx.xx:8443
b At the prompt, type your user name, which might be different from your login credentials for the Serengeti Management Server.
NOTE If you do not create a user name and password for the Serengeti Command-Line Interface Client, you can use the default vCenter Server administrator credentials. The Serengeti Command-Line Interface Client uses the vCenter Server login credentials with read permissions on the Serengeti Management Server.
c At the prompt, type your password.
A command shell opens, and the Serengeti CLI prompt appears. You can use the help command to get help with Serengeti commands and command syntax.
• To display a list of available commands, type help.
• To get help for a specific command, append the name of the command to the help command.
help cluster create
• Press Tab to complete a command.
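The language-dependent choice in step 5 can be sketched in Python as follows. This is only an illustration and is not part of the Serengeti CLI package: the build_launch_command function, its parameters, and the language names it compares against are hypothetical, and the JAR name keeps the version_number placeholder used in the procedure.

```python
def build_launch_command(jar_name, language, windows_console=True):
    """Return the argument list that starts the Serengeti CLI.

    Hypothetical helper: only the java options come from the procedure.
    """
    command = ["java"]
    # French and German use code page 850 in a Windows command console.
    if windows_console and language.lower() in ("french", "german"):
        command.append("-Dfile.encoding=cp850")
    command += ["-jar", jar_name]
    return command


print(" ".join(build_launch_command("serengeti-cli-version_number.jar", "German")))
print(" ".join(build_launch_command("serengeti-cli-version_number.jar", "English")))
```

Returning a list instead of a single string makes the result easy to pass to a process launcher without extra quoting.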

Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client

To perform troubleshooting or to run your management automation scripts, log in to Hadoop master, worker, and client nodes with SSH from the Serengeti Management Server using SSH client tools such as SSH, PDSH, ClusterSSH, and Mussh, which do not require password authentication.
To connect to Hadoop cluster nodes over SSH, you can use a user name and password authenticated login. All deployed nodes are password-protected with either a random password or a user-specified password that was assigned when the cluster was created.
Prerequisites
Use the vSphere Web Client to log in to vCenter Server, and verify that the Serengeti Management Server virtual machine is running.
Procedure
1 Right-click the Serengeti Management Server virtual machine and select Open Console.
The password for the Serengeti Management Server appears.
NOTE If the password scrolls off the console screen, press Ctrl+D to return to the command prompt.
2 Use the vSphere Web Client to log in to the Hadoop node.
The password for the root user appears on the virtual machine console in the vSphere Web Client.
3 Change the password of the Hadoop node by running the set-password -u command.
sudo /opt/serengeti/sbin/set-password -u

Managing Application Managers 2

A key to managing your Hadoop clusters is understanding how to manage the different application managers that you use in your Big Data Extensions environment.
This chapter includes the following topics:
• “About Application Managers”
• “Add an Application Manager by Using the Serengeti Command-Line Interface”
• “View List of Application Managers by using the Serengeti Command-Line Interface”
• “Modify an Application Manager by Using the Serengeti Command-Line Interface”
• “View Supported Distributions for All Application Managers by Using the Serengeti Command-Line Interface”
• “View Configurations or Roles for Application Manager and Distribution by Using the Serengeti Command-Line Interface”
• “Delete an Application Manager by Using the Serengeti Command-Line Interface”

About Application Managers

You can use Cloudera Manager, Apache Ambari, and the default application manager to provision and manage clusters with VMware vSphere Big Data Extensions.
After you add a new Cloudera Manager or Ambari application manager to Big Data Extensions, you can redirect your software management tasks, including monitoring and managing clusters, to that application manager.
You can use an application manager to perform the following tasks:
• List all available vendor instances, supported distributions, and configurations or roles for a specific application manager and distribution.
• Create clusters.
• Monitor and manage services from the application manager console.
Check the documentation for your application manager for tool-specific requirements.
Restrictions
The following restrictions apply to Cloudera Manager and Ambari application managers:
• To add an application manager with HTTPS, use the FQDN instead of the URL.
• You cannot rename a cluster that was created with a Cloudera Manager or Ambari application manager.
• You cannot change services for a big data cluster from Big Data Extensions if the cluster was created with Ambari or Cloudera Manager application manager.
• To change services, configurations, or both, you must make the changes from the application manager on the nodes.
  If you install new services, Big Data Extensions starts and stops the new services together with old services.
• If you use an application manager to change services and big data cluster configurations, those changes cannot be synced from Big Data Extensions. The nodes that you create with Big Data Extensions do not contain the new services or configurations.

Add an Application Manager by Using the Serengeti Command-Line Interface

To use either Cloudera Manager or Ambari application managers, you must add the application manager and add server information to Big Data Extensions.
NOTE If you want to add a Cloudera Manager or Ambari application manager with HTTPS, use the FQDN in place of the URL.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager add command.
appmanager add --name application_manager_name --type [ClouderaManager|Ambari] --url http[s]://server:port
Application manager names can include only alphanumeric characters ([0-9, a-z, A-Z]) and the following special characters: underscores, hyphens, and blank spaces.
You can use the optional description variable to include a description of the application manager instance.
3 Enter your username and password at the prompt.
4 If you specified SSL, enter the file path of the SSL certificate at the prompt.
What to do next
To verify that the application manager was added successfully, run the appmanager list command.
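The naming rule in step 2 can be checked before you run the command. The following Python sketch is an illustration only: is_valid_appmanager_name is a hypothetical helper, not a Serengeti function, and rejecting an empty name is an assumption that the procedure does not state.

```python
import re

# Alphanumeric characters plus underscore, hyphen, and blank space,
# as listed in the procedure. Requiring at least one character is an assumption.
NAME_PATTERN = re.compile(r"[0-9A-Za-z_\- ]+")


def is_valid_appmanager_name(name):
    """Return True if every character of name is allowed by the naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ("cm_prod-01", "my manager", "cm@prod"):
    print(candidate, is_valid_appmanager_name(candidate))
```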

View List of Application Managers by using the Serengeti Command-Line Interface

You can use the appmanager list command to list the application managers that are installed on the Big Data Extensions environment.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list
The command returns a list of all application managers that are installed on the Big Data Extensions environment.

Modify an Application Manager by Using the Serengeti Command-Line Interface

You can modify the information for an application manager with the Serengeti CLI. For example, you can change the manager server IP address if it is not a static IP, or you can update the administrator account.
Prerequisites
Verify that you have at least one external application manager installed on your Big Data Extensions environment.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager modify command.
appmanager modify --name application_manager_name --url <http[s]://server:port>
Additional parameters are available for this command. For more information about this command, see “appmanager modify Command” in the Serengeti CLI Command Reference.

View Supported Distributions for All Application Managers by Using the Serengeti Command-Line Interface

Supported distributions are those distributions that are supported by Big Data Extensions. Available distributions are those distributions that have been added into your Big Data Extensions environment. You can view a list of the Hadoop distributions that are supported in the Big Data Extensions environment to determine if a particular distribution is available for a particular application manager.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list --name application_manager_name [--distros]
If you do not include the --name parameter, the command returns a list of all the Hadoop distributions that are supported on each of the application managers in the Big Data Extensions environment.
The command returns a list of all distributions that are supported for the application manager of the name that you specify.

View Configurations or Roles for Application Manager and Distribution by Using the Serengeti Command-Line Interface

You can use the appmanager list command to list the Hadoop configurations or roles for a specific application manager and distribution.
The configuration list includes those configurations that you can use to configure the cluster in the cluster specifications.
The role list contains the roles that you can use to create a cluster. You should not use unsupported roles to create clusters in the application manager.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list --name application_manager_name [--distro distro_name (--configurations | --roles) ]
The command returns a list of the Hadoop configurations or roles for a specific application manager and distribution.
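The three forms of the appmanager list command shown in this chapter can be summarized with a small Python sketch that assembles the command text. It is an illustration under stated assumptions: appmanager_list_command and its parameters are hypothetical, cm1 and CDH5 are made-up names, and the checks mirror only the syntax lines printed here. Whether the real CLI accepts other combinations is not stated in this guide.

```python
def appmanager_list_command(name=None, distros=False, distro=None, view=None):
    """Assemble an appmanager list command line (hypothetical helper)."""
    parts = ["appmanager", "list"]
    if name is not None:
        parts += ["--name", name]
    if distros and distro is not None:
        # Assumption: the two forms are alternatives and are not combined.
        raise ValueError("use either distros or distro, not both")
    if distros:
        parts.append("--distros")
    if distro is not None:
        if name is None:
            raise ValueError("distro requires an application manager name")
        if view not in ("configurations", "roles"):
            raise ValueError("view must be 'configurations' or 'roles'")
        parts += ["--distro", distro, "--" + view]
    elif view is not None:
        raise ValueError("view requires a distro")
    return " ".join(parts)


print(appmanager_list_command())
print(appmanager_list_command(name="cm1", distros=True))
print(appmanager_list_command(name="cm1", distro="CDH5", view="roles"))
```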

Delete an Application Manager by Using the Serengeti Command-Line Interface

You can use the Serengeti CLI to delete an application manager when you no longer need it.
Prerequisites
• Verify that you have at least one external application manager installed on your Big Data Extensions environment.
• Verify that the application manager you want to delete does not contain any clusters, or the deletion process will fail.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager delete command.
appmanager delete --name application_manager_name

Managing the Big Data Extensions Environment by Using the Serengeti Command-Line Interface 3

You must manage your Big Data Extensions environment. If you choose not to add the resource pool, datastore, and network when you deploy the Serengeti vApp, you must add these vSphere resources before you create a Hadoop or HBase cluster. You must also add application managers if you want to use either Ambari or Cloudera Manager to manage your Hadoop clusters. You can remove resources that you no longer need.
This chapter includes the following topics:
• “About Application Managers”
• “Add a Resource Pool with the Serengeti Command-Line Interface”
• “Remove a Resource Pool with the Serengeti Command-Line Interface”
• “Add a Datastore with the Serengeti Command-Line Interface”
• “Remove a Datastore with the Serengeti Command-Line Interface”
• “Add a Network with the Serengeti Command-Line Interface”
• “Remove a Network with the Serengeti Command-Line Interface”
• “Reconfigure a Static IP Network with the Serengeti Command-Line Interface”
• “Reconfigure the DNS Type with the Serengeti Command-Line Interface”
• “Increase Cloning Performance and Resource Usage of Virtual Machines”

About Application Managers

You can use Cloudera Manager, Apache Ambari, and the default application manager to provision and manage clusters with VMware vSphere Big Data Extensions.
After you add a new Cloudera Manager or Ambari application manager to Big Data Extensions, you can redirect your software management tasks, including monitoring and managing clusters, to that application manager.
You can use an application manager to perform the following tasks:
• List all available vendor instances, supported distributions, and configurations or roles for a specific application manager and distribution.
• Create clusters.
• Monitor and manage services from the application manager console.
Check the documentation for your application manager for tool-specific requirements.
Restrictions
The following restrictions apply to Cloudera Manager and Ambari application managers:
• To add an application manager with HTTPS, use the FQDN instead of the URL.
• You cannot rename a cluster that was created with a Cloudera Manager or Ambari application manager.
• You cannot change services for a big data cluster from Big Data Extensions if the cluster was created with Ambari or Cloudera Manager application manager.
• To change services, configurations, or both, you must make the changes from the application manager on the nodes.
  If you install new services, Big Data Extensions starts and stops the new services together with old services.
• If you use an application manager to change services and big data cluster configurations, those changes cannot be synced from Big Data Extensions. The nodes that you create with Big Data Extensions do not contain the new services or configurations.

Add an Application Manager by Using the Serengeti Command-Line Interface

To use either Cloudera Manager or Ambari application managers, you must add the application manager and add server information to Big Data Extensions.
NOTE If you want to add a Cloudera Manager or Ambari application manager with HTTPS, use the FQDN in place of the URL.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager add command.
appmanager add --name application_manager_name --type [ClouderaManager|Ambari] --url http[s]://server:port
Application manager names can include only alphanumeric characters ([0-9, a-z, A-Z]) and the following special characters: underscores, hyphens, and blank spaces.
You can use the optional description variable to include a description of the application manager instance.
3 Enter your username and password at the prompt.
4 If you specified SSL, enter the file path of the SSL certificate at the prompt.
What to do next
To verify that the application manager was added successfully, run the appmanager list command.

Modify an Application Manager by Using the Serengeti Command-Line Interface

You can modify the information for an application manager with the Serengeti CLI. For example, you can change the manager server IP address if it is not a static IP, or you can update the administrator account.
Prerequisites
Verify that you have at least one external application manager installed on your Big Data Extensions environment.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager modify command.
appmanager modify --name application_manager_name --url <http[s]://server:port>
Additional parameters are available for this command. For more information about this command, see “appmanager modify Command” in the Serengeti CLI Command Reference.

View Supported Distributions for All Application Managers by Using the Serengeti Command-Line Interface

Supported distributions are those distributions that are supported by Big Data Extensions. Available distributions are those distributions that have been added into your Big Data Extensions environment. You can view a list of the Hadoop distributions that are supported in the Big Data Extensions environment to determine if a particular distribution is available for a particular application manager.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list --name application_manager_name [--distros]
If you do not include the --name parameter, the command returns a list of all the Hadoop distributions that are supported on each of the application managers in the Big Data Extensions environment.
The command returns a list of all distributions that are supported for the application manager of the name that you specify.

View Configurations or Roles for Application Manager and Distribution by Using the Serengeti Command-Line Interface

You can use the appmanager list command to list the Hadoop configurations or roles for a specific application manager and distribution.
The configuration list includes those configurations that you can use to configure the cluster in the cluster specifications.
The role list contains the roles that you can use to create a cluster. You should not use unsupported roles to create clusters in the application manager.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list --name application_manager_name [--distro distro_name (--configurations | --roles) ]
The command returns a list of the Hadoop configurations or roles for a specific application manager and distribution.

View List of Application Managers by using the Serengeti Command-Line Interface

You can use the appmanager list command to list the application managers that are installed on the Big Data Extensions environment.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list
The command returns a list of all application managers that are installed on the Big Data Extensions environment.

Delete an Application Manager by Using the Serengeti Command-Line Interface

You can use the Serengeti CLI to delete an application manager when you no longer need it.
Prerequisites
• Verify that you have at least one external application manager installed on your Big Data Extensions environment.
• Verify that the application manager you want to delete does not contain any clusters, or the deletion process will fail.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager delete command.
appmanager delete --name application_manager_name

Add a Resource Pool with the Serengeti Command-Line Interface

You add resource pools to make them available for use by Hadoop clusters. Resource pools must be located at the top level of a cluster. Nested resource pools are not supported.
When you add a resource pool to Big Data Extensions, it symbolically represents the actual vSphere resource pool as recognized by vCenter Server. This symbolic representation lets you use the Big Data Extensions resource pool name, instead of the full path of the resource pool in vCenter Server, in cluster specification files.
NOTE After you add a resource pool to Big Data Extensions, do not rename the resource pool in vSphere. If you rename it, you cannot perform Serengeti operations on clusters that use that resource pool.
Procedure
1 Access the Serengeti Command-Line Interface client.
2 Run the resourcepool add command.
The --vcrp parameter is optional.
This example adds a Serengeti resource pool named myRP to the vSphere rp1 resource pool that is contained by the cluster1 vSphere cluster.
resourcepool add --name myRP --vccluster cluster1 --vcrp rp1

Remove a Resource Pool with the Serengeti Command-Line Interface

You can remove resource pools from Serengeti that are not in use by a Hadoop cluster. You remove resource pools when you do not need them or if you want the Hadoop clusters you create in the Serengeti Management Server to be deployed under a different resource pool. Removing a resource pool removes only the Serengeti reference to the vSphere resource pool. The resource pool itself is not deleted.
Procedure
1 Access the Serengeti Command-Line Interface client.
2 Run the resourcepool delete command.
If the command fails because the resource pool is referenced by a Hadoop cluster, you can use the resourcepool list command to see which cluster is referencing the resource pool.
This example deletes the resource pool named myRP.
resourcepool delete --name myRP

Add a Datastore with the Serengeti Command-Line Interface

You can add shared and local datastores to the Serengeti server to make them available to Hadoop clusters.
Procedure
1 Access the Serengeti CLI.
2 Run the datastore add command.
This example adds a new, local storage datastore named myLocalDS. The value of the --spec parameter, local*, is a wildcard specifying a set of vSphere datastores. All vSphere datastores whose names begin with “local” are added and managed as a whole by Serengeti.
datastore add --name myLocalDS --spec local* --type LOCAL
What to do next
After you add a datastore to Big Data Extensions, do not rename the datastore in vSphere. If you rename it, you cannot perform Serengeti operations on clusters that use that datastore.
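The effect of the local* wildcard in the --spec parameter can be pictured with a short Python sketch. It is an illustration only and assumes that the wildcard behaves like case-sensitive shell-style globbing, which this guide does not specify. The select_datastores helper and the datastore names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_datastores(spec, datastore_names):
    """Return the names that match the wildcard spec (hypothetical helper)."""
    # fnmatchcase performs a case-sensitive shell-style match.
    return [name for name in datastore_names if fnmatchcase(name, spec)]


# Made-up vSphere datastore names for the illustration.
vsphere_datastores = ["local1", "local2", "shared-nfs", "localSSD"]
print(select_datastores("local*", vsphere_datastores))
```

In this sketch every name that begins with “local” is selected, matching the description of the example above, and shared-nfs is left out.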

Remove a Datastore with the Serengeti Command-Line Interface

You can remove any datastore from Serengeti that is not referenced by any Hadoop clusters. Removing a datastore removes only the reference to the vCenter Server datastore. The datastore itself is not deleted.
You remove datastores if you do not need them or if you want to deploy the Hadoop clusters that you create in the Serengeti Management Server under a different datastore.
Procedure
1 Access the Serengeti CLI.
2 Run the datastore delete command.
If the command fails because the datastore is referenced by a Hadoop cluster, you can use the datastore list command to see which cluster is referencing the datastore.
This example deletes the myDS datastore.
datastore delete --name myDS

Add a Network with the Serengeti Command-Line Interface

You add networks to Big Data Extensions to make their IP addresses available to Hadoop clusters. A network is a port group, as well as a means of accessing the port group through an IP address.
After you add a network to Big Data Extensions, do not rename it in vSphere. If you rename the network, you cannot perform Serengeti operations on clusters that use that network.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add the network.
Procedure
1 Access the Serengeti CLI.
2 Run the network add command.
This example adds a network named myNetwork to the 10PG vSphere port group. Virtual machines that use this network use DHCP to obtain the IP addresses.
network add --name myNetwork --portGroup 10PG --dhcp
This example adds a network named myNetwork to the 10PG vSphere port group. Hadoop nodes use addresses in the 192.168.1.2-100 IP address range, the DNS server IP address is 10.111.90.2, the gateway address is 192.168.1.1, and the subnet mask is 255.255.255.0.
network add --name myNetwork --portGroup 10PG --ip 192.168.1.2-100 --dns 10.111.90.2 --gateway 192.168.1.1 --mask 255.255.255.0
To specify multiple IP address segments, use multiple strings to express the IP address range in the format xx.xx.xx.xx-xx[,xx]*.
xx.xx.xx.xx-xx, xx.xx.xx.xx-xx, single_ip, single_ip
This example adds a dynamic network with DHCP-assigned IP addresses and meaningful host names.
network add --name ddnsNetwork --dhcp --portGroup pg1 --dnsType DYNAMIC
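
The IP address range format used with the --ip parameter can be expanded into individual addresses as in the following Python sketch. It is an illustration only, not how Serengeti parses the value: expand_segment and expand_ip_spec are hypothetical helpers, and the sketch assumes that a range varies only the last octet and that segments are separated by commas.

```python
import ipaddress


def expand_segment(segment):
    """Expand one segment: either a.b.c.d-e or a single address."""
    segment = segment.strip()
    if "-" not in segment:
        return [ipaddress.IPv4Address(segment)]
    start_text, end_text = segment.split("-", 1)
    start = ipaddress.IPv4Address(start_text)
    first_octet = int(start_text.rsplit(".", 1)[1])
    last_octet = int(end_text)
    if not first_octet <= last_octet <= 255:
        raise ValueError(f"invalid range: {segment}")
    return [start + offset for offset in range(last_octet - first_octet + 1)]


def expand_ip_spec(spec):
    """Expand a comma-separated list of segments into addresses."""
    addresses = []
    for segment in spec.split(","):
        addresses.extend(expand_segment(segment))
    return addresses


addresses = expand_ip_spec("192.168.1.2-100")
# Gateway and mask from the static IP example above.
network = ipaddress.IPv4Network("192.168.1.1/255.255.255.0", strict=False)
print(len(addresses), addresses[0], addresses[-1])
print(all(address in network for address in addresses))
```

For the static IP example above, the range 192.168.1.2-100 expands to 99 addresses, all of which fall inside the subnet defined by the gateway and mask.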

Remove a Network with the Serengeti Command-Line Interface

You can remove networks from Serengeti that are not referenced by any Hadoop clusters. Removing an unused network frees the IP addresses for reuse.
Procedure
1 Access the Serengeti CLI.
2 Run the network delete command.
network delete --name network_name
If the command fails because the network is referenced by a Hadoop cluster, you can use the network
list --detail command to see which cluster is referencing the network.

Reconfigure a Static IP Network with the Serengeti Command-Line Interface

You can reconfigure a Serengeti static IP network by adding IP address segments to it. You might need to add IP address segments so that there is enough capacity for a cluster that you want to create.
If the IP range that you specify includes IP addresses that are already in the network, Serengeti ignores the duplicated addresses. The remaining addresses in the specified range are added to the network. If the network is already used by a cluster, the cluster can use the new IP addresses after you add them to the network. If only part of the IP range is used by a cluster, the unused IP address can be used when you create a new cluster.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add them to the network.
Procedure
1 Access the Serengeti CLI.
2 Run the network modify command.
This example adds IP addresses from 192.168.1.2 to 192.168.1.100 to a network named myNetwork.
network modify --name myNetwork --addIP 192.168.1.2-100
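The duplicate handling described above can be pictured with a short Python sketch. It only models the documented behavior (addresses already in the network are ignored and the rest are added). It is not Serengeti code, and the function name add_segment is hypothetical.

```python
def add_segment(existing, new_addresses):
    """Return (updated pool, addresses actually added), skipping duplicates.

    Hypothetical model of the documented behavior, not Serengeti's code.
    """
    pool = list(existing)
    seen = set(existing)
    added = []
    for address in new_addresses:
        if address not in seen:
            seen.add(address)
            pool.append(address)
            added.append(address)
    return pool, added


if __name__ == "__main__":
    pool, added = add_segment(
        ["192.168.1.2", "192.168.1.3"],
        ["192.168.1.3", "192.168.1.4"],
    )
    print(added)
```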

Reconfigure the DNS Type with the Serengeti Command-Line Interface

You can reconfigure a network's Domain Name System (DNS) type, and specify that Big Data Extensions generate meaningful host names for the nodes in a Hadoop cluster.
After you add a network to Big Data Extensions, do not rename it in vSphere. If you rename the network, you cannot perform Serengeti operations on clusters that use that network.
There are three DNS options you can specify:
Normal
The DNS server provides both forward and reverse FQDN to IP resolution. Reverse DNS maps IP addresses to domain names, the opposite of forward (normal) DNS, which maps domain names to IP addresses. Normal is the default DNS type.
Dynamic
Dynamic DNS (DDNS or DynDNS) is a method of automatically updating a name server in the Domain Name System (DNS) with the active DNS configuration of its configured host names, addresses, or other information. Big Data Extensions integrates with a Dynamic DNS server in its network, through which it provides meaningful host names to the nodes in a Hadoop cluster. The cluster then automatically registers with the DNS server.
Others
There is no DNS server, or the DNS server does not provide normal DNS resolution or Dynamic DNS services. In this case, you must add the FQDN/IP mapping for all nodes to the /etc/hosts file of each node in the cluster. Through this mapping of host names to IP addresses, each node can contact the other nodes in the cluster.
Host names provide easier visual identification, and allow you to use services such as Single Sign-On, which requires a properly configured DNS.
Procedure
1 Access the Serengeti CLI.
2 Run the network modify command.
Specify the DNS type as NORMAL, DYNAMIC, or OTHERS. NORMAL is the default value.
This example modifies a network named myNetwork to use a Dynamic DNS type. Virtual machines that use this network will use DHCP to obtain the IP addresses.
network modify --name myNetwork --dnsType DYNAMIC
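For the OTHERS DNS type, every node needs the same FQDN/IP mapping in its /etc/hosts file. The following Python sketch shows one way to generate those lines from a table of node names and addresses. The node names, the example.com domain, and the function name hosts_file_lines are assumptions made for illustration. Big Data Extensions does not provide this helper.

```python
def hosts_file_lines(nodes, domain):
    """Build /etc/hosts entries ('IP<TAB>FQDN<TAB>short name') for every node.

    nodes maps a short host name to its IP address. All names here are
    hypothetical examples.
    """
    lines = []
    for name, ip in sorted(nodes.items()):
        lines.append(f"{ip}\t{name}.{domain}\t{name}")
    return lines


if __name__ == "__main__":
    cluster_nodes = {
        "master-0": "192.168.1.2",
        "worker-0": "192.168.1.3",
    }
    for line in hosts_file_lines(cluster_nodes, "example.com"):
        print(line)
```

The same set of lines would be appended to /etc/hosts on each node so that all nodes resolve one another consistently.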

Increase Cloning Performance and Resource Usage of Virtual Machines

You can rapidly clone and deploy virtual machines using Instant Clone, a feature of vSphere 6.0.
Using Instant Clone, a parent virtual machine is forked, and then a child virtual machine (or instant clone) is created. The child virtual machine leverages the storage and memory of the parent, reducing resource usage.
When provisioning a cluster, Big Data Extensions creates a parent virtual machine for each host on which a cluster node has been placed. After provisioning, a new resource pool labeled BDE-ParentVMs-$serengeti.uuid-$template.name is visible in vCenter Server. This resource pool contains several parent virtual machines. Normal cluster nodes are instantly cloned from these parent virtual machines. Once the parent virtual machines are created on the cluster hosts, the time required to provision and scale a cluster is significantly reduced.
When you scale a cluster, the clone type you specified during cluster creation continues to be used, regardless of the current clone type. For example, if you create a cluster using instant clone, then change your Big Data Extensions clone type to fast clone, the cluster you provisioned using instant clone continues to use instant clone when it scales out.
If you create clusters and later want to make changes to the template virtual machine used to provision those clusters, you must first delete all the existing parent virtual machines before using the new template virtual machine. When you create clusters using the new template, Big Data Extensions creates new parent virtual machines based on the new template.
Prerequisites
Your Big Data Extensions deployment must use vSphere 6.0 to take advantage of Instant Clone.
Procedure
1 Log in to the Serengeti Management Server.
2 Edit the /opt/serengeti/conf/serengeti.properties file and change the value of the cluster.clone.service property from fast to instant.
The default clone type when running vSphere 6.0 is Instant Clone.
cluster.clone.service = instant
3 To enable Instant Clone, restart the Serengeti Management Server.
sudo /sbin/service tomcat restart
The Serengeti Management Server reads the revised serengeti.properties file and applies the Instant Clone feature to all new clusters you create.
What to do next
All clusters you create will now use Instant Clone to deploy virtual machines. See Chapter 5, “Creating
Hadoop and HBase Clusters,” on page 33.
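Before you restart the server, it can help to confirm which clone type the properties file specifies. The following Python sketch parses simple key = value lines and reads the cluster.clone.service property. The parser is a simplified assumption (it does not handle every Java properties feature, such as line continuations), and the sample text stands in for the real file contents.

```python
def read_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # or ! comments.

    Simplified sketch: line continuations and escapes are not handled.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Hypothetical excerpt standing in for /opt/serengeti/conf/serengeti.properties.
sample = """# clone settings
cluster.clone.service = instant
"""

clone_type = read_properties(sample).get("cluster.clone.service")
print(clone_type)
```

To check the real file, you would read /opt/serengeti/conf/serengeti.properties on the Serengeti Management Server and pass its contents to read_properties.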

Managing Users and User Accounts 4

By default, Big Data Extensions provides authentication only for local user accounts. If you want to use LDAP (either Active Directory or an OpenLDAP compatible directory) to authenticate users, you must configure Big Data Extensions for use with your LDAP or Active Directory service.
This chapter includes the following topics:
“Create an LDAP Service Configuration File Using the Serengeti Command-Line Interface,” on page 27
“Activate Centralized User Management Using the Serengeti Command-Line Interface,” on page 29
“Create a Cluster With LDAP User Authentication Using the Serengeti Command-Line Interface,” on page 29
“Change User Management Modes Using the Serengeti Command-Line Interface,” on page 30
“Modify LDAP Configuration Using the Serengeti Command-Line Interface,” on page 31

Create an LDAP Service Configuration File Using the Serengeti Command-Line Interface

Create a configuration file that identifies your LDAP or Active Directory server environment.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Navigate to a directory on the Serengeti Management Server where you want to create and store the
configuration file.
You can use the directory /opt/serengeti/etc to store your configuration file.
3 Using a text editor, create a JavaScript Object Notation (JSON) file containing the configuration settings
for your LDAP or Active Directory service.
The format of the configuration file is shown below.
{
  "type": "user_mode_type",
  "primaryUrl": "ldap://AD_LDAP_server_IP_address:network_port",
  "baseUserDn": "DN_information",
  "baseGroupDn": "DN_information",
  "userName": "username",
  "password": "password",
  "mgmtVMUserGroupDn": "DN_information"
}
Table 4-1. LDAP Connection Information
type                 The external user authentication service you will use, which is either AD_AS_LDAP or LDAP.
baseUserDn           Specify the base user DN.
baseGroupDn          Specify the base group DN.
primaryUrl           Specify the primary server URL of your Active Directory or LDAP server.
mgmtVMUserGroupDn    (Optional) Specify the base DN for searching groups to access the Serengeti Management Server.
userName             Type the username of the Active Directory or LDAP server administrator account.
password             Type the password of the Active Directory or LDAP server administrator account.
4 When you complete the file, save your work.
Example: Example LDAP Configuration File
The following example illustrates the configuration file for an LDAP server within the acme.com domain.
{
  "type": "LDAP",
  "primaryUrl": "ldap://acme.com:8888",
  "baseUserDn": "ou=users,dc=dev,dc=acme,dc=com",
  "baseGroupDn": "ou=users,dc=dev,dc=acme,dc=com",
  "userName": "jsmith",
  "password": "MyPassword",
  "mgmtVMUserGroupDn": "cn=Administrators,cn=Builtin,dc=dev,dc=acme,dc=com"
}
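A quick check of the file before you activate it can catch missing settings or a malformed JSON document. The following Python sketch is one possible pre-flight check, not a Big Data Extensions tool. It assumes that every field except mgmtVMUserGroupDn is required, which follows from the table marking only that field as optional. The function name check_ldap_config is hypothetical.

```python
import json

# Assumption: all fields except the optional mgmtVMUserGroupDn are required.
REQUIRED_KEYS = {
    "type", "primaryUrl", "baseUserDn", "baseGroupDn", "userName", "password",
}
ALLOWED_TYPES = {"AD_AS_LDAP", "LDAP"}


def check_ldap_config(text):
    """Return a list of problems found in an LDAP configuration JSON string."""
    config = json.loads(text)  # raises ValueError if the JSON is malformed
    problems = [
        f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))
    ]
    if "type" in config and config["type"] not in ALLOWED_TYPES:
        problems.append(f"unsupported type: {config['type']}")
    return problems


if __name__ == "__main__":
    example = """
    {
      "type": "LDAP",
      "primaryUrl": "ldap://acme.com:8888",
      "baseUserDn": "ou=users,dc=dev,dc=acme,dc=com",
      "baseGroupDn": "ou=users,dc=dev,dc=acme,dc=com",
      "userName": "jsmith",
      "password": "MyPassword"
    }
    """
    print(check_ldap_config(example))
```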
What to do next
With an LDAP configuration file created, you can now activate centralized user management for your Big Data Extensions environment. See “Activate Centralized User Management Using the Serengeti
Command-Line Interface,” on page 29.

Activate Centralized User Management Using the Serengeti Command-Line Interface

You must specify that Big Data Extensions use an external user identity source before you can manage users through your LDAP or Active Directory.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Create a configuration file identifying your LDAP or Active Directory environment for use with Big Data Extensions. See “Create an LDAP Service Configuration File Using the Serengeti Command-Line Interface,” on page 27.
Procedure
1 Access the Serengeti CLI.
2 Run the command usermgmtserver add --cfgfile config_file_path
This example activates centralized user management, specifying the file /opt/serengeti/LDAPConfigFile.cfg as the file containing your LDAP configuration settings.
usermgmtserver add --cfgfile /opt/serengeti/LDAPConfigFile.cfg
3 Run the mgmtvmcfg get command to verify successful configuration of your environment by printing out the LDAP or Active Directory configuration information.
The contents of the active configuration file in use by your Big Data Extensions environment prints to the terminal.
What to do next
When you activate centralized user management, you can create clusters and assign user management roles using the users and user groups defined by your LDAP or Active Directory service. See “Create a Cluster With LDAP User Authentication Using the Serengeti Command-Line Interface,” on page 29.

Create a Cluster With LDAP User Authentication Using the Serengeti Command-Line Interface

With centralized user management configured and activated, you can grant privileges to users and user groups in your LDAP or Active Directory service to individual Hadoop clusters that you create.
As an example of how you can use centralized user management in your Big Data Extensions environment, you can assign groups with administrative privileges in your LDAP or Active Directory service access to the Serengeti Management Server. This allows those users to administer Big Data Extensions and the Serengeti Management Server. You can then give another user group access to Hadoop cluster nodes, allowing them to run Hadoop jobs.
To access the Serengeti CLI and Serengeti commands, users must change to the user serengeti after they log in. For example, you can use the command su to change to the serengeti user, after which you can access the Serengeti CLI.
su serengeti
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Activate centralized user management for your Big Data Extensions deployment. See “Activate Centralized User Management Using the Serengeti Command-Line Interface,” on page 29.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command, and set the --adminGroupName and --userGroupName parameters to the names of the administrative groups and user groups to which you want to grant privileges for the cluster you are creating.
cluster create --name cluster_name --type hbase --adminGroupName AdminGroupName --userGroupName UserGroupName
What to do next
After you deploy the cluster, you can access the Hadoop cluster by using several methods. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.

Change User Management Modes Using the Serengeti Command-Line Interface

You can change the user management mode of your Big Data Extensions environment. You can choose to use local user management, LDAP, or a combination of the two.
Big Data Extensions lets you authenticate local users, those managed by LDAP or Active Directory, or a combination of these authentication methods.
Table 4-2. User Authentication Modes
User Mode     Description
Local         Specify LOCAL to create and manage users and groups that are stored locally in your Big Data Extensions environment. Local is the default user management solution.
LDAP user     Specify LDAP to create and manage users and groups that are stored in your organization's identity source, such as Active Directory or LDAP. If you choose LDAP user you must configure Big Data Extensions to use an LDAP or Active Directory service (Active Directory as LDAP).
Mixed mode    Specify MIXED to use a combination of both local users and users stored in an external identity source. If you choose mixed mode you must configure Big Data Extensions to use an LDAP or Active Directory service (Active Directory as LDAP).
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the command mgmtvmcfg modify to specify the user authentication mode you want to use.
Specify LOCAL to create and manage users and groups that are stored locally in your Big Data Extensions environment. LOCAL is the default user management solution when no Active Directory or LDAP service is available.
mgmtvmcfg modify LOCAL
Specify MIXED to use a combination of both local users and users stored in an external identity source. If you choose mixed mode you must configure Big Data Extensions to use an LDAP or Active Directory service.
mgmtvmcfg modify MIXED
Specify LDAP to create and manage users and groups that are stored in your organization's identity source, such as Active Directory as LDAP or LDAP. If you use LDAP you must configure Big Data Extensions to use an LDAP or Active Directory service.
mgmtvmcfg modify LDAP
Big Data Extensions uses the user authentication mode you specify.

Modify LDAP Configuration Using the Serengeti Command-Line Interface

You can modify your LDAP settings and make those changes available to your Big Data Extensions environment.
After you change your LDAP configuration settings, you can propagate the changes to Big Data Extensions. This lets you keep your LDAP service information current.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Activate centralized user management for your Big Data Extensions deployment. See “Activate Centralized User Management Using the Serengeti Command-Line Interface,” on page 29.
Modify the LDAP configuration file to account for any changes you want to make to your user management settings. See “Create an LDAP Service Configuration File Using the Serengeti Command-Line Interface,” on page 27.
Procedure
1 Access the Serengeti CLI.
2 Run the command usermgmtserver modify --cfgfile config_file_path
usermgmtserver modify --cfgfile config_file_path
Any changes you made to the LDAP configuration file are applied to your Big Data Extensions environment. Clusters you create will use the new LDAP settings.
What to do next
You can create clusters and assign user management roles using the users and user groups defined by your LDAP or Active Directory service. See “Create a Cluster With LDAP User Authentication Using the
Serengeti Command-Line Interface,” on page 29.

Creating Hadoop and HBase Clusters 5

With Big Data Extensions, you can create and deploy Hadoop and HBase clusters. A big data cluster is a type of computational cluster designed for storing and analyzing large amounts of unstructured data in a distributed computing environment.
Restrictions
When you create an HBase only cluster, you must use the default application manager because the other application managers do not support HBase only clusters.
You cannot rename a cluster that was created with Cloudera Manager or Ambari application manager.
Temporarily powering off hosts will cause Big Data clusters to fail during cluster creation.
When creating Big Data clusters, Big Data Extensions calculates virtual machine placement according to available resources, Hadoop best practices, and user defined placement policies prior to creating the virtual machines. When performing placement calculations, if some hosts are powered off or set to stand-by, either manually, or automatically by VMware Distributed Power Management (VMware DPM), those hosts will not be considered as available resources when Big Data Extensions calculates virtual machine placement for use with a Big Data cluster.
If a host is powered off or set to stand-by after Big Data Extensions calculates virtual machine placement, but before it creates the virtual machines, the cluster fails to create until you power on those hosts. The following workarounds can help you both prevent and recover from this issue.
Disable VMware DPM on those vSphere clusters where you deploy and run Big Data Extensions.
Put hosts in maintenance mode before you power them off.
If a Big Data cluster fails to create due to its assigned hosts being temporarily unavailable, resume the cluster creation after you power on the hosts.
Requirements
The resource requirements are different for clusters created with the Serengeti Command-Line Interface and the Big Data Extensions plug-in for the vSphere Web Client because the clusters use different default templates. The default clusters created by using the Serengeti CLI are targeted for Project Serengeti users and proof-of-concept applications, and are smaller than the Big Data Extensions plug-in templates, which are targeted for larger deployments for commercial use.
Some deployment configurations require more resources than other configurations. For example, if you create a Greenplum HD 1.2 cluster, you cannot use the small size virtual machine. If you create a default MapR or Greenplum HD cluster by using the Serengeti CLI, at least 550 GB of storage and 55 GB of memory are recommended. For other Hadoop distributions, at least 350 GB of storage and 35 GB of memory are recommended.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the virtual machine automatic migration on the cluster. Although this prevents vSphere from automatically migrating the virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing such management functions outside of the Big Data Extensions environment can make it impossible for you to perform some Big Data Extensions operations, such as disk failure recovery.
Passwords must be from 8 to 20 characters, use only visible lower-ASCII characters (no spaces), and must contain at least one uppercase alphabetic character (A - Z), at least one lowercase alphabetic character (a - z), at least one digit (0 - 9), and at least one of the following special characters: _, @, #, $, %, ^, &, *
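The password rules above can be checked before you submit a password. The following Python sketch is an illustration only, not the validation that Big Data Extensions performs. It assumes that "visible lower-ASCII characters" means the printable ASCII characters from ! through ~, and the function name password_problems is hypothetical.

```python
import string

SPECIALS = set("_@#$%^&*")


def password_problems(password):
    """Return the list of password rules that the given password violates.

    Assumption: 'visible lower-ASCII' is read as the characters '!' to '~'.
    """
    problems = []
    if not 8 <= len(password) <= 20:
        problems.append("length must be 8 to 20 characters")
    if any(not ("!" <= ch <= "~") for ch in password):
        problems.append("only visible lower-ASCII characters are allowed")
    if not any(ch in string.ascii_uppercase for ch in password):
        problems.append("at least one uppercase letter is required")
    if not any(ch in string.ascii_lowercase for ch in password):
        problems.append("at least one lowercase letter is required")
    if not any(ch in string.digits for ch in password):
        problems.append("at least one digit is required")
    if not any(ch in SPECIALS for ch in password):
        problems.append("at least one of _ @ # $ % ^ & * is required")
    return problems


if __name__ == "__main__":
    for candidate in ("Passw0rd_", "password"):
        print(candidate, password_problems(candidate))
```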
This chapter includes the following topics:
“About Hadoop and HBase Cluster Deployment Types,” on page 35
“Default Hadoop Cluster Configuration for Serengeti,” on page 35
“Default HBase Cluster Configuration for Serengeti,” on page 36
“About Cluster Topology,” on page 36
“About HBase Clusters,” on page 39
“About MapReduce Clusters,” on page 46
“About Data Compute Clusters,” on page 49
“About Customized Clusters,” on page 60

About Hadoop and HBase Cluster Deployment Types

With Big Data Extensions, you can create and use several types of big data clusters.
Basic Hadoop Cluster
Simple Hadoop deployment for proof of concept projects and other small-scale data processing tasks. The Basic Hadoop cluster contains HDFS and the MapReduce framework. The MapReduce framework processes problems in parallel across huge datasets in the HDFS.
HBase Cluster
Runs on top of HDFS and provides a fault-tolerant way of storing large quantities of sparse data.
Data and Compute Separation Cluster
Separates the data and compute nodes, or clusters that contain compute nodes only. In this type of cluster, the data node and compute node are not on the same virtual machine.
Compute Only Cluster
You can create a cluster that contains only compute nodes, for example JobTracker, TaskTracker, ResourceManager, and NodeManager nodes, but not NameNode and DataNode nodes. A compute only cluster is used to run MapReduce jobs on an external HDFS cluster.
Compute Workers Only Cluster
Contains only compute worker nodes, for example, TaskTracker and NodeManager nodes, but not NameNode and DataNode nodes. A compute workers only cluster is used to add more compute worker nodes to an existing Hadoop cluster.
HBase Only Cluster
Contains HBase Master, HBase RegionServer, and ZooKeeper nodes, but not NameNode or DataNode nodes. Multiple HBase only clusters can use the same external HDFS cluster.
Customized Cluster
Uses a cluster specification file to create clusters using the same configuration as your previously created clusters. You can edit the cluster specification file to customize the cluster configuration.

Default Hadoop Cluster Configuration for Serengeti

For basic Hadoop deployments, such as proof of concept projects, you can use the default Hadoop cluster configuration for Serengeti for clusters that are created with the CLI.
The resulting cluster deployment consists of the following nodes and virtual machines:
One master node virtual machine with NameNode and JobTracker services.
Three worker node virtual machines, each with DataNode and TaskTracker services.
One client node virtual machine containing the Hadoop client environment: the Hadoop client shell, Pig, and Hive.

Hadoop Distributions Supporting MapReduce v1 and MapReduce v2 (YARN)

If you use either the Cloudera CDH4 or CDH5 Hadoop distribution, both of which support MapReduce v1 and MapReduce v2 (YARN), the default Hadoop cluster configurations are different. The default Hadoop cluster configuration for CDH4 is a MapReduce v1 cluster. The default Hadoop cluster configuration for CDH5 is a MapReduce v2 cluster. All other distributions support either MapReduce v1 or MapReduce v2 (YARN), but not both.

Default HBase Cluster Configuration for Serengeti

HBase is an open source distributed columnar database that uses MapReduce and HDFS to manage data. You can use HBase to build big table applications.
To run HBase MapReduce jobs, configure the HBase cluster to include JobTracker nodes or TaskTracker nodes. When you create an HBase cluster with the CLI, according to the default Serengeti HBase template, the resulting cluster consists of the following nodes:
One master node, which runs the NameNode and HBaseMaster services.
Three zookeeper nodes, each running the ZooKeeper service.
Three data nodes, each running the DataNode and HBase Regionserver services.
One client node, from which you can run Hadoop or HBase jobs.
The default HBase cluster deployed by Serengeti does not contain Hadoop JobTracker or Hadoop TaskTracker daemons. To run an HBase MapReduce job, deploy a customized, nondefault HBase cluster.

About Cluster Topology

You can improve workload balance across your cluster nodes, and improve performance and throughput, by specifying how Hadoop virtual machines are placed using topology awareness. For example, you can have separate data and compute nodes, and improve performance and throughput by placing the nodes on the same set of physical hosts.
To get maximum performance out of your big data cluster, configure your cluster so that it has awareness of the topology of your environment's host and network information. Hadoop performs better when it favors within-rack transfers, where more bandwidth is available, over off-rack transfers when assigning MapReduce tasks to nodes. HDFS can place replicas more intelligently to trade off performance and resilience. For example, if you have separate data and compute nodes, you can improve performance and throughput by placing the nodes on the same set of physical hosts.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the virtual machine automatic migration of the cluster. Although this prevents vSphere from migrating the virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing such management functions outside of the Big Data Extensions environment might break the placement policy of the cluster, such as the number of instances per host and the group associations. Even if you do not specify a placement policy, using vCenter Server to migrate clusters can break the default ROUNDROBIN placement policy constraints.
You can specify the following topology awareness configurations.
Hadoop Virtualization Extensions (HVE)
Enhanced cluster reliability and performance provided by refined Hadoop replica placement, task scheduling, and balancer policies. Hadoop clusters implemented on a virtualized infrastructure have full awareness of the topology on which they are running when using HVE.
To use HVE, your Hadoop distribution must support HVE and you must create and upload a topology rack-hosts mapping file.
RACK_AS_RACK
Standard topology for Apache Hadoop distributions. Only rack and host information are exposed to Hadoop. To use RACK_AS_RACK, create and upload a server topology file.
HOST_AS_RACK
Simplified topology for Apache Hadoop distributions. To avoid placing all HDFS data block replicas on the same physical host, each physical host is treated as a rack. Because HDFS never places all replicas of a data block on the same rack, this avoids the worst case scenario of a single host failure causing the complete loss of any data block.
Use HOST_AS_RACK if your cluster uses a single rack, or if you do not have rack information with which to decide about topology configuration options.
None
No topology is specified.

Topology Rack-Hosts Mapping File

Rack-hosts mapping files are plain text files that associate logical racks with physical hosts. These files are required to create clusters with HVE or RACK_AS_RACK topology.
The format for every line in a topology rack-hosts mapping file is:
rackname: hostname1, hostname2 ...
For example, to assign physical hosts a.b.foo.com and a.c.foo.com to rack1, and physical host c.a.foo.com to rack2, include the following lines in your topology rack-hosts mapping file.
rack1: a.b.foo.com, a.c.foo.com
rack2: c.a.foo.com
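Because a malformed line can be hard to spot in a long mapping file, it can be useful to check the file before you upload it. The following Python sketch parses the rackname: hostname1, hostname2 format into a dictionary. It is an illustrative helper with a hypothetical name (parse_rack_hosts), not the parser that Big Data Extensions uses, and it assumes one rack per line.

```python
def parse_rack_hosts(text):
    """Map each rack name to its list of host names.

    Assumes one 'rackname: host1, host2' entry per line. Raises ValueError
    for a line without a rack name or colon.
    """
    racks = {}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        rack, sep, hosts = line.partition(":")
        if not sep or not rack.strip():
            raise ValueError(f"line {number}: expected 'rackname: host1, host2'")
        names = [host.strip() for host in hosts.split(",") if host.strip()]
        racks.setdefault(rack.strip(), []).extend(names)
    return racks


if __name__ == "__main__":
    mapping = parse_rack_hosts(
        "rack1: a.b.foo.com, a.c.foo.com\n"
        "rack2: c.a.foo.com\n"
    )
    print(mapping)
```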

Topology Placement Policy Definition Files

The placementPolicies field in the cluster specification file controls how nodes are placed in the cluster.
If you specify values for both instancePerHost and groupRacks, there must be a sufficient number of available hosts. To display the rack hosts information, use the topology list command.
The code shows an example placementPolicies field in a cluster specification file.
{
  "nodeGroups":[
    …
    {
      "name": "group_name",
      …
      "placementPolicies": {
        "instancePerHost": 2,
        "groupRacks": {
          "type": "ROUNDROBIN",
          "racks": ["rack1", "rack2", "rack3"]
        },
        "groupAssociations": [{
          "reference": "another_group_name",
          "type": "STRICT" // or "WEAK"
        }]
      }
    },
    …
  ]
}
Table 5-1. placementPolicies Object Definition
JSON field          Type       Description
instancePerHost     Optional   Number of virtual machine nodes to place for each physical ESXi host. This constraint is aimed at balancing the workload.
groupRacks          Optional   Method of distributing virtual machine nodes among the cluster’s physical racks. Specify the following JSON strings:
                               type. Specify ROUNDROBIN, which selects candidates fairly and without priority.
                               racks. Which racks in the topology map to use.
groupAssociations   Optional   One or more target node groups with which this node group associates. Specify the following JSON strings:
                               reference. Target node group name.
                               type:
                                 STRICT. Place the node group on the target group’s set or subset of ESXi hosts. If STRICT placement is not possible, the operation fails.
                                 WEAK. Attempt to place the node group on the target group’s set or subset of ESXi hosts, but if that is not possible, use an extra ESXi host.
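To get a rough idea of whether the selected racks offer enough hosts for a node group, you can compare the number of hosts in those racks with the number of hosts the group needs. The following Python sketch makes that estimate. It is a simplification under stated assumptions: it treats instancePerHost as an exact per-host count, ignores resource limits and group associations, and uses hypothetical names (hosts_needed, enough_hosts, instance_count). The placement calculation that Big Data Extensions performs is more involved.

```python
import math


def hosts_needed(instance_count, instance_per_host):
    """Hosts required when each host receives instance_per_host nodes."""
    return math.ceil(instance_count / instance_per_host)


def enough_hosts(rack_hosts, racks, instance_count, instance_per_host):
    """Check whether the listed racks contain enough hosts for the group.

    rack_hosts maps rack names to host lists, as in a rack-hosts mapping file.
    """
    available = sum(len(rack_hosts.get(rack, [])) for rack in racks)
    return available >= hosts_needed(instance_count, instance_per_host)


if __name__ == "__main__":
    topology = {"rack1": ["a.b.foo.com", "a.c.foo.com"], "rack2": ["c.a.foo.com"]}
    print(enough_hosts(topology, ["rack1", "rack2"], 5, 2))
```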

Create a Cluster with Topology Awareness with the Serengeti Command-Line Interface

To achieve a balanced workload or to improve performance and throughput, you can control how Hadoop virtual machines are placed by adding topology awareness to the Hadoop clusters. For example, you can have separate data and compute nodes, and improve performance and throughput by placing the nodes on the same set of physical hosts.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 (Optional) Run the topology list command to view the list of available topologies.
topology list
3 (Optional) If you want the cluster to use HVE or RACK_AS_RACK topologies, create a topology rack-hosts mapping file and upload the file to the Serengeti Management Server.
topology upload --fileName name_of_rack_hosts_mapping_file
4 Run the cluster create command to create the cluster.
cluster create --name cluster-name ... --topology {HVE|RACK_AS_RACK|HOST_AS_RACK}
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD 1.1 or later cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce network traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail, or the cluster might be created but not function.
This example creates an HVE topology.
cluster create --name cluster-name --topology HVE --distro name_of_HVE-supported_distro
5 View the allocated nodes on each rack.
cluster list --name cluster-name --detail
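The NOTE in step 4 depends on forward and reverse FQDN/IP resolution. Before you create a cluster, a check along the following lines might help confirm that a name resolves in both directions. This is a minimal sketch: the function name resolves_both_ways is hypothetical, it uses only the Python standard library, and socket.gethostbyname covers IPv4 only.

```python
import socket
from typing import Callable


def resolves_both_ways(
    fqdn: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> bool:
    """Return True if fqdn -> IP and IP -> fqdn both succeed and agree."""
    try:
        ip = forward(fqdn)                  # forward lookup (A record)
        name, aliases, _ = reverse(ip)      # reverse lookup (PTR record)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False
    candidates = {name.lower(), *(a.lower() for a in aliases)}
    return fqdn.lower() in candidates
```

The resolver functions are parameters so that the check can be exercised without a live DNS server. In normal use, call resolves_both_ways with only the FQDN of each node.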

About HBase Clusters

HBase runs on top of HDFS and provides a fault-tolerant way of storing large quantities of sparse data.

Create a Default HBase Cluster by Using the Serengeti Command-Line Interface

You can use the Serengeti CLI to deploy HBase clusters on HDFS.
This task creates a default HBase cluster, which does not contain the MapReduce framework. To run HBase MapReduce jobs, add JobTracker and TaskTracker or ResourceManager and NodeManager nodes to the default HBase cluster sample specification file /opt/serengeti/samples/default_hbase_cluster.json, then create a cluster using this specification file.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command, and specify the value of the --type parameter as hbase.
cluster create --name cluster_name --type hbase
What to do next
After you deploy the cluster, you can access an HBase database by using several methods. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.

Create an HBase Only Cluster in Big Data Extensions

With Big Data Extensions, you can create an HBase only cluster, which contains only HBase Master, HBase RegionServer, and Zookeeper nodes, but not Namenodes and Datanodes. The advantage of having an HBase only cluster is that multiple HBase clusters can use the same external HDFS.
Procedure
1 Prerequisites for Creating an HBase Only Cluster
Before you can create an HBase only cluster, you must verify that your system meets all of the prerequisites.
2 Prepare the EMC Isilon OneFS as the External HDFS Cluster
If you use EMC Isilon OneFS as the external HDFS cluster to the HBase only cluster, you must create and configure users and user groups, and prepare your Isilon OneFS environment.
3 Create an HBase Only Cluster by Using the Serengeti Command-Line Interface
You can use the Serengeti CLI to create an HBase only cluster.
Prerequisites for Creating an HBase Only Cluster
Before you can create an HBase only cluster, you must verify that your system meets all of the prerequisites.
Prerequisites
- Verify that you started the Serengeti vApp.
- Verify that you have more than one distribution if you want to use a distribution other than the default distribution.
- Verify that you have an existing HDFS cluster to use as the external HDFS cluster.
  To avoid conflicts between the HBase only cluster and the external HDFS cluster, the clusters should use the same Hadoop distribution and version.
- If the external HDFS cluster was not created using Big Data Extensions, verify that the HDFS directory /hadoop/hbase, the group hadoop, and the following users exist in the external HDFS cluster:
  - hdfs
  - hbase
  - serengeti
- If you use the EMC Isilon OneFS as the external HDFS cluster, verify that your Isilon environment is prepared.
  For information about how to prepare your environment, see “Prepare the EMC Isilon OneFS as the External HDFS Cluster.”
Prepare the EMC Isilon OneFS as the External HDFS Cluster
If you use EMC Isilon OneFS as the external HDFS cluster to the HBase only cluster, you must create and configure users and user groups, and prepare your Isilon OneFS environment.
Procedure
1 Log in to one of the Isilon HDFS nodes as user root.
2 Create the users.
  - hdfs
  - hbase
  - serengeti
  - mapred
  The yarn and mapred users should have write, read, and execute permissions to the entire exported HDFS directory.
3 Create the user group hadoop.
4 Create the directory tmp under the root HDFS directory.
5 Set the owner as hdfs:hadoop with the read and write permissions set as 777.
6 Create the directory hadoop under the root HDFS directory.
7 Set the owner as hdfs:hadoop with the read and write permissions set as 775.
8 Create the directory hbase under the directory hadoop.
9 Set the owner as hbase:hadoop with the read and write permissions set as 775.
10 Set the owner of the root HDFS directory as hdfs:hadoop.
Example: Configuring the EMC Isilon OneFS Environment
isi auth users create --name="hdfs"
isi auth users create --name="hbase"
isi auth users create --name="serengeti"
isi auth groups create --name="hadoop"
pw useradd mapred -G wheel
pw useradd yarn -G wheel
chown hdfs:hadoop /ifs
mkdir /ifs/tmp
chmod 777 /ifs/tmp
chown hdfs:hadoop /ifs/tmp
mkdir -p /ifs/hadoop/hbase
chmod -R 775 /ifs/hadoop
chown hdfs:hadoop /ifs/hadoop
chown hbase:hadoop /ifs/hadoop/hbase
What to do next
You are now ready to create the HBase only cluster with the EMC Isilon OneFS as the external cluster.
Create an HBase Only Cluster by Using the Serengeti Command-Line Interface
You can use the Serengeti CLI to create an HBase only cluster.
You must use the default application manager because the other application managers do not support HBase only clusters.
Procedure
1 To define the characteristics of the new cluster, make a copy of the following cluster specification file: /opt/serengeti/samples/hbase_only_cluster.json
2 Replace hdfs://hostname-of-namenode:8020 in the specification file with the namenode uniform resource identifier (URI) of the external HDFS cluster.
3 Access the Serengeti CLI.
4 Run the cluster create command.
cluster create --name clustername --distro distroname --specFile specfile_location
The /opt/serengeti/samples/hbase_only_cluster.json file is a sample specification file for HBase only clusters. It contains the zookeeper, hbase_master, and hbase_regionserver roles, but not the hadoop_namenode/hadoop_datanode role.
5 To verify that the cluster was created, run the cluster list command.
cluster list --name name
After the cluster is created, the system returns Cluster clustername created.
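Steps 1 and 2 of this procedure (copy the sample file, then replace the namenode placeholder) can be scripted. The following Python sketch assumes the placeholder text appears in the sample exactly as written in step 2. The function names and the hdfs:// prefix check are illustrative assumptions, not part of Big Data Extensions.

```python
from pathlib import Path

# Placeholder text named in step 2 of the procedure.
PLACEHOLDER = "hdfs://hostname-of-namenode:8020"


def prepare_spec(sample_text: str, namenode_uri: str) -> str:
    """Return the specification text with the placeholder replaced."""
    if PLACEHOLDER not in sample_text:
        raise ValueError("placeholder not found in specification text")
    if not namenode_uri.startswith("hdfs://"):
        # Assumption: the external namenode URI uses the hdfs scheme.
        raise ValueError("namenode URI should start with hdfs://")
    return sample_text.replace(PLACEHOLDER, namenode_uri)


def copy_and_prepare(sample: Path, dest: Path, namenode_uri: str) -> None:
    """Write a prepared copy of the sample file; the sample stays unchanged."""
    dest.write_text(prepare_spec(sample.read_text(), namenode_uri))
```

For example, copy_and_prepare(Path("/opt/serengeti/samples/hbase_only_cluster.json"), Path("my_hbase_only.json"), "hdfs://nn.example.com:8020") writes a copy that can be passed to the --specFile parameter. The destination file name and the namenode host are made up for the example.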

Create an HBase Cluster with vSphere HA Protection with the Serengeti Command-Line Interface

You can create HBase clusters with separated Hadoop NameNode and HBase Master roles. You can configure vSphere HA protection for the Master roles.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the characteristics of the cluster, including the node group roles and vSphere HA protection.
In this example, the cluster has JobTracker and TaskTracker nodes, which let you run HBase MapReduce jobs. The Hadoop NameNode and HBase Master roles are separated, and both are protected by vSphere HA.
{
  "nodeGroups" : [
    {
      "name" : "zookeeper",
      "roles" : [
        "zookeeper"
      ],
      "instanceNum" : 3,
      "instanceType" : "SMALL",
      "storage" : {
        "type" : "shared",
        "sizeGB" : 20
      },
      "cpuNum" : 1,
      "memCapacityMB" : 3748,
      "haFlag" : "on",
      "configuration" : {
      }
    },
    {
      "name" : "hadoopmaster",
      "roles" : [
        "hadoop_namenode",
        "hadoop_jobtracker"
      ],
      "instanceNum" : 1,
      "instanceType" : "MEDIUM",
      "storage" : {
        "type" : "shared",
        "sizeGB" : 50
      },
      "cpuNum" : 2,
      "memCapacityMB" : 7500,
      "haFlag" : "on",
      "configuration" : {
      }
    },
    {
      "name" : "hbasemaster",
      "roles" : [
        "hbase_master"
      ],
      "instanceNum" : 1,
      "instanceType" : "MEDIUM",
      "storage" : {
        "type" : "shared",
        "sizeGB" : 50
      },
      "cpuNum" : 2,
      "memCapacityMB" : 7500,
      "haFlag" : "on",
      "configuration" : {
      }
    },
    {
      "name" : "worker",
      "roles" : [
        "hadoop_datanode",
        "hadoop_tasktracker",
        "hbase_regionserver"
      ],
      "instanceNum" : 3,
      "instanceType" : "SMALL",
      "storage" : {
        "type" : "local",
        "sizeGB" : 50
      },
      "cpuNum" : 1,
      "memCapacityMB" : 3748,
      "haFlag" : "off",
      "configuration" : {
      }
    },
    {
      "name" : "client",
      "roles" : [
        "hadoop_client",
        "hbase_client"
      ],
      "instanceNum" : 1,
      "instanceType" : "SMALL",
      "storage" : {
        "type" : "shared",
        "sizeGB" : 50
      },
      "cpuNum" : 1,
      "memCapacityMB" : 3748,
      "haFlag" : "off",
      "configuration" : {
      }
    }
  ],
  // we suggest running convert-hadoop-conf.rb to generate "configuration" section and paste the output here
  "configuration" : {
    "hadoop": {
      "core-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
        // note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a sample:
        // "io.file.buffer.size": "4096"
      },
      "hdfs-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
      },
      "mapred-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
      },
      "hadoop-env.sh": {
        // "HADOOP_HEAPSIZE": "",
        // "HADOOP_NAMENODE_OPTS": "",
        // "HADOOP_DATANODE_OPTS": "",
        // "HADOOP_SECONDARYNAMENODE_OPTS": "",
        // "HADOOP_JOBTRACKER_OPTS": "",
        // "HADOOP_TASKTRACKER_OPTS": "",
        // "HADOOP_CLASSPATH": "",
        // "JAVA_HOME": "",
        // "PATH": ""
      },
      "log4j.properties": {
        // "hadoop.root.logger": "DEBUG,DRFA",
        // "hadoop.security.logger": "DEBUG,DRFA"
      },
      "fair-scheduler.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
        // "text": "the full content of fair-scheduler.xml in one line"
      },
      "capacity-scheduler.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
      },
      "mapred-queue-acls.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons
        // "mapred.queue.queue-name.acl-submit-job": "",
        // "mapred.queue.queue-name.acl-administer-jobs": ""
      }
    },
    "hbase": {
      "hbase-site.xml": {
        // check for all settings at http://hbase.apache.org/configuration.html#hbase.site
      },
      "hbase-env.sh": {
        // "JAVA_HOME": "",
        // "PATH": "",
        // "HBASE_CLASSPATH": "",
        // "HBASE_HEAPSIZE": "",
        // "HBASE_OPTS": "",
        // "HBASE_USE_GC_LOGFILE": "",
        // "HBASE_JMX_BASE": "",
        // "HBASE_MASTER_OPTS": "",
        // "HBASE_REGIONSERVER_OPTS": "",
        // "HBASE_THRIFT_OPTS": "",
        // "HBASE_ZOOKEEPER_OPTS": "",
        // "HBASE_REGIONSERVERS": "",
        // "HBASE_SSH_OPTS": "",
        // "HBASE_NICENESS": "",
        // "HBASE_SLAVE_SLEEP": ""
      },
      "log4j.properties": {
        // "hbase.root.logger": "DEBUG,DRFA"
      }
    },
    "zookeeper": {
      "java.env": {
        // "JVMFLAGS": "-Xmx2g"
      },
      "log4j.properties": {
        // "zookeeper.root.logger": "DEBUG,DRFA"
      }
    }
  }
}
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the cluster specification file.
cluster create --name cluster_name --specFile full_path/spec_filename
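Before you run cluster create, it can be useful to review which node groups in the specification file have haFlag set to on. In the example above, every group with haFlag on uses shared storage. The following Python sketch reports any HA-protected group that declares local storage. It assumes that each // comment in the file sits on its own line, and it treats the shared-storage pairing as an observation from the example rather than a documented rule. The function names are hypothetical.

```python
import json


def load_spec(text: str) -> dict:
    """Parse a specification after dropping whole-line // comments."""
    kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


def ha_groups_on_local_storage(spec: dict) -> list:
    """Names of node groups with haFlag "on" whose storage type is local."""
    flagged = []
    for group in spec["nodeGroups"]:
        storage_type = group.get("storage", {}).get("type", "")
        if group.get("haFlag") == "on" and storage_type.lower() == "local":
            flagged.append(group["name"])
    return flagged
```

An empty list means that no HA-protected group in the file uses local storage.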

Create an HBase Only Cluster with External Namenode HA HDFS Cluster

You can create an HBase only cluster with two namenodes in an active-passive HA configuration. The HA namenode provides a hot standby namenode that, in the event of a failure, can perform the role of the active namenode with no downtime.
The following restrictions apply to this task:
- Worker only clusters are not supported on Ambari and Cloudera Manager application managers.
- MapReduce v1 worker only clusters and HBase only clusters created using the MapR distribution are not supported.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 To define the characteristics of the new cluster, make a copy of the following cluster specification file: /opt/serengeti/samples/hbase_only_cluster.json
2 Replace hdfs://hostname-of-namenode:8020 in this spec file with the namenode uniform resource identifier (URI) of the external namenode HA HDFS cluster. The namenode URI is the value of the fs.defaultFS parameter in the core-site.xml of the external cluster.
3 Change the configuration section of the HBase only cluster specification file as shown in the following example. All the values can be found in hdfs-site.xml of the external cluster.
"configuration" : {
  "hadoop": {
    "hdfs-site.xml": {
      "dfs.nameservices": "dataMaster",
      "dfs.ha.namenodes.dataMaster": "namenode0,namenode1",
      "dfs.client.failover.proxy.provider.dataMaster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
      "dfs.namenode.rpc-address.dataMaster.namenode0": "10.555.xx.xxx:xxx1",
      "dfs.namenode.http-address.dataMaster.namenode0": "10.555.xx.xxx:xxx2",
      "dfs.namenode.rpc-address.dataMaster.namenode1": "10.555.xx.xxx:xxx3",
      "dfs.namenode.http-address.dataMaster.namenode1": "10.555.xx.xxx:xxx4"
    }
  }
}

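The hdfs-site.xml keys in the preceding example follow a regular pattern: each key embeds the nameservice name, and the per-namenode keys also embed a namenode ID. The Python sketch below builds the same set of keys from a nameservice name and a map of namenode addresses. The function name and the sample addresses are hypothetical. Take the real values from hdfs-site.xml of the external cluster.

```python
def ha_client_properties(nameservice: str, namenodes: dict) -> dict:
    """Build hdfs-site.xml client properties for an HA nameservice.

    namenodes maps a namenode ID to a (rpc_address, http_address) pair.
    """
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        f"dfs.client.failover.proxy.provider.{nameservice}":
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    }
    for nn_id, (rpc, http) in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = rpc
        props[f"dfs.namenode.http-address.{nameservice}.{nn_id}"] = http
    return props


# Illustrative addresses only (documentation range 192.0.2.0/24).
example = ha_client_properties(
    "dataMaster",
    {
        "namenode0": ("192.0.2.11:8020", "192.0.2.11:50070"),
        "namenode1": ("192.0.2.12:8020", "192.0.2.12:50070"),
    },
)
```

The resulting dictionary can be serialized with json.dumps and pasted under "hdfs-site.xml" in the configuration section.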
About MapReduce Clusters

MapReduce is a framework for processing problems in parallel across huge data sets. The MapReduce framework distributes a number of operations on the data set to each node in the network.

Create a MapReduce v2 (YARN) Cluster by Using the Serengeti Command-Line Interface

You can create MapReduce v2 (YARN) clusters if you want to create a cluster that separates the resource management and processing components.
To create a MapReduce v2 (YARN) cluster, create a cluster specification file modeled after the /opt/serengeti/samples/default_hadoop_yarn_cluster.json file, and specify the --specFile parameter and your cluster specification file in the cluster create ... command.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create ... command.
This example creates a customized MapReduce v2 cluster using the CDH4 distribution according to the sample cluster specification file default_hadoop_yarn_cluster.json.
cluster create --name cluster_name --distro cdh4 --specFile /opt/serengeti/samples/default_hadoop_yarn_cluster.json

Create a MapReduce v1 Worker Only Cluster with External Namenode HA HDFS Cluster

You can create a MapReduce v1 worker only cluster with two namenodes in an active-passive HA configuration. The HA namenode provides a hot standby namenode that, in the event of a failure, can perform the role of the active namenode with no downtime.
The following restrictions apply to this task:
- Worker only clusters are not supported on Ambari and Cloudera Manager application managers.
- You cannot use the MapR distribution to create MapReduce v1 worker only clusters and HBase only clusters.
Prerequisites
- Start the Big Data Extensions vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- Ensure that you have an External Namenode HA HDFS cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 To define the characteristics of the new cluster, open the following cluster specification file to modify: /opt/serengeti/samples/compute_workers_only_mr1.json
2 Replace hdfs://hostname-of-namenode:8020 in this spec file with the namenode uniform resource identifier (URI) of the external namenode HA HDFS cluster. The namenode URI is the value of the fs.defaultFS parameter in the core-site.xml of the external cluster.
3 Replace the hostname-of-jobtracker in the specification file with the FQDN or IP address of the JobTracker in the external cluster.
4 Change the configuration section of the MapReduce Worker only cluster specification file as shown in the following example. All the values can be found in hdfs-site.xml of the external cluster.
{
  "externalHDFS": "hdfs://dataMaster",
  "externalMapReduce": "xx.xxx.xxx.xxx:8021",
  "nodeGroups":[
    {
      "name": "worker",
      "roles": [
        "hadoop_tasktracker"
      ],
      "instanceNum": 3,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 20
      }
    }
  ],
  "configuration" : {
    "hadoop": {
      "hdfs-site.xml": {
        "dfs.nameservices": "dataMaster",
        "dfs.ha.namenodes.dataMaster": "namenode0,namenode1",
        "dfs.client.failover.proxy.provider.dataMaster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
        "dfs.namenode.rpc-address.dataMaster.namenode0": "10.111.xx.xxx:xxx2",
        "dfs.namenode.http-address.dataMaster.namenode0": "10.111.xx.xxx:xxx3",
        "dfs.namenode.rpc-address.dataMaster.namenode1": "10.111.xx.xxx:xxx4",
        "dfs.namenode.http-address.dataMaster.namenode1": "10.111.xx.xxx:xxx5"
      }
    }
  }
}

Create a MapReduce v2 Worker Only Cluster with External Namenode HA HDFS Cluster

You can create a MapReduce v2 (Yarn) worker only cluster with two namenodes in an active-passive HA configuration. The HA namenode provides a hot standby namenode that, in the event of a failure, can perform the role of the active namenode with no downtime.
The following restrictions apply to this task:
- Worker only clusters are not supported on Ambari and Cloudera Manager application managers.
- You cannot use a MapR distribution to deploy MapReduce v1 worker only clusters and HBase only clusters.
Prerequisites
- Start the Big Data Extensions vApp.
- Ensure that you have an external Namenode HA HDFS cluster.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 To define the characteristics of the new cluster, open the following cluster specification file to modify: /opt/serengeti/samples/compute_workers_only_yarn.json
2 Replace hdfs://hostname-of-namenode:8020 in this spec file with the namenode uniform resource identifier (URI) of the external namenode HA HDFS cluster. The namenode URI is the value of the fs.defaultFS parameter in the core-site.xml of the external cluster.
3 Replace the hostname-of-resourcemanager in the specification file with the FQDN or IP address of the ResourceManager in the external cluster.
4 Change the configuration section of the Yarn Worker only cluster specification file as shown in the following example. All the values can be found in hdfs-site.xml of the external cluster.
{
  "externalHDFS": "hdfs://dataMaster",
  "externalMapReduce": "xx.xxx.xxx.xxx:8021",
  "nodeGroups":[
    {
      "name": "worker",
      "roles": [
        "hadoop_nodemanager"
      ],
      "instanceNum": 3,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 20
      }
    }
  ],
  "configuration" : {
    "hadoop": {
      "hdfs-site.xml": {
        "dfs.nameservices": "dataMaster",
        "dfs.ha.namenodes.dataMaster": "namenode0,namenode1",
        "dfs.client.failover.proxy.provider.dataMaster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
        "dfs.namenode.rpc-address.dataMaster.namenode0": "10.555.xx.xxx:xxx1",
        "dfs.namenode.http-address.dataMaster.namenode0": "10.555.xx.xxx:xxx2",
        "dfs.namenode.rpc-address.dataMaster.namenode1": "10.555.xx.xxx:xxx3",
        "dfs.namenode.http-address.dataMaster.namenode1": "10.555.xx.xxx:xxx4"
      }
    }
  }
}
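In a worker only specification such as this one, the nameservice in externalHDFS (dataMaster) must also appear in dfs.nameservices and in the keys that depend on it. A mismatch is easy to introduce when you copy values by hand. The following Python sketch checks a parsed specification for that consistency. The function name is hypothetical, and the requirement of at least two namenode IDs reflects the two-namenode HA setup described in this task.

```python
from urllib.parse import urlparse


def check_external_hdfs(spec: dict) -> list:
    """Return a list of problems; an empty list means the spec is consistent."""
    problems = []
    # For "hdfs://dataMaster", the network location is "dataMaster".
    nameservice = urlparse(spec.get("externalHDFS", "")).netloc
    hdfs_site = (spec.get("configuration", {})
                     .get("hadoop", {})
                     .get("hdfs-site.xml", {}))
    declared = [s.strip() for s in hdfs_site.get("dfs.nameservices", "").split(",") if s.strip()]
    if nameservice not in declared:
        problems.append(f"externalHDFS names '{nameservice}', but dfs.nameservices is {declared}")
        return problems
    ids_value = hdfs_site.get(f"dfs.ha.namenodes.{nameservice}", "")
    ids = [i.strip() for i in ids_value.split(",") if i.strip()]
    if len(ids) < 2:
        problems.append(f"dfs.ha.namenodes.{nameservice} should list two namenode IDs")
    for nn_id in ids:
        key = f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"
        if key not in hdfs_site:
            problems.append(f"missing {key}")
    return problems
```

Run the check on the dictionary produced by json.load before you pass the file to cluster create.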

About Data Compute Clusters

You can separate the data and compute nodes in a Hadoop cluster, and you can control how nodes are placed on the vSphere ESXi hosts in your environment.
You can create a compute-only cluster to run MapReduce jobs. Compute-only clusters run only MapReduce services that read data from external HDFS clusters and that do not need to store data.
Ambari and Cloudera Manager application managers do not support data-compute separation and compute-only clusters.

Create a Data-Compute Separated Cluster with Topology Awareness and Placement Constraints

You can create clusters with separate data and compute nodes, and define topology and placement policy constraints to distribute the nodes among the physical racks and the virtual machines.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the virtual machine automatic migration of the cluster. Although this prevents vSphere from migrating the virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing such management functions outside of the Big Data Extensions environment might break the placement policy of the cluster, such as the number of instances per host and the group associations. Even if you do not specify a placement policy, using vCenter Server to migrate clusters can break the default ROUNDROBIN placement policy constraints.
Prerequisites
- Start the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
- Create a rack-host mapping information file.
- Upload the rack-host file to the Serengeti server with the topology upload command.
Procedure
1 Create a cluster specification file to define the characteristics of the cluster, including the node groups, topology, and placement constraints.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD 1.1 or later cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce network traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail or the cluster is created but does not function.
In this example, the cluster has groupAssociations and instancePerHost constraints for the compute node group, and a groupRacks constraint for the data node group.
Four data nodes and eight compute nodes are placed on the same four ESXi hosts, which are selected evenly from rack1, rack2, and rack3. Each ESXi host has one data node and two compute nodes. As defined for the compute node group, compute nodes are placed only on ESXi hosts that have data nodes.
This cluster definition requires that you configure datastores and resource pools for at least four hosts, and that there is sufficient disk space for Serengeti to perform the necessary placements during deployment.
{
  "nodeGroups":[
    {
      "name": "master",
      "roles": [
        "hadoop_namenode",
        "hadoop_jobtracker"
      ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500
    },
    {
      "name": "data",
      "roles": [
        "hadoop_datanode"
      ],
      "instanceNum": 4,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 50
      },
      "placementPolicies": {
        "instancePerHost": 1,
        "groupRacks": {
          "type": "ROUNDROBIN",
          "racks": ["rack1", "rack2", "rack3"]
        }
      }
    },
    {
      "name": "compute",
      "roles": [
        "hadoop_tasktracker"
      ],
      "instanceNum": 8,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 20
      },
      "placementPolicies": {
        "instancePerHost": 2,
        "groupAssociations": [
          {
            "reference": "data",
            "type": "STRICT"
          }
        ]
      }
    },
    {
      "name": "client",
      "roles": [
        "hadoop_client",
        "hive",
        "pig"
      ],
      "instanceNum": 1,
      "cpuNum": 1,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 50
      }
    }
  ],
  "configuration": {
  }
}
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the cluster specification file.
cluster create --name cluster_name --specFile full_path/spec_filename
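A groupAssociations entry is only meaningful if its reference names another node group in the same file and its type is STRICT or WEAK. The following Python sketch checks a parsed specification for those two conditions before you run cluster create. The function name is hypothetical, and the checks are a reading of the constraints described in this guide, not the validation that Serengeti itself performs.

```python
def check_associations(spec: dict) -> list:
    """Return a list of problems found in groupAssociations entries."""
    names = {group["name"] for group in spec["nodeGroups"]}
    problems = []
    for group in spec["nodeGroups"]:
        policies = group.get("placementPolicies", {})
        for assoc in policies.get("groupAssociations", []):
            ref = assoc.get("reference")
            if ref not in names:
                problems.append(f"{group['name']}: unknown reference '{ref}'")
            elif ref == group["name"]:
                problems.append(f"{group['name']}: group references itself")
            if assoc.get("type") not in ("STRICT", "WEAK"):
                problems.append(f"{group['name']}: invalid type '{assoc.get('type')}'")
    return problems
```

For the example specification in step 1, the compute group references the data group with type STRICT, so the function returns an empty list.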

Create a Data-Compute Separated Cluster with No Node Placement Constraints

You can create a cluster with separate data and compute nodes without node placement constraints.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the characteristics of the cluster.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD 1.1 or later cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce network traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail or the cluster is created but does not function.
In this example, the cluster has separate data and compute nodes, without node placement constraints. Four data nodes and eight compute nodes are created and put into individual virtual machines. The number of nodes is configured by the instanceNum attribute.
{
  "nodeGroups":[
    {
      "name": "master",
      "roles": [
        "hadoop_namenode",
        "hadoop_jobtracker"
      ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500
    },
    {
      "name": "data",
      "roles": [
        "hadoop_datanode"
      ],
      "instanceNum": 4,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 50
      }
    },
    {
      "name": "compute",
      "roles": [
        "hadoop_tasktracker"
      ],
      "instanceNum": 8,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 20
      }
    },
    {
      "name": "client",
      "roles": [
        "hadoop_client",
        "hive",
        "pig"
      ],
      "instanceNum": 1,
      "cpuNum": 1,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 50
      }
    }
  ],
  "configuration": {
  }
}
2 Access the Serengeti CLI.
3 Run the cluster create command and specify the cluster specification file.
cluster create --name cluster_name --specFile full_path/spec_filename

Create a Data-Compute Separated Cluster with Placement Policy Constraints

You can create a cluster with separate data and compute nodes, and define placement policy constraints to distribute the nodes among the virtual machines as you want.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the virtual machine automatic migration of the cluster. Although this prevents vSphere from migrating the virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing such management functions outside of the Big Data Extensions environment might break the placement policy of the cluster, such as the number of instances per host and the group associations. Even if you do not specify a placement policy, using vCenter Server to migrate clusters can break the default ROUNDROBIN placement policy constraints.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define characteristics of the cluster, including the node groups and placement policy constraints.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD 1.1 or later cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce network traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail or the cluster is created but does not function.
In this example, the cluster has data-compute separated nodes, and the data and compute node groups each have a placementPolicies constraint. After a successful provisioning, four data nodes and eight compute nodes are created and put into individual virtual machines. With the instancePerHost=1 constraint, the four data nodes are placed on four ESXi hosts. The eight compute nodes are put onto four ESXi hosts: two nodes on each ESXi host.
This cluster specification requires that you configure datastores and resource pools for at least four hosts, and that there is sufficient disk space for Serengeti to perform the necessary placements during deployment.
{
  "nodeGroups":[
    {
      "name": "master",
      "roles": [
        "hadoop_namenode",
        "hadoop_jobtracker"
      ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500
    },
    {
      "name": "data",
      "roles": [
        "hadoop_datanode"
      ],
      "instanceNum": 4,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 50
      },
      "placementPolicies": {
        "instancePerHost": 1
      }
    },
    {
      "name": "compute",
      "roles": [
        "hadoop_tasktracker"
      ],
      "instanceNum": 8,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 20
      },
      "placementPolicies": {
        "instancePerHost": 2
      }
    },
    {
      "name": "client",
      "roles": [
        "hadoop_client",
        "hive",
        "pig"
      ],
      "instanceNum": 1,
      "cpuNum": 1,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 50
      }
    }
  ],
  "configuration": {
  }
}
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the cluster specification file.
cluster create --name cluster_name --specFile full_path/spec_filename
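The host counts quoted in step 1 follow from dividing instanceNum by instancePerHost for each node group. The following Python sketch reproduces that arithmetic for the data and compute groups of the example. The function name is hypothetical, and rounding up when the division is not exact is an assumption made for the sketch.

```python
import math
from typing import Optional


def hosts_needed(group: dict) -> Optional[int]:
    """ESXi hosts a node group occupies, or None without instancePerHost."""
    per_host = group.get("placementPolicies", {}).get("instancePerHost")
    if per_host is None:
        return None
    # Assumption: a partially filled host still counts as one host.
    return math.ceil(group["instanceNum"] / per_host)


data = {"name": "data", "instanceNum": 4, "placementPolicies": {"instancePerHost": 1}}
compute = {"name": "compute", "instanceNum": 8, "placementPolicies": {"instancePerHost": 2}}
print(hosts_needed(data), hosts_needed(compute))  # prints: 4 4
```

Both groups need four hosts, which matches the requirement to configure datastores and resource pools for at least four hosts.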

Create a Compute-Only Cluster with the Default Application Manager

You can create compute-only clusters to run MapReduce jobs on existing HDFS clusters, including storage solutions that serve as an external HDFS.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD 1.1 or later cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce network traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail or the cluster is created but does not function.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file that is modeled on the Serengeti compute_only_cluster.json sample cluster specification file found in the Serengeti cli/samples directory.
2 Add the following content to a new cluster specification file.
In this example, the externalHDFS field points to an HDFS. Assign the hadoop_jobtracker role to the master node group and the hadoop_tasktracker role to the worker node group.
The externalHDFS field conflicts with node groups that have hadoop_namenode and hadoop_datanode roles. This conflict might cause the cluster creation to fail or, if successfully created, the cluster might not work correctly. To avoid this problem, define only a single HDFS.
{
  "externalHDFS": "hdfs://hostname-of-namenode:8020",
  "nodeGroups": [
    {
      "name": "master",
      "roles": [
        "hadoop_jobtracker"
      ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500
    },
    {
      "name": "worker",
      "roles": [
        "hadoop_tasktracker"
      ],
      "instanceNum": 4,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 20
      }
    },
    {
      "name": "client",
      "roles": [
        "hadoop_client",
        "hive",
        "pig"
      ],
      "instanceNum": 1,
      "cpuNum": 1,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 50
      }
    }
  ],
  "configuration" : {
  }
}
3 Access the Serengeti CLI.
4 Run the cluster create command and include the cluster specification file parameter and associated filename.
cluster create --name cluster_name --distro distro_name --specFile path/spec_file_name
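Step 2 warns that the externalHDFS field conflicts with node groups that have the hadoop_namenode or hadoop_datanode role. The following Python sketch detects that conflict in a parsed specification before you run cluster create. The function name is hypothetical.

```python
# Roles that define an HDFS inside the cluster, as named in step 2.
CONFLICTING_ROLES = {"hadoop_namenode", "hadoop_datanode"}


def external_hdfs_conflicts(spec: dict) -> list:
    """Return (group name, conflicting roles) pairs; empty if none."""
    if "externalHDFS" not in spec:
        return []
    conflicts = []
    for group in spec.get("nodeGroups", []):
        overlap = CONFLICTING_ROLES & set(group.get("roles", []))
        if overlap:
            conflicts.append((group["name"], sorted(overlap)))
    return conflicts
```

An empty result means that the file defines only a single HDFS, the external one.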

Create a Compute-Only Cluster with the Cloudera Manager Application Manager

You can create compute-only clusters to run MapReduce jobs on existing HDFS clusters, including storage solutions that serve as an external HDFS.
You can use a Cloudera Manager application manager with any external HDFS.
If you use EMC Isilon OneFS as the external HDFS cluster to the HBase only cluster, you must create and configure users and user groups, and prepare your Isilon OneFS environment. See “Prepare the EMC Isilon OneFS as the External HDFS Cluster,” on page 40.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file that is modeled on the yarn_compute_only_cluster.json sample
cluster specification file found in the directory /opt/serengeti/samples/cloudera-manager/ on the Serengeti server.
2 Add the following code to your new cluster specification file.
In this cluster specification file, the default_fs_name field points to an HDFS Namenode URI and the
webhdfs_url field points to an HDFS web URL.
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": [ "YARN_RESOURCE_MANAGER", "YARN_JOB_HISTORY" ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": { "type": "SHARED", "sizeGB": 50 },
      "haFlag": "on",
      "configuration": { }
    },
    {
      "name": "worker",
      "roles": [ "YARN_NODE_MANAGER", "GATEWAY" ],
      "instanceNum": 3,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": { "type": "LOCAL", "sizeGB": 50 },
      "haFlag": "off",
      "configuration": { }
    }
  ],
  "configuration": {
    "ISILON": {
      // service level configurations
      // check for all settings by running "appmanager list --name <name> --configurations"
      "default_fs_name": "hdfs://FQDN:8020",
      "webhdfs_url": "hdfs://FQDN:8020/webhdfs/v1"
    },
    "YARN": {
      // service level configurations
    },
    "YARN_RESOURCE_MANAGER": { },
    "YARN_NODE_MANAGER": {
      "yarn_nodemanager_local_dirs": "/yarn/nm"
    }
  }
}
3 Access the Serengeti CLI.
4 Run the cluster create command and include the cluster specification file parameter and associated
filename.
cluster create --name computeOnlyCluster_name --appManager appManager_name
--distro distro_name --specFile path/spec_file_name
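The two ISILON settings in the sample specification differ only in their path suffix, so they can be derived from a single FQDN. The following Python sketch shows one way to do that. The function name isilon_service_config is hypothetical, and the sketch assumes that your environment uses the same URL pattern and port as the sample file.

```python
def isilon_service_config(fqdn, port=8020):
    """Build the ISILON service-level settings used in the sample file.

    Assumption: both URLs follow the pattern shown in the sample
    specification; 'fqdn' is the FQDN of the external HDFS name node.
    """
    base = f"hdfs://{fqdn}:{port}"
    return {
        "default_fs_name": base,
        "webhdfs_url": f"{base}/webhdfs/v1",
    }


# isilon.example.com is a placeholder host name.
print(isilon_service_config("isilon.example.com"))
```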

Create a Compute-Only Cluster with the Ambari Application Manager and Isilon

You can create a compute-only cluster with the Ambari application manager by using Isilon OneFS. To create a compute-only cluster with Isilon OneFS you must enable Isilon SmartConnect (network load balancing).
To use EMC Isilon OneFS as the external HDFS cluster to the HBase-only cluster, you must create and configure users and user groups, and prepare your Isilon OneFS environment. See “Prepare the EMC Isilon OneFS as the External HDFS Cluster,” on page 40.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use a Hadoop distribution other than the default Apache Bigtop distribution, add one or more vendor distributions to your Big Data Extensions environment. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Verify that the Hadoop distribution you want to use is compatible with Isilon OneFS. Visit the EMC Web site and see Supported Hadoop Distributions in OneFS.
Procedure
1 Create a cluster specification file modeled on one of the following sample cluster specification files, hdp_v2_1_yarn_compute_only_cluster.json or hdp_v2_2_yarn_compute_only_cluster.json, which you can find in the directory /opt/serengeti/samples/ambari/ on the Serengeti server.
2 Enable Isilon SmartConnect.
isi networks modify subnet --sc-service-addr=SmartConnect_IP --name=subnet_name
isi networks modify pool --name=subnet_name:pool_name --sc-subnet=subnet_name --zone=zone_name
3 Specify the Ambari server and name node FQDN on your Isilon environment.
isi zone zones modify System --hdfs-ambari-namenode=smart_connect_FQDN
isi zone zones modify System --hdfs-ambari-server=ambari_server_FQDN
4 Edit the cluster specification
file, /opt/serengeti/samples/ambari/hdp_v2_*_yarn_compute_only_cluster.json, and set
externalNamenode to the Isilon SmartConnect FQDN. If the externalSecondaryNamenode attribute in the
cluster specification file is set to the same value as the externalNamenode, remove the entry for the
externalSecondaryNamenode.
5 Access the Serengeti CLI.
6 Run the cluster create command and include the cluster specification file parameter and associated
filename.
cluster create --name computeOnlyCluster_name --appManager appManager_name
--distro distro_name --specFile path/spec_file_name
What to do next
Verify that your Ambari-managed, compute-only cluster is created successfully with the configuration necessary for your environment and usage.
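The edit described in step 4 of the procedure can be sketched in Python as follows. The function name apply_smartconnect_fqdn is hypothetical, and the sketch assumes that externalNamenode and externalSecondaryNamenode are top-level attributes of the parsed specification. Check your sample file for their actual location.

```python
def apply_smartconnect_fqdn(spec, smartconnect_fqdn):
    """Set externalNamenode and drop a redundant externalSecondaryNamenode.

    Assumption: both attributes are top-level keys of the parsed
    specification dictionary.
    """
    updated = dict(spec)  # shallow copy so the caller's dict is untouched
    updated["externalNamenode"] = smartconnect_fqdn
    # Remove the secondary entry only when it duplicates the name node value.
    if updated.get("externalSecondaryNamenode") == smartconnect_fqdn:
        del updated["externalSecondaryNamenode"]
    return updated


# sc.example.com is a placeholder for the Isilon SmartConnect FQDN.
sample = {
    "externalNamenode": "FQDN",
    "externalSecondaryNamenode": "sc.example.com",
}
print(apply_smartconnect_fqdn(sample, "sc.example.com"))
```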

Create a Compute Workers Only Cluster With Non-Namenode HA HDFS Cluster

If you already have a physical Hadoop cluster and want to do more CPU or memory intensive operations, you can increase the compute capacity by provisioning a worker only cluster. The worker only cluster is a part of the physical Hadoop cluster and can be scaled out elastically.
With the compute workers only clusters, you can "burst out to virtual." It is a temporary operation that involves borrowing resources when you need them and then returning the resources when you no longer need them. With "burst out to virtual," you spin up compute-only worker nodes and add them to either an existing physical or virtual Hadoop cluster.
Restrictions
Worker only clusters are not supported on Ambari and Cloudera Manager application managers.
These options are not supported on compute workers only clusters:
--appManager appmanager_name
--type cluster_type
--hdfsNetworkName hdfs_network_name
--mapredNetworkName mapred_network_name
Prerequisites
Start the Big Data Extensions vApp.
Ensure that you have an existing Hadoop cluster.
Verify that you have the IP addresses of the NameNode and ResourceManager node.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 To define the characteristics of the new cluster, make a copy of the following cluster specification file: /opt/serengeti/samples/compute_workers_only_mr1.json
2 Replace hdfs://hostname-of-namenode:8020 in the specification file with the namenode uniform resource
identifier (URI) of the external HDFS cluster.
3 Replace the hostname-of-jobtracker in the specification file with the FQDN or IP address of the JobTracker
in the external cluster.
4 Change the configuration section of the MapReduce Worker only cluster specification file. All the
values can be found in hdfs-site.xml of the external cluster.
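The placeholder replacements in steps 2 and 3 amount to two text substitutions in the copied specification file. The following Python sketch illustrates them. The function name fill_placeholders and the input fragment are hypothetical, and the fragment does not reproduce the layout of the real sample file.

```python
def fill_placeholders(spec_text, namenode_uri, jobtracker_host):
    """Replace the sample placeholders with values for the external cluster."""
    return (
        spec_text
        .replace("hdfs://hostname-of-namenode:8020", namenode_uri)
        .replace("hostname-of-jobtracker", jobtracker_host)
    )


# Made-up fragment that contains both placeholders; host names are examples.
fragment = "hdfs://hostname-of-namenode:8020 hostname-of-jobtracker"
print(fill_placeholders(fragment, "hdfs://nn.example.com:8020", "jt.example.com"))
```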

About Customized Clusters

You can use an existing cluster specification file to create clusters by using the same configuration as your previously created clusters. You can also edit the cluster specification file to customize the cluster configuration.

Create a Default Serengeti Hadoop Cluster with the Serengeti Command-Line Interface

You can create as many clusters as you want in your Serengeti environment but your environment must meet all prerequisites.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Deploy a default Serengeti Hadoop cluster on vSphere.
cluster create --name cluster_name
The only valid characters for cluster names are alphanumeric characters and underscores. When you choose the cluster name, also consider the applicable vApp name. Together, the vApp and cluster names must be fewer than 80 characters.
During the deployment process, real-time progress updates appear on the command-line.
What to do next
After the deployment finishes, you can run Hadoop commands and view the IP addresses of the Hadoop node virtual machines from the Serengeti CLI.
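The cluster naming rules in step 2 can be checked before you run the command. The following Python sketch is one possible check and is not part of Serengeti. The function name is hypothetical, and the sketch assumes that "alphanumeric" means ASCII letters and digits and that the length limit applies to the sum of the two name lengths.

```python
import re

# Assumption: only ASCII letters, digits, and underscores are allowed.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_cluster_name(cluster_name, vapp_name):
    """Check the character rule and the combined length rule."""
    if not NAME_PATTERN.fullmatch(cluster_name):
        return False
    # Assumption: the two name lengths are added and must stay below 80.
    return len(vapp_name) + len(cluster_name) < 80


print(is_valid_cluster_name("my_cluster1", "Serengeti_vApp"))
print(is_valid_cluster_name("my-cluster", "Serengeti_vApp"))
```

The first call passes both rules. The second fails because of the hyphen.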

Create a Basic Cluster with the Serengeti Command-Line Interface

You can create a basic cluster in your Serengeti environment. A basic cluster is a group of virtual machines provisioned and managed by Serengeti. Serengeti helps you to plan and provision the virtual machines to your specifications and use the virtual machines to install Big Data applications.
A basic cluster does not install the Big Data application packages that are normally installed when you create a cluster. Instead, you install and manage Big Data applications with third-party application management tools such as Ambari or Cloudera Manager within your Big Data Extensions environment, and integrate them with your Hadoop software. Creating a basic cluster provisions only the virtual machines; it does not deploy a Hadoop cluster. You must deploy software into the virtual machines by using an external third-party application management tool.
The Serengeti package includes an annotated sample cluster specification file that you can use as an example when you create your basic cluster specification file. In the Serengeti Management Server, the sample specification file is located at /opt/serengeti/samples/basic_cluster.json. You can modify the configuration values in the sample cluster specification file to meet your requirements. The only value you cannot change is the value assigned to the role for each node group, which must always be basic.
You can deploy a basic cluster with the Big Data Extensions plug-in using a customized cluster specification file.
To deploy software within the basic cluster virtual machines, use the cluster list --detail command, or run serengeti-ssh.sh cluster_name to obtain the IP address of the virtual machine. You can then use the IP address with management applications such as Ambari or Cloudera Manager to provision the virtual machine with software of your choosing. You can configure the management application to use the user name serengeti, and the password you specified when creating the basic cluster within Big Data Extensions when the management tool needs a user name and password to connect to the virtual machines.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the cluster, as well as the Big Data software you intend to deploy.
Procedure
1 Create a specification file to define the basic cluster's characteristics.
You must use the basic role for each node group you define for the basic cluster.
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": [ "basic" ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 3768,
      "storage": { "type": "LOCAL", "sizeGB": 250 },
      "haFlag": "on"
    },
    {
      "name": "worker",
      "roles": [ "basic" ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 3768,
      "storage": { "type": "LOCAL", "sizeGB": 250 },
      "haFlag": "off"
    }
  ]
}
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the basic cluster specification file.
cluster create --name cluster_name --specFile /opt/serengeti/samples/basic_cluster.json --password
NOTE When creating a basic cluster, you do not need to specify a Hadoop distribution type using the
--distro option. The reason for this is that there is no Hadoop distribution being installed within the
basic cluster to be managed by Serengeti.
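Because every node group in a basic cluster must use the basic role, a quick check of the specification can catch mistakes before deployment. The following Python sketch is illustrative only. The function name non_basic_groups is hypothetical, and the specification is assumed to be parsed into a dictionary.

```python
def non_basic_groups(spec):
    """Return the names of node groups whose roles are not exactly ["basic"]."""
    return [
        group["name"]
        for group in spec.get("nodeGroups", [])
        if group.get("roles") != ["basic"]
    ]


spec = {
    "nodeGroups": [
        {"name": "master", "roles": ["basic"]},
        {"name": "worker", "roles": ["basic", "hadoop_datanode"]},
    ]
}
print(non_basic_groups(spec))
```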

Create a Cluster with an Application Manager by Using the Serengeti Command-Line Interface

You can use the Serengeti CLI to add a cluster with an application manager other than the default application manager. Then you can manage your cluster with the new application manager.
NOTE If you want to create a local yum repository, you must create the repository before you create the cluster.
Prerequisites
Connect to an application manager.
Ensure that you have adequate resources allocated to run the cluster. For information about resource requirements, see the documentation for your application manager.
Verify that you have more than one distribution if you want to use a distribution other than the default distribution. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command.
cluster create --name cluster_name --appManager appmanager_name [--localrepoURL local_repository_url]
If you do not use the --appManager parameter, the default application manager is used.

Create a Compute Workers Only Cluster by Using the vSphere Web Client

If you already have a physical Hadoop cluster and want to do more CPU or memory intensive operations, you can increase the compute capacity by provisioning a workers only cluster. The workers only cluster is a part of the physical Hadoop cluster and can be scaled out elastically.
With the compute workers only clusters, you can "burst out to virtual." It is a temporary operation that involves borrowing resources when you need them and then returning the resources when you no longer need them. With "burst out to virtual," you spin up compute-only worker nodes and add them to either an existing physical or virtual Hadoop cluster.
Worker only clusters are not supported on Ambari and Cloudera Manager application managers.
Prerequisites
Ensure that you have an existing Hadoop cluster.
Verify that you have the IP addresses of the NameNode and ResourceManager node.
Procedure
1 Click Create Big Data Cluster on the objects pane.
2 In the Create Big Data Cluster wizard, choose the same distribution as the Hadoop cluster.
3 Set the DataMaster URL to hdfs://namenode_ip_or_fqdn:8020.
4 Set the ComputeMaster URL to the nodeManager IP address or FQDN.
5 Follow the steps in the wizard and add the other resources.
There will be three node managers in the cluster. The three new node managers are registered to the resource manager.

Create a Cluster with a Custom Administrator Password with the Serengeti Command-Line Interface

When you create a cluster, you can assign a custom administrator password to all the nodes in the cluster. Custom administrator passwords let you directly log in to the nodes instead of having to first log in to the Serengeti Management server.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command and include the --password parameter.
cluster create --name cluster_name --password
3 Enter your custom password, and enter it again.
Passwords must be from 8 to 20 characters, use only visible lower ASCII characters (no spaces), and must contain at least one uppercase alphabetic character (A - Z), at least one lowercase alphabetic character (a - z), at least one digit (0 - 9), and at least one of the following special characters: _, @, #, $, %, ^, &, *
Your custom password is assigned to all the nodes in the cluster.
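The password rules in step 3 can be expressed as a short check. The following Python sketch is illustrative and is not the validation that Serengeti performs. The function name is hypothetical, and the sketch assumes that "visible lower ASCII characters" means code points 33 through 126.

```python
SPECIAL_CHARACTERS = set("_@#$%^&*")


def is_valid_password(password):
    """Apply the length, character set, and character class rules."""
    if not 8 <= len(password) <= 20:
        return False
    # Assumption: visible lower ASCII means code points 33-126 (no space).
    if any(not 33 <= ord(ch) <= 126 for ch in password):
        return False
    return (
        any(ch.isupper() for ch in password)
        and any(ch.islower() for ch in password)
        and any(ch.isdigit() for ch in password)
        and any(ch in SPECIAL_CHARACTERS for ch in password)
    )


print(is_valid_password("Passw0rd_"))
print(is_valid_password("password"))
```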

Create a Cluster with an Available Distribution with the Serengeti Command-Line Interface

You can choose which Hadoop distribution to use when you deploy a cluster. If you do not specify a Hadoop distribution, the resulting cluster is created with the default distribution, Apache Bigtop.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command, and include the --distro parameter.
The --distro parameter’s value must match a distribution name displayed by the distro list command.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 or later cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce network traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail or the cluster is created but does not function.
This example deploys a cluster with the Cloudera CDH distribution:
cluster create --name clusterName --distro cdh
This example creates a customized cluster named mycdh that uses the CDH5 Hadoop distribution, and is configured according to the /opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json sample cluster specification file. In this sample file, nameservice0 and nameservice1 are federated. That is,
nameservice0 and nameservice1 are independent and do not require coordination with each other. The
NameNode nodes in the nameservice0 node group are HDFS2 HA enabled. In Serengeti, name node group names are used as service names for HDFS2.
cluster create --name mycdh --distro cdh5 --specFile /opt/serengeti/samples/default_cdh5_ha_hadoop_cluster.json

Create a Cluster with Multiple Networks with the Serengeti Command-Line Interface

When you create a cluster, you can distribute the management, HDFS, and MapReduce traffic to separate networks. You might want to use separate networks to improve performance or to isolate traffic for security reasons.
For optimal performance, use the same network for HDFS and MapReduce traffic in Hadoop and Hadoop+HBase clusters. HBase clusters use the HDFS network for traffic related to the HBase Master and HBase RegionServer services.
IMPORTANT You cannot configure multiple networks for clusters that use the MapR Hadoop distribution.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command and include the --networkName, --hdfsNetworkName, and --mapredNetworkName parameters.
cluster create --name cluster_name --networkName management_network [--hdfsNetworkName hdfs_network] [--mapredNetworkName mapred_network]
If you omit an optional network parameter, the traffic associated with that network parameter is routed on the management network that you specify by the --networkName parameter.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 or later cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce network traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail or the cluster is created but does not function.
The cluster's management, HDFS, and MapReduce traffic is distributed among the specified networks.
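The fallback behavior for omitted network parameters can be summarized in a few lines. The following Python sketch models that behavior and builds the matching option string. Both function names are hypothetical, and the sketch does not call the Serengeti CLI.

```python
def effective_networks(management, hdfs=None, mapred=None):
    """Show which network carries each traffic type.

    Traffic for an omitted optional parameter is routed on the
    management network, as described above.
    """
    return {
        "management": management,
        "hdfs": hdfs or management,
        "mapred": mapred or management,
    }


def network_options(management, hdfs=None, mapred=None):
    """Build the network portion of a cluster create command line."""
    options = ["--networkName", management]
    if hdfs:
        options += ["--hdfsNetworkName", hdfs]
    if mapred:
        options += ["--mapredNetworkName", mapred]
    return " ".join(options)


# mgmtNW and hdfsNW are example network names.
print(effective_networks("mgmtNW", hdfs="hdfsNW"))
print(network_options("mgmtNW", hdfs="hdfsNW"))
```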

Create a Cluster with Assigned Resources with the Serengeti Command-Line Interface

By default, when you use Serengeti to deploy a Hadoop cluster, the cluster might contain any or all available resources: vCenter Server resource pool for the virtual machine's CPU and memory, datastores for the virtual machine's storage, and a network. You can assign which resources the cluster uses by specifying specific resource pools, datastores, and/or a network when you create the Hadoop cluster.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command, and specify any or all of the command’s resource parameters.
This example deploys a cluster named myHadoop on the myDS datastore, under the myRP resource pool, and uses the myNW network for virtual machine communications.
cluster create --name myHadoop --rpNames myRP --dsNames myDS --networkName myNW

Create a Cluster with Any Number of Master, Worker, and Client Nodes

You can create a Hadoop cluster with any number of master, worker, and client nodes.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics, including the node groups.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 or later cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce network traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail or the cluster is created but does not function.
In this example, the cluster has one master MEDIUM size virtual machine, five worker SMALL size virtual machines, and one client SMALL size virtual machine. The instanceNum attribute configures the number of virtual machines in a node group.
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": [ "hadoop_namenode", "hadoop_jobtracker" ],
      "instanceNum": 1,
      "instanceType": "MEDIUM"
    },
    {
      "name": "worker",
      "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
      "instanceNum": 5,
      "instanceType": "SMALL"
    },
    {
      "name": "client",
      "roles": [ "hadoop_client", "hive", "hive_server", "pig" ],
      "instanceNum": 1,
      "instanceType": "SMALL"
    }
  ]
}
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the cluster specification file.
cluster create --name cluster_name --specFile directory_path/spec_filename
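If you create several clusters that differ only in node counts, you can generate the specification instead of editing it by hand. The following Python sketch produces the same structure as the example in step 1. The function name build_spec is hypothetical, and the roles and instance types are copied from that example.

```python
import json


def build_spec(workers, clients=1, masters=1):
    """Return a specification dict with the requested instanceNum values."""
    return {
        "nodeGroups": [
            {
                "name": "master",
                "roles": ["hadoop_namenode", "hadoop_jobtracker"],
                "instanceNum": masters,
                "instanceType": "MEDIUM",
            },
            {
                "name": "worker",
                "roles": ["hadoop_datanode", "hadoop_tasktracker"],
                "instanceNum": workers,
                "instanceType": "SMALL",
            },
            {
                "name": "client",
                "roles": ["hadoop_client", "hive", "hive_server", "pig"],
                "instanceNum": clients,
                "instanceType": "SMALL",
            },
        ]
    }


# Print the specification as JSON; save the output to a file and pass
# that file to the --specFile parameter.
print(json.dumps(build_spec(workers=5), indent=2))
```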

Create a Customized Hadoop or HBase Cluster with the Serengeti Command-Line Interface

You can create clusters that are customized for your requirements, including the number of nodes, virtual machine RAM and disk size, the number of CPUs, and so on.
The Serengeti package includes several annotated sample cluster specification files that you can use as models when you create your custom specification files.
In the Serengeti Management Server, the sample cluster specification files are in /opt/serengeti/samples.
If you use the Serengeti Remote CLI client, the sample specification files are in the client directory.
Changing a node group role might cause the cluster creation process to fail. For example, workable clusters require a NameNode, so if there are no NameNode nodes after you change node group roles, you cannot create a cluster.
Prerequisites
Deploy the Serengeti vApp.
Ensure that you have adequate resources allocated to run the Hadoop cluster.
To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics such as the node groups.
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the cluster specification file.
Use the full path to specify the file.
cluster create --name cluster_name --specFile full_path/spec_filename
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, or Pivotal PHD
1.1 or later cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce network traffic. If the DNS server cannot provide valid forward and reverse FQDN/IP resolution, the cluster creation process might fail or the cluster is created but does not function.

Chapter 6 Managing Hadoop and HBase Clusters

You can use the vSphere Web Client to start and stop your big data cluster and modify the cluster configuration. You can also manage a cluster using the Serengeti Command-Line Interface.
CAUTION Do not use vSphere management functions such as migrating cluster nodes to other hosts for clusters that you create with Big Data Extensions. Performing such management functions outside of the Big Data Extensions environment can make it impossible for you to perform some Big Data Extensions operations, such as disk failure recovery.
This chapter includes the following topics:
Stop and Start a Cluster with the Serengeti Command-Line Interface
Scale Out a Cluster with the Serengeti Command-Line Interface
Scale CPU and RAM with the Serengeti Command-Line Interface
Reconfigure a Cluster with the Serengeti Command-Line Interface
Delete a Cluster by Using the Serengeti Command-Line Interface
About vSphere High Availability and vSphere Fault Tolerance
Reconfigure a Node Group with the Serengeti Command-Line Interface
Expanding a Cluster with the Command-Line Interface
Recover from Disk Failure with the Serengeti Command-Line Interface Client
Recover a Cluster Node Virtual Machine
Enter Maintenance Mode to Perform Backup and Restore with the Serengeti Command-Line Interface Client

Stop and Start a Cluster with the Serengeti Command-Line Interface

You can stop a currently running cluster and start a stopped cluster from the Serengeti CLI. When you start or stop a cluster through Cloudera Manager or Ambari, only the services are started or stopped. However, when you start or stop a cluster through Big Data Extensions, not only the services, but also the virtual machines are started or stopped.
Prerequisites
Verify that the cluster is provisioned.
Verify that enough resources, especially CPU and memory, are available to start the virtual machines in the Hadoop cluster.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster stop command.
cluster stop --name name_of_cluster_to_stop
3 Run the cluster start command.
cluster start --name name_of_cluster_to_start

Scale Out a Cluster with the Serengeti Command-Line Interface

You specify the number of nodes in the cluster when you create Hadoop and HBase clusters. You can later scale out the cluster by increasing the number of worker nodes and client nodes.
IMPORTANT Even if you changed the user password on the nodes of a cluster, the changed password is not used for the new nodes that are created when you scale out a cluster. If you set the initial administrator password for the cluster when you created the cluster, that initial administrator password is used for the new nodes. If you did not set the initial administrator password for the cluster when you created the cluster, new random passwords are used for the new nodes.
Prerequisites
Ensure that the cluster is started.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster resize command.
For node_type, specify worker or client. For the instanceNum parameter’s num_nodes value, use any number that is larger than the current number of node_type instances.
cluster resize --name name_of_cluster_to_resize --nodeGroup node_type --instanceNum num_nodes
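The two constraints in step 2 can be checked before the command is issued. The following Python sketch builds the cluster resize command only when the node type and the new node count are acceptable. The function name resize_command is hypothetical, and the current node count is assumed to be known to the caller.

```python
def resize_command(cluster_name, node_type, current_nodes, num_nodes):
    """Return a scale-out command string, or raise ValueError."""
    if node_type not in ("worker", "client"):
        raise ValueError("node_type must be worker or client")
    if num_nodes <= current_nodes:
        raise ValueError("num_nodes must be larger than the current count")
    return (
        f"cluster resize --name {cluster_name} "
        f"--nodeGroup {node_type} --instanceNum {num_nodes}"
    )


# myHadoop is an example cluster name with three existing worker nodes.
print(resize_command("myHadoop", "worker", current_nodes=3, num_nodes=5))
```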

Scale CPU and RAM with the Serengeti Command-Line Interface

You can increase or decrease the compute capacity and RAM of a cluster to prevent memory resource contention of running jobs.
Serengeti lets you adjust compute and memory resources without increasing the workload on the master node. If increasing or decreasing the CPU of a cluster is unsuccessful for a node, which is commonly due to insufficient resources being available, the node is returned to its original CPU setting. If increasing or decreasing the RAM of a cluster is unsuccessful for a node, which is commonly due to insufficient resources, the swap disk retains its new setting anyway. The disk is not returned to its original memory setting.
Although all node types support CPU and RAM scaling, do not scale the master node of a cluster because Serengeti powers down the virtual machine during the scaling process.
The maximum CPU and RAM settings depend on the version of the virtual machine.
Table 6-1. Maximum CPU and RAM Settings
Virtual Machine Version    Maximum Number of CPUs    Maximum RAM, in GB
7                          8                         255
8                          32                        1011
9                          64                        1011
10                         64                        1011
Prerequisites
Start the cluster if it is not running.
Procedure
1 Access the Serengeti Command-Line Interface.
2 Run the cluster resize command to change the number of CPUs or the amount of RAM of a cluster.
Node types are either worker or client.
Specify one or both scaling parameters: --cpuNumPerNode or --memCapacityMbPerNode.
cluster resize --name cluster_name --nodeGroup node_type [--cpuNumPerNode vCPUs_per_node] [--memCapacityMbPerNode memory_per_node]
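The maximum CPU and RAM settings listed for each virtual machine version can be used to check the requested values in advance. The following Python sketch is illustrative only. The function name within_limits is hypothetical, and the sketch assumes that 1 GB equals 1024 MB when comparing the --memCapacityMbPerNode value against the RAM limit.

```python
# Virtual machine version -> (maximum number of CPUs, maximum RAM in GB)
MAX_SETTINGS = {
    7: (8, 255),
    8: (32, 1011),
    9: (64, 1011),
    10: (64, 1011),
}


def within_limits(vm_version, cpus=None, mem_mb=None):
    """Return True if the requested values fit the limits for vm_version.

    Raises KeyError for a virtual machine version that is not listed.
    """
    max_cpus, max_ram_gb = MAX_SETTINGS[vm_version]
    if cpus is not None and cpus > max_cpus:
        return False
    # Assumption: the GB limit is converted to MB with a factor of 1024.
    if mem_mb is not None and mem_mb > max_ram_gb * 1024:
        return False
    return True


print(within_limits(7, cpus=8))
print(within_limits(7, cpus=9))
```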

Reconfigure a Cluster with the Serengeti Command-Line Interface

You can reconfigure any big data cluster that you create with Big Data Extensions.
The cluster configuration is specified by attributes in Hadoop distribution XML configuration files such as:
core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, yarn-env.sh, yarn-site.xml, and hadoop-metrics.properties.
For details about the Serengeti JSON-formatted configuration file and associated attributes in Hadoop distribution files see Chapter 8, “Cluster Specification Reference,” on page 83.
NOTE Always use the cluster config command to change the parameters specified by the configuration files. If you manually modify these files, your changes will be erased if the virtual machine is rebooted, or you use the cluster config, cluster start, cluster stop, or cluster resize commands.
Procedure
1 Use the cluster export command to export the cluster specification file for the cluster that you want to
reconfigure.
cluster export --name cluster_name --specFile file_path/cluster_spec_file_name
Option                    Description
cluster_name              Name of the cluster that you want to reconfigure.
file_path                 The file system path to which to export the specification file.
cluster_spec_file_name    The name with which to label the exported cluster specification file.
2 Edit the configuration information located near the end of the exported cluster specification file.
If you are modeling your configuration file on existing Hadoop XML configuration files, use the
convert-hadoop-conf.rb conversion tool to convert Hadoop XML configuration files to the required
JSON format.
…
"configuration": {
  "hadoop": {
    "core-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
      // note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a sample:
      // "io.file.buffer.size": "4096"
    },
    "hdfs-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
    },
    "mapred-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
    },
    "hadoop-env.sh": {
      // "HADOOP_HEAPSIZE": "",
      // "HADOOP_NAMENODE_OPTS": "",
      // "HADOOP_DATANODE_OPTS": "",
      // "HADOOP_SECONDARYNAMENODE_OPTS": "",
      // "HADOOP_JOBTRACKER_OPTS": "",
      // "HADOOP_TASKTRACKER_OPTS": "",
      // "HADOOP_CLASSPATH": "",
      // "JAVA_HOME": "",
      // "PATH": "",
    },
    "log4j.properties": {
      // "hadoop.root.logger": "DEBUG, DRFA ",
      // "hadoop.security.logger": "DEBUG, DRFA ",
    },
    "fair-scheduler.xml": {
      // check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
      // "text": "the full content of fair-scheduler.xml in one line"
    },
    "capacity-scheduler.xml": {
      // check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
    }
  }
}
…
3 (Optional) If the JAR files of your Hadoop distribution are not in the $HADOOP_HOME/lib directory, add
the full path of the JAR file in $HADOOP_CLASSPATH to the cluster specification file.
This action lets the Hadoop daemons locate the distribution JAR files.
For example, the Cloudera CDH3 Hadoop Fair Scheduler JAR files are in /usr/lib/hadoop/contrib/fairscheduler/. Add the following to the cluster specification file to enable Hadoop to use the JAR files.
…
"configuration": {
  "hadoop": {
    "hadoop-env.sh": {
      "HADOOP_CLASSPATH": "/usr/lib/hadoop/contrib/fairscheduler/*:$HADOOP_CLASSPATH"
    },
    "mapred-site.xml": {
      "mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler"
      …
    },
    "fair-scheduler.xml": {
      …
    }
  }
}
…
4 Access the Serengeti CLI.
5 Run the cluster config command to apply the new Hadoop configuration.
cluster config --name cluster_name --specFile file_path/cluster_spec_file_name
6 (Optional) Reset an existing configuration attribute to its default value.
a Remove the attribute from the configuration section of the cluster configuration file or comment out the attribute using double forward slashes (//).
b Re-run the cluster config command.
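The edit in step 2 and the reset in step 6 can be sketched in Python on a specification that has already been loaded into a dictionary. This is only an illustration under stated assumptions: the helper names set_attribute and reset_attribute are hypothetical and are not part of Serengeti, and the sketch assumes the "configuration" / "hadoop" / file name nesting shown in the sample above. Values are stored as strings because the specification file requires every value to be enclosed in double quotes.

```python
def _as_quoted(value):
    """Convert a Python value to the string form used in the spec file."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def set_attribute(spec, file_name, key, value):
    """Set one Hadoop attribute under configuration/hadoop/<file_name>.

    Hypothetical helper: missing sections are created on demand.
    """
    hadoop = spec.setdefault("configuration", {}).setdefault("hadoop", {})
    hadoop.setdefault(file_name, {})[key] = _as_quoted(value)
    return spec


def reset_attribute(spec, file_name, key):
    """Remove an attribute so that it is absent from the spec (step 6a)."""
    hadoop = spec.get("configuration", {}).get("hadoop", {})
    hadoop.get(file_name, {}).pop(key, None)
    return spec
```

After either change, the specification still has to be written back to the file and applied with the cluster config command, as described in the procedure.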

Delete a Cluster by Using the Serengeti Command-Line Interface

You can delete a cluster that you no longer need, regardless of whether it is running. When a cluster is deleted, all its virtual machines and resource pools are destroyed.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster delete command.
cluster delete --name cluster_name

About vSphere High Availability and vSphere Fault Tolerance

The Serengeti Management Server leverages vSphere HA to protect the Hadoop master node virtual machine, which can be monitored by vSphere.
When a Hadoop NameNode or JobTracker service stops unexpectedly, vSphere restarts the Hadoop virtual machine on another host, reducing unplanned downtime. If vSphere Fault Tolerance is configured and the master node virtual machine stops unexpectedly because of host failover or loss of network connectivity, the secondary node is used, without downtime.

Reconfigure a Node Group with the Serengeti Command-Line Interface

You can reconfigure node groups by modifying node group configuration data in the associated cluster specification file. When you configure a node group, its configuration overrides any cluster level configuration of the same name.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster export command to export the cluster’s cluster specification file.
cluster export --name cluster_name --specFile path_name/spec_file_name
3 In the specification file, modify the node group’s configuration section with the same content as a
cluster-level configuration.
4 Add the customized Hadoop configuration for the node group that you want to reconfigure.
5 Run the cluster config command to apply the new Hadoop configuration.
cluster config --name cluster_name --specFile path_name/spec_file_name
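The override rule for node groups can be pictured as a merge in which node group values win. The following Python sketch is an illustration only: the function name effective_configuration is hypothetical, and it assumes that overriding is resolved attribute by attribute within each configuration file, which this guide does not spell out.

```python
def effective_configuration(cluster_conf, group_conf):
    """Return the configuration that one node group would see.

    Both arguments map a file name (for example "core-site.xml") to a
    dict of attribute names and values. When the same attribute is
    defined at both levels, the node group value replaces the
    cluster-level value. The inputs are not modified.
    """
    merged = {name: dict(attrs) for name, attrs in cluster_conf.items()}
    for name, attrs in group_conf.items():
        merged.setdefault(name, {}).update(attrs)
    return merged
```

For example, a node group that sets only "io.file.buffer.size" keeps every other cluster-level attribute of core-site.xml unchanged.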

Expanding a Cluster with the Command-Line Interface

You can expand an existing Big Data cluster by adding additional node groups.
Procedure
1 Access the Serengeti CLI.
2 Edit the cluster specification file to include the new node groups you want to add to the cluster.
When editing the cluster specification file to expand the cluster, keep the following items in mind.
- The new, expanded node groups must not use the same names as the existing node groups in the cluster.
- Ensure you use the correct syntax when editing the cluster specification file. Each element and its configuration value must be correct, or the expansion operation will fail.
This example illustrates an updated nodeGroups configuration from the larger cluster specification file.
{
  "nodeGroups": [
    {
      "name": "master1",
      "roles": [
        "basic"
      ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 3768,
      "storage": {
        "type": "SHARED",
        "sizeGB": 10
      },
      "haFlag": "on"
    },
    {
      "name": "worker1",
      "roles": [
        "basic"
      ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 3768,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 10
      },
      "haFlag": "off"
    }
  ]
}
3 Run the cluster expand command to apply the new cluster configuration with the expanded node
groups.
cluster expand --name cluster_name --specFile path_name/spec_file_name
If the cluster expand operation fails, the cluster's status changes to PROVISION_ERROR. To recover from this condition, verify that the cluster specification file uses the correct syntax, and run the cluster expand command again.
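Because a name clash with an existing node group is one way the expansion can go wrong, it can help to check the names before running cluster expand. The Python sketch below is a hypothetical pre-check, not a Serengeti feature. It assumes you have the existing node groups (for example from an exported specification file) and the new node groups available as lists of dictionaries with a "name" key.

```python
def conflicting_group_names(existing_groups, new_groups):
    """Return the names of new node groups that clash with existing ones.

    A name also counts as a conflict when it appears more than once
    among the new node groups themselves.
    """
    existing = {group["name"] for group in existing_groups}
    seen = set()
    conflicts = []
    for group in new_groups:
        name = group["name"]
        if name in existing or name in seen:
            conflicts.append(name)
        seen.add(name)
    return conflicts
```

An empty result means that no new node group reuses a name. It does not validate the rest of the specification syntax.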
What to do next
You can verify that the node groups were added to the cluster using the cluster list command. See “View
Provisioned Clusters with the Serengeti Command-Line Interface,” on page 81.

Recover from Disk Failure with the Serengeti Command-Line Interface Client

If there is a disk failure in a cluster, and the disk does not perform management roles such as NameNode, JobTracker, ResourceManager, HMaster, or ZooKeeper, you can recover by running the Serengeti cluster
fix command.
Big Data Extensions uses a large number of inexpensive disk drives for data storage (configured as JBOD). If several disks fail, the Hadoop data node might shut down. Big Data Extensions enables you to recover from disk failures.
Serengeti supports recovery from swap and data disk failure on all supported Hadoop distributions. Disks are recovered and started in sequence to avoid the temporary loss of multiple nodes at once. A new disk matches the storage type and placement policies of the corresponding failed disk.
The MapR distribution does not support recovery from disk failure by using the cluster fix command.
IMPORTANT Even if you changed the user password on the nodes of the cluster, the changed password is not used for the new nodes that are created by the disk recovery operation. If you set the initial administrator password of the cluster when you created the cluster, that initial administrator password is used for the new nodes. If you did not set the initial administrator password of the cluster when you created the cluster, new random passwords are used for the new nodes.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster fix command.
The nodeGroup parameter is optional.
cluster fix --name cluster_name --disk [--nodeGroup nodegroup_name]

Recover a Cluster Node Virtual Machine

You can recover cluster node virtual machines that have become disassociated from their managed object identifier (MOID), or resource pool and virtual machine name.
In rare instances, the managed object identifier (MOID) of a cluster node virtual machine can change. This may occur when a host crashes and re-registers with vCenter Server. When BDE cannot locate a node virtual machine in vCenter Server by its MOID, it first tries to locate the node by its resource pool and virtual machine name. If this is not possible, you can recover the cluster node virtual machine with the cluster recover command.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster recover command to update the cluster, and recover the cluster node virtual machine.
cluster recover
What to do next
You can verify that the cluster node virtual machine was successfully recovered.

Enter Maintenance Mode to Perform Backup and Restore with the Serengeti Command-Line Interface Client

Before performing backup and restore operations, or other maintenance tasks, you must place Big Data Extensions into maintenance mode.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the default distribution, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Log in to the Serengeti Management Server.
2 Run the script /opt/serengeti/sbin/serengeti-maintenance.sh to place Big Data Extensions into
maintenance mode, or check maintenance status.
serengeti-maintenance.sh on | off | status
Option | Description
on | Turns on maintenance mode. Upon entering maintenance mode, Big Data Extensions continues executing jobs that have already been started, but will not respond to any new requests.
off | Turns off maintenance mode, and returns Big Data Extensions to its normal operating state.
status | Displays the maintenance status of Big Data Extensions. A status of safe means it is safe to back up or perform other maintenance tasks on your Big Data Extensions deployment. A status of off means maintenance mode has been turned off, and it is not safe to perform maintenance tasks such as backup and restore. A status of on means Big Data Extensions has entered maintenance mode, but it is not yet safe to perform backup and restore operations. You must wait until the system returns the safe status message.
To place your Big Data Extensions deployment into maintenance mode, run the serengeti-
maintenance.sh script with the on option.
serengeti-maintenance.sh on
3 Verify that Big Data Extensions is in maintenance mode.
When Big Data Extensions completes all jobs that have been submitted, the maintenance status changes to safe. Run the serengeti-maintenance.sh script with the status parameter repeatedly until it returns the safe system status message.
serengeti-maintenance.sh status
safe
4 Perform the necessary system maintenance tasks.
5 Once you have completed the necessary system maintenance tasks, return Big Data Extensions to its
normal operating state by manually exiting maintenance mode.
serengeti-maintenance.sh off
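The repeated status check in step 3 can be automated with a small polling loop. The Python sketch below is a minimal illustration, not a supported tool. It assumes that the status option prints exactly one of on, off, or safe on standard output. The function names, the 10 second interval, and the attempt limit are hypothetical choices.

```python
import subprocess
import time

# Script path as given in step 2 of the procedure.
SCRIPT = "/opt/serengeti/sbin/serengeti-maintenance.sh"


def read_status():
    """Run the maintenance script with the status option and return its output."""
    result = subprocess.run(
        [SCRIPT, "status"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def wait_until_safe(get_status=read_status, interval_seconds=10,
                    max_attempts=60, sleep=time.sleep):
    """Poll the maintenance status until it is "safe".

    Returns True as soon as the status is "safe", or False when
    max_attempts checks have passed without reaching it.
    """
    for attempt in range(max_attempts):
        if get_status() == "safe":
            return True
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    return False
```

The get_status and sleep parameters exist so that the loop can be exercised without the real script, for example with a function that returns canned values.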

Monitoring the Big Data Extensions Environment 7

You can monitor the status of Serengeti-deployed clusters, including their datastores, networks, and resource pools through the Serengeti Command-Line Interface. You can also view a list of available Hadoop distributions. Monitoring capabilities are also available in the vSphere Web Client.
This chapter includes the following topics:
- “View List of Application Managers by using the Serengeti Command-Line Interface,” on page 79
- “View Available Hadoop Distributions with the Serengeti Command-Line Interface,” on page 80
- “View Supported Distributions for All Application Managers by Using the Serengeti Command-Line Interface,” on page 80
- “View Configurations or Roles for Application Manager and Distribution by Using the Serengeti Command-Line Interface,” on page 80
- “View Provisioned Clusters with the Serengeti Command-Line Interface,” on page 81
- “View Datastores with the Serengeti Command-Line Interface,” on page 81
- “View Networks with the Serengeti Command-Line Interface,” on page 81
- “View Resource Pools with the Serengeti Command-Line Interface,” on page 82

View List of Application Managers by using the Serengeti Command-Line Interface

You can use the appmanager list command to list the application managers that are installed on the Big Data Extensions environment.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list
The command returns a list of all application managers that are installed on the Big Data Extensions environment.

View Available Hadoop Distributions with the Serengeti Command-Line Interface

Supported distributions are those distributions that are supported by Big Data Extensions. Available distributions are those distributions that have been added into your Big Data Extensions environment. You use the distro list command to view a list of Hadoop distributions that are available in your Serengeti deployment. When you create clusters, you can use any available Hadoop distribution.
Procedure
1 Access the Serengeti CLI.
2 Run the distro list command.
The available Hadoop distributions are listed, along with their packages.
What to do next
Before you use a distribution, verify that it includes the services that you want to deploy. If services are missing, add the appropriate packages to the distribution.

View Supported Distributions for All Application Managers by Using the Serengeti Command-Line Interface

Supported distributions are those distributions that are supported by Big Data Extensions. Available distributions are those distributions that have been added into your Big Data Extensions environment. You can view a list of the Hadoop distributions that are supported in the Big Data Extensions environment to determine if a particular distribution is available for a particular application manager.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list --name application_manager_name [--distros]
If you do not include the --name parameter, the command returns a list of all the Hadoop distributions that are supported on each of the application managers in the Big Data Extensions environment.
The command returns a list of all distributions that are supported for the application manager of the name that you specify.

View Configurations or Roles for Application Manager and Distribution by Using the Serengeti Command-Line Interface

You can use the appmanager list command to list the Hadoop configurations or roles for a specific application manager and distribution.
The configuration list includes those configurations that you can use to configure the cluster in the cluster specifications.
The role list contains the roles that you can use to create a cluster. You should not use unsupported roles to create clusters in the application manager.
Prerequisites
Verify that you are connected to an application manager.
Procedure
1 Access the Serengeti CLI.
2 Run the appmanager list command.
appmanager list --name application_manager_name [--distro distro_name (--configurations | --roles) ]
The command returns a list of the Hadoop configurations or roles for a specific application manager and distribution.

View Provisioned Clusters with the Serengeti Command-Line Interface

From the Serengeti CLI, you can list the provisioned clusters that are in the Serengeti deployment.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster list command.
cluster list
This example displays a specific cluster by including the --name parameter.
cluster list --name cluster_name
This example displays detailed information about a specific cluster by including the --name and --detail parameters.
cluster list --name cluster_name --detail

View Datastores with the Serengeti Command-Line Interface

From the Serengeti CLI, you can see the datastores that are in the Serengeti deployment.
Procedure
1 Access the Serengeti CLI.
2 Run the datastore list command.
This example displays detailed information by including the --detail parameter.
datastore list --detail
This example displays detailed information about a specific datastore by including the --name and --detail parameters.
datastore list --name datastore_name --detail

View Networks with the Serengeti Command-Line Interface

From the Serengeti CLI, you can see the networks that are in the Serengeti deployment.
Procedure
1 Access the Serengeti CLI.
2 Run the network list command.
This example displays detailed information by including the --detail parameter.
network list --detail
This example displays detailed information about a specific network by including the --name and --detail parameters.
network list --name network_name --detail

View Resource Pools with the Serengeti Command-Line Interface

From the Serengeti CLI, you can see the resource pools that are in the Serengeti deployment.
Procedure
1 Access the Serengeti CLI.
2 Run the resourcepool list command.
This example displays detailed information by including the --detail parameter.
resourcepool list --detail
This example displays detailed information about a specific resource pool by including the --name and --detail parameters.
resourcepool list --name resourcepool_name --detail

Cluster Specification Reference 8

To customize your clusters, you must know how to use Serengeti cluster specification files and define the cluster requirements with the various attributes and objects. After you create your configuration files you can convert them to JSON file format.
This chapter includes the following topics:
- “Cluster Specification File Requirements,” on page 83
- “Cluster Definition Requirements,” on page 83
- “Annotated Cluster Specification File,” on page 84
- “Cluster Specification Attribute Definitions,” on page 87
- “White Listed and Black Listed Hadoop Attributes,” on page 90
- “Convert Hadoop XML Files to Serengeti JSON Files,” on page 92

Cluster Specification File Requirements

A cluster specification file is a text file with the configuration attributes provided in a JSON-like formatted structure. Cluster specification files must adhere to requirements concerning syntax, quotation mark usage, and comments.
- To parse cluster specification files, Serengeti uses the Jackson JSON Processor. For syntax requirements, such as the truncation policy for float types, see the Jackson JSON Processor Wiki.
- Always enclose numeric values in quotation marks. For example:
  "mapred.tasktracker.reduce.tasks.maximum" : "2"
  The quotation marks ensure that integers are correctly interpreted instead of being converted to double-precision floating point, which can cause unintended consequences.
- You can include only single-line comments using the pound sign (#) to identify the comment.
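The quotation mark requirement can be checked mechanically once the configuration section has been loaded into a dictionary. The following Python sketch is a hypothetical helper, not part of Serengeti. It reports every value in a nested configuration that is not a string, which is how an unquoted number or boolean shows up after parsing.

```python
def find_unquoted_values(configuration, path=()):
    """Return (location, value) pairs for every non-string leaf value.

    The location joins the nested keys with "/", because attribute
    names and file names themselves contain dots.
    """
    problems = []
    for key, value in configuration.items():
        current = path + (key,)
        if isinstance(value, dict):
            problems.extend(find_unquoted_values(value, current))
        elif not isinstance(value, str):
            problems.append(("/".join(current), value))
    return problems
```

For the example attribute above, a value of 2 would be reported, while "2" would pass.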

Cluster Definition Requirements

Cluster specification files contain configuration definitions for clusters, such as their roles and node groups. Cluster definitions must adhere to requirements concerning node group roles, cluster roles, and instance numbers.
A cluster definition has the following requirements:
- Node group roles cannot be empty. You can determine the valid role names for your Hadoop distribution by using the distro list command.
- The hadoop_namenode and hadoop_jobtracker roles must be configured in a single node group.
  - In Hadoop 2.0 clusters, such as CDH4 or Pivotal HD, the instance number can be greater than 1 to create an HDFS HA or Federation cluster.
  - Otherwise, the total instance number must be 1.
- Node group instance numbers must be positive numbers.
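Some of these requirements can be checked before you submit a specification. The Python sketch below is a hypothetical validator, not the check that Serengeti itself performs. It covers only three rules: roles must not be empty, instanceNum must be a positive integer, and hadoop_namenode and hadoop_jobtracker must not be spread across different node groups. The last rule reflects one reading of "configured in a single node group" and is an assumption. The instance number rules for Hadoop 2.0 clusters are not modeled.

```python
MASTER_ROLES = {"hadoop_namenode", "hadoop_jobtracker"}


def validate_cluster_definition(spec):
    """Return a list of human-readable problems found in the spec."""
    errors = []
    groups_with_master_roles = []
    for group in spec.get("nodeGroups", []):
        name = group.get("name", "<unnamed>")
        roles = group.get("roles") or []
        if not roles:
            errors.append(f"{name}: roles cannot be empty")
        instance_num = group.get("instanceNum")
        is_int = isinstance(instance_num, int) and not isinstance(instance_num, bool)
        if not is_int or instance_num <= 0:
            errors.append(f"{name}: instanceNum must be a positive number")
        if MASTER_ROLES & set(roles):
            groups_with_master_roles.append(name)
    if len(groups_with_master_roles) > 1:
        errors.append(
            "hadoop_namenode and hadoop_jobtracker must be in a single node group, found in: "
            + ", ".join(groups_with_master_roles)
        )
    return errors
```

An empty list means that none of the three modeled rules is violated.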

Annotated Cluster Specification File

The Serengeti cluster specification file defines the different Hadoop and HBase nodes and their resources for use by your Big Data cluster. You can use this annotated cluster specification file, and the sample files in /opt/serengeti/samples, as models to emulate when you create your Big Data clusters.
The following code is a typical cluster specification file. For code annotations, see Table 8-1.
 1 {
 2   "nodeGroups" : [
 3     {
 4       "name": "master",
 5       "roles": [
 6         "hadoop_namenode",
 7         "hadoop_resourcemanager"
 8       ],
 9       "instanceNum": 1,
10       "instanceType": "LARGE",
11       "cpuNum": 2,
12       "memCapacityMB":4096,
13       "storage": {
14         "type": "SHARED",
15         "sizeGB": 20
16       },
17       "haFlag":"on",
18       "rpNames": [
19         "rp1"
20       ]
21     },
22     {
23       "name": "data",
24       "roles": [
25         "hadoop_datanode"
26       ],
27       "instanceNum": 3,
28       "instanceType": "MEDIUM",
29       "cpuNum": 2,
30       "memCapacityMB":2048,
31       "storage": {
32         "type": "LOCAL",
33         "sizeGB": 50,
34         "dsNames4Data": ["DSLOCALSSD"],
35         "dsNames4System": ["DSNDFS"]
36       },
37       "placementPolicies": {
38         "instancePerHost": 1,
39         "groupRacks": {
40           "type": "ROUNDROBIN",
41           "racks": ["rack1", "rack2", "rack3"]
42         }
43       }
44     },
45     {
46       "name": "compute",
47       "roles": [
48         "hadoop_nodemanager"
49       ],
50       "instanceNum": 6,
51       "instanceType": "SMALL",
52       "cpuNum": 2,
53       "memCapacityMB":2048,
54       "storage": {
55         "type": "LOCAL",
56         "sizeGB": 10
57       },
58       "placementPolicies": {
59         "instancePerHost": 2,
60         "groupAssociations": [{
61           "reference": "data",
62           "type": "STRICT"
63         }]
64       }
65     },
66     {
67       "name": "client",
68       "roles": [
69         "hadoop_client",
70         "hive",
71         "hive_server",
72         "pig"
73       ],
74       "instanceNum": 1,
75       "instanceType": "SMALL",
76       "memCapacityMB": 2048,
77       "storage": {
78         "type": "LOCAL",
79         "sizeGB": 10,
80         "dsNames": ["ds1", "ds2"]
81       }
82     }
83   ],
84   "configuration": {
85   }
86 }
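The instancePerHost placement policy in the sample above implies a minimum number of ESXi hosts for a node group. The following Python sketch works through that arithmetic for the data and compute node groups. The function name minimum_hosts is hypothetical, and the sketch assumes that instancePerHost is the number of instances of the node group placed on each host.

```python
import math


def minimum_hosts(instance_num, instance_per_host):
    """Return the smallest number of hosts that can hold all instances."""
    if instance_num <= 0 or instance_per_host <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(instance_num / instance_per_host)


# Data node group: 3 instances, 1 per host.
data_hosts = minimum_hosts(3, 1)
# Compute node group: 6 instances, 2 per host.
compute_hosts = minimum_hosts(6, 2)
```

Both node groups in the sample come out at three hosts.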
The cluster definition elements are defined in the table.
Table 8-1. Example Cluster Specification Annotation
Line(s) | Attribute | Example Value | Description
4 | name | master | Node group name.
5-8 | roles | hadoop_namenode, hadoop_resourcemanager | Node group role. hadoop_namenode and hadoop_resourcemanager are deployed to the node group's virtual machine.
9 | instanceNum | 1 | Number of instances in the node group. Only one virtual machine is created for the group. You can have multiple instances for hadoop_tasktracker, hadoop_datanode, hadoop_client, pig, and hive. For HDFS1 clusters, you can have only one instance of hadoop_namenode and hadoop_jobtracker. For HDFS2 clusters, you can have two hadoop_namenode instances. With a MapR distribution, you can configure multiple instances of hadoop_jobtracker.
10 | instanceType | LARGE | Node group instance type. Instance types are predefined virtual machine specifications, which are combinations of the number of CPUs, RAM sizes, and storage size. The predefined numbers can be overridden by the cpuNum, memCapacityMB, and storage attributes in the Serengeti server specification file.
11 | cpuNum | 2 | Number of CPUs per virtual machine. This attribute overrides the number of vCPUs in the predefined virtual machine specification.
12 | memCapacityMB | 4096 | RAM size, in MB, per virtual machine. This attribute overrides the RAM size in the predefined virtual machine specification.
13-16 | storage | See lines 14-15 for one group's storage attributes | Node group storage requirements.
14 | type | SHARED | Storage type. The node group is deployed using only shared storage.
15 | sizeGB | 20 | Storage size. Each node in the node group is deployed with 20GB available disk space.
17 | haFlag | on | HA protection for the node group. The node group is deployed with vSphere HA protection.
18-20 | rpNames | rp1 | Resource pools under which the node group virtual machines are deployed. These pools can be an array of values.
22-36 | Node group definition for the data node | | See lines 3-21, which define the same attributes for the master node. In lines 34-35, data disks are placed on dsNames4Data datastores, and system disks are placed on dsNames4System datastores.
37-44 | placementPolicies | See code sample | Data node group's placement policy constraints. You need at least three ESXi hosts because there are three instances and a requirement that each instance be on its own host. This group is provisioned on hosts on rack1, rack2, and rack3 by using a ROUNDROBIN algorithm.
45-57 | Node group definition for the compute node | | See lines 4-16, which define the same attributes for the master node.
58-65 | placementPolicies | See code sample | Compute node group's placement policy constraints. You need at least three ESXi hosts to meet the instance requirements. The compute node group references a data node group through STRICT typing. The two compute instances use a data instance on the ESXi host. The STRICT association provides better performance.
66-82 | Node group definition for the client node | | See previous node group definitions.
83-86 | configuration | Empty in the code sample | Hadoop configuration customization.

Cluster Specification Attribute Definitions

Cluster definitions include attributes for the cluster itself and for each of the node groups.
Cluster Specification Outer Attributes
Cluster specification outer attributes apply to the cluster as a whole.
Table 8-2. Cluster Specification Outer Attributes
Attribute | Type | Mandatory/Optional | Description
nodeGroups | object | Mandatory | One or more group specifications. See Table 8-3.
configuration | object | Optional | Customizable Hadoop configuration key/value pairs.
externalHDFS | string | Optional | Valid only for compute-only clusters. URI of external HDFS.
Cluster Specification Node Group Objects and Attributes
Node group objects and attributes apply to one node group in a cluster.
Table 8-3. Cluster Specification’s Node Group Objects and Attributes
Attribute | Type | Mandatory/Optional | Description
name | string | Mandatory | User defined node group name.
roles | list of string | Mandatory | List of software packages or services to install on the virtual machine. Values must match the roles displayed by the distro list command.
instanceNum | integer | Mandatory | Number of virtual machines in the node group. Must be a positive integer. Generally, you can have multiple instances for hadoop_tasktracker, hadoop_datanode, hadoop_client, pig, and hive. For HDFS1 clusters, you can have only one instance of hadoop_namenode and hadoop_jobtracker. For HDFS2 clusters, you can have two hadoop_namenode instances. With a MapR distribution, you can configure multiple instances of hadoop_jobtracker.
instanceType | string | Optional | Size of virtual machines in the node group, expressed as the name of a predefined virtual machine template: SMALL, MEDIUM, LARGE, or EXTRA_LARGE. See Table 8-4. If you specify cpuNum, memCapacityMB, or sizeGB attributes, they override the corresponding value of your selected virtual machine template for the applicable node group.
cpuNum | integer | Optional | Number of CPUs per virtual machine. If the haFlag value is FT, the cpuNum value must be 1.
memCapacityMB | integer | Optional | RAM size, in MB, per virtual machine. NOTE When using MapR 3.1, you must specify a minimum of 5120 MB of memory capacity for the zookeeper, worker, and client nodes.
swapRatio | float | Optional | Defines the ratio of OS swap disk size to memory size. For example, if memory is 4GB (4096MB) and the swapRatio is 1, the swap disk will be 4GB. If you specify a swapRatio of 2, the swap disk will be 8GB. You can also specify a float for the swapRatio attribute. Specifying a value of 0.5 with 4GB of memory creates a swap disk of 2GB.
latencySensitivity | string | Optional | You can specify LOW, NORMAL, MEDIUM, or HIGH, which defines the virtual machine’s latency sensitivity setting within vCenter Server to optimize cluster performance. When deploying an HBase cluster you can optimize HBase performance by setting latencySensitivity to HIGH. You must then also set the reservedMemRatio parameter (see below) to 1.
reservedMemRatio | integer | Optional | You can specify 0 or 1 to define the ratio of reserved memory. When deploying an HBase cluster you can optimize HBase performance by setting this parameter to 1. You must also set the latencySensitivity parameter (see above) to HIGH.
reservedCpuRatio | integer | Optional | You can specify 0 or 1 to define the ratio of reserved CPU.
storage | object | Optional | Storage settings.
type | string | Optional | Storage type: LOCAL for local storage, SHARED for shared storage.
sizeGB | integer | Optional | Data storage size. Must be a positive integer.
diskNum | integer | Optional | Specifies the number of disks to use for each node group.
dsNames | list of string | Optional | Array of datastores the node group can use.
dsNames4Data | list of string | Optional | Array of datastores the data node group can use.
dsNames4System | list of string | Optional | Array of datastores the system can use.
rpNames | list of string | Optional | Array of resource pools the node group can use.
haFlag | string | Optional | By default, NameNode and JobTracker nodes are protected by vSphere HA. on: Protect the node with vSphere HA. ft: Protect the node with vSphere FT. off: Do not use vSphere HA or vSphere FT.
placementPolicies | object | Optional | Up to three optional constraints: instancePerHost, groupRacks, and groupAssociations.
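Two of the node group rules in Table 8-3 lend themselves to a worked example: the swap disk size that follows from swapRatio, and the attribute combinations that the table says must go together. The Python sketch below is illustrative only. The function names are hypothetical, and the checks cover just the two constraints quoted in the table (cpuNum with an FT haFlag, and reservedMemRatio with a HIGH latencySensitivity).

```python
def swap_disk_mb(mem_capacity_mb, swap_ratio):
    """Swap disk size in MB, computed as memory size times swapRatio."""
    return mem_capacity_mb * swap_ratio


def check_node_group(group):
    """Return a list of problems for the two constraints modeled here."""
    problems = []
    ha_flag = str(group.get("haFlag", "")).lower()
    # Only checked when cpuNum is given explicitly in the node group.
    if ha_flag == "ft" and "cpuNum" in group and group["cpuNum"] != 1:
        problems.append("cpuNum must be 1 when haFlag is ft")
    if group.get("latencySensitivity") == "HIGH" and group.get("reservedMemRatio") != 1:
        problems.append("reservedMemRatio must be 1 when latencySensitivity is HIGH")
    return problems
```

With 4096 MB of memory, a swapRatio of 1, 2, and 0.5 gives 4096 MB, 8192 MB, and 2048 MB of swap, matching the swapRatio examples in the table.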
Serengeti Predefined Virtual Machine Sizes
Serengeti provides predefined virtual machine sizes to use for defining the size of virtual machines in a cluster node group.
Table 8-4. Serengeti Predefined Virtual Machine Sizes
 | SMALL | MEDIUM | LARGE | EXTRA_LARGE
Number of CPUs per virtual machine | 1 | 2 | 4 | 8
RAM, in GB | 3.75 | 7.5 | 15 | 30
Hadoop master data disk size, in GB | 25 | 50 | 100 | 200
Hadoop worker data disk size, in GB | 50 | 100 | 200 | 400
Hadoop client data disk size, in GB | 50 | 100 | 200 | 400
Zookeeper data disk size, in GB | 20 | 40 | 80 | 120
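The way explicit attributes override a predefined size can be sketched as a lookup with fallbacks. The Python example below is an illustration under stated assumptions, not Serengeti code. It uses the CPU and RAM values from the predefined sizes table, assumes 1024 MB per GB when converting RAM to memCapacityMB, and uses the hypothetical function name effective_resources. Disk sizes are left out because they depend on the node role.

```python
# CPU count and RAM (in GB) for each predefined virtual machine size.
PREDEFINED_SIZES = {
    "SMALL": {"cpuNum": 1, "ramGB": 3.75},
    "MEDIUM": {"cpuNum": 2, "ramGB": 7.5},
    "LARGE": {"cpuNum": 4, "ramGB": 15},
    "EXTRA_LARGE": {"cpuNum": 8, "ramGB": 30},
}


def effective_resources(group):
    """Return the CPU count and memory that a node group would use.

    Explicit cpuNum and memCapacityMB values in the node group take
    precedence over the values of its instanceType.
    """
    template = PREDEFINED_SIZES[group["instanceType"]]
    return {
        "cpuNum": group.get("cpuNum", template["cpuNum"]),
        "memCapacityMB": group.get("memCapacityMB", int(template["ramGB"] * 1024)),
    }
```

For a node group that specifies LARGE together with cpuNum 2 and memCapacityMB 4096, the explicit values are returned instead of the LARGE defaults.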

White Listed and Black Listed Hadoop Attributes

White listed attributes are Apache Hadoop attributes that you can configure from Serengeti with the
cluster config command. The majority of Apache Hadoop attributes are white listed. However, there are a
few black listed Apache Hadoop attributes, which you cannot configure from Serengeti.
If you use an attribute in the cluster specification file that is neither a white listed nor a black listed attribute, and then run the cluster config command, a warning appears and you must answer yes to continue or no to cancel.
If your cluster includes a NameNode or JobTracker, Serengeti configures the fs.default.name and
dfs.http.address attributes. You can override these attributes by defining them in your cluster
specification.
Table 8-5. Configuration Attribute White List
File | Attributes
core-site.xml | All core-default configuration attributes listed on the Apache Hadoop 2.x documentation Web page. For example, http://hadoop.apache.org/docs/branch_name/core-default.html. Exclude the attributes defined in the black list.
hdfs-site.xml | All hdfs-default configuration attributes listed on the Apache Hadoop 2.x documentation Web page. For example, http://hadoop.apache.org/docs/branch_name/hdfs-default.html. Exclude the attributes defined in the black list.
mapred-site.xml | All mapred-default configuration attributes listed on the Apache Hadoop 2.x documentation Web page. For example, http://hadoop.apache.org/docs/branch_name/mapred-default.html. Exclude the attributes defined in the black list.
hadoop-env.sh | JAVA_HOME, PATH, HADOOP_CLASSPATH, HADOOP_HEAPSIZE, HADOOP_NAMENODE_OPTS, HADOOP_DATANODE_OPTS, HADOOP_SECONDARYNAMENODE_OPTS, HADOOP_JOBTRACKER_OPTS, HADOOP_TASKTRACKER_OPTS, HADOOP_LOG_DIR
log4j.properties | hadoop.root.logger, hadoop.security.logger, log4j.appender.DRFA.MaxBackupIndex, log4j.appender.RFA.MaxBackupIndex, log4j.appender.RFA.MaxFileSize
fair-scheduler.xml | text. All fair_scheduler configuration attributes listed on the Apache Hadoop 2.x documentation Web page that can be used inside the text field. For example, http://hadoop.apache.org/docs/branch_name/fair_scheduler.html. Exclude the attributes defined in the black list.
capacity-scheduler.xml | All capacity_scheduler configuration attributes listed on the Apache Hadoop 2.x documentation Web page. For example, http://hadoop.apache.org/docs/branch_name/capacity_scheduler.html. Exclude the attributes defined in the black list.
mapred-queue-acls.xml | All mapred-queue-acls configuration attributes listed on the Apache Hadoop 2.x Web page. For example, http://hadoop.apache.org/docs/branch_name/cluster_setup.html#Configuring+the+Hadoop+Daemons. Exclude the attributes defined in the black list.
Table 86. Configuration Attribute Black List
File Attributes
core-site.xml net.topology.impl, net.topology.nodegroup.aware, dfs.block.replicator.classname, topology.script.file.name
hdfs-site.xml dfs.http.address, dfs.name.dir, dfs.data.dir
mapred-site.xml mapred.job.tracker, mapred.local.dir, mapred.task.cache.levels, mapred.jobtracker.jobSchedulable, mapred.jobtracker.nodegroup.aware
hadoop-env.sh HADOOP_HOME, HADOOP_COMMON_HOME, HADOOP_MAPRED_HOME, HADOOP_HDFS_HOME, HADOOP_CONF_DIR, HADOOP_PID_DIR
log4j.properties None
fair-scheduler.xml None
capacity-scheduler.xml None
mapred-queue-acls.xml None
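Before you submit a cluster specification, you might want to check it for black-listed attributes. The following Python sketch shows one way to do that and is not part of Serengeti. The function name find_blacklisted and the input layout (a dictionary that maps each configuration file name to its attributes) are assumptions made for this example. The attribute groups follow Table 86.

```python
# Black-listed attributes per configuration file, grouped as in Table 86.
BLACK_LIST = {
    "core-site.xml": {
        "net.topology.impl",
        "net.topology.nodegroup.aware",
        "dfs.block.replicator.classname",
        "topology.script.file.name",
    },
    "hdfs-site.xml": {"dfs.http.address", "dfs.name.dir", "dfs.data.dir"},
    "mapred-site.xml": {
        "mapred.job.tracker",
        "mapred.local.dir",
        "mapred.task.cache.levels",
        "mapred.jobtracker.jobSchedulable",
        "mapred.jobtracker.nodegroup.aware",
    },
    "hadoop-env.sh": {
        "HADOOP_HOME",
        "HADOOP_COMMON_HOME",
        "HADOOP_MAPRED_HOME",
        "HADOOP_HDFS_HOME",
        "HADOOP_CONF_DIR",
        "HADOOP_PID_DIR",
    },
}


def find_blacklisted(config):
    """Return sorted (file, attribute) pairs that appear in the black list.

    `config` uses a hypothetical layout: {file_name: {attribute: value}}.
    Files without a black-list entry are never reported.
    """
    hits = []
    for file_name, attributes in config.items():
        banned = BLACK_LIST.get(file_name, set())
        hits.extend((file_name, name) for name in attributes if name in banned)
    return sorted(hits)


# Hypothetical attribute values, used only to exercise the check.
example = {
    "core-site.xml": {"hadoop.tmp.dir": "/tmp/hadoop", "net.topology.impl": "x"},
    "hdfs-site.xml": {"dfs.replication": "2", "dfs.data.dir": "/data"},
}
for file_name, attribute in find_blacklisted(example):
    print(f"{file_name}: {attribute} is black-listed")
```

A check of this kind covers only the black list. It does not confirm that the remaining attributes are on the white list.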

Convert Hadoop XML Files to Serengeti JSON Files

If you defined many attributes in your Hadoop configuration files, you can convert that configuration information into the JSON format that Serengeti can use.
Procedure
1 Copy the directory $HADOOP_HOME/conf/ from your Hadoop cluster to the Serengeti Management Server.
2 Open a command shell, such as Bash or PuTTY, log in to the Serengeti Management Server, and run the convert-hadoop-conf.rb Ruby conversion script.
convert-hadoop-conf.rb path_to_hadoop_conf
The converted Hadoop configuration attributes, in JSON format, appear.
3 Open the cluster specification file for editing.
4 Replace the cluster level configuration or group level configuration items with the output that was generated by the convert-hadoop-conf.rb Ruby conversion script.
What to do next
Access the Serengeti CLI, and use the new specification file.
To apply the new configuration to a cluster, run the cluster config command. Include the --specFile parameter and its value: the new specification file.
To create a cluster with the new configuration, run the cluster create command. Include the --specFile parameter and its value: the new specification file.
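To show what such a conversion involves, the following Python sketch reads the name and value pairs from one Hadoop XML configuration document and prints them as JSON. It is an illustration only. It is not the convert-hadoop-conf.rb script, and the output of that script might differ. The function name hadoop_xml_to_dict and the sample property values are hypothetical, and nesting the result under "configuration", "hadoop", and the file name is an assumption about the specification file layout.

```python
import json
import xml.etree.ElementTree as ET


def hadoop_xml_to_dict(xml_text):
    """Collect <property> name/value pairs from one Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is None:
            continue  # skip malformed entries that have no <name> element
        properties[name.strip()] = (prop.findtext("value") or "").strip()
    return properties


# Hypothetical contents of an hdfs-site.xml file.
sample = """<configuration>
  <property><name>dfs.replication</name><value>2</value></property>
  <property><name>dfs.block.size</name><value>134217728</value></property>
</configuration>"""

# Assumed layout: attributes grouped by file name under configuration/hadoop.
spec_fragment = {
    "configuration": {"hadoop": {"hdfs-site.xml": hadoop_xml_to_dict(sample)}}
}
print(json.dumps(spec_fragment, indent=2))
```

All values are kept as strings, which matches how they appear in the XML source.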

Serengeti CLI Command Reference 9

This section provides descriptions and syntax requirements for every Serengeti CLI command.
This chapter includes the following topics:
appmanager Commands
cluster Commands
connect Command
datastore Commands
disconnect Command
distro list Command
mgmtvmcfg Commands
network Commands
resourcepool Commands
template Commands
topology Commands
usermgmt Commands

appmanager Commands

The appmanager {*} commands let you add, delete, and manage your application managers.

appmanager add Command

The appmanager add command lets you add an application manager other than the default to your environment. You can specify either the Cloudera Manager or the Ambari application manager. The appmanager add command reads the user name and password in interactive mode. If https is specified, the command prompts for the file path of the certificate.
Parameter Mandatory/Optional Description
--name application_manager_name Mandatory Application manager name
--description description Optional
--type [ClouderaManager/Ambari] Mandatory Name of the type of application manager to use, either Cloudera Manager or Ambari
--url <http[s]://server:port> Mandatory Application manager service URL, formatted as http[s]://application_manager_server_ip_or_hostname:port. The command prompts for a login user name and password.

appmanager delete Command

You can use the Serengeti CLI to delete an application manager when you no longer need it.
The application manager to delete must not contain clusters or the process fails.
appmanager delete --name application_manager_name
Parameter Mandatory or Optional Description
--name application_manager_name Mandatory Application manager name

appmanager modify Command

With the appmanager modify command, you can modify the information for an application manager. For example, you can change the manager server IP address if it is not a static IP, or you can upgrade the administrator account.
IMPORTANT Making an error when you modify an application manager can have serious consequences. For example, suppose you change a Cloudera Manager URL to the URL for a new application manager. If you created Big Data Extensions clusters with the old Cloudera Manager instance, the previous Cloudera Manager cluster cannot be managed again. In addition, the Cloudera Manager cluster is not available to the new application manager instance.
appmanager modify --name application_manager_name
Parameter Mandatory or Optional Description
--name application_manager_name Mandatory Application manager name
--url http[s]://server:port Optional Application manager service URL, formatted as http[s]://application_manager_server_ip_or_hostname:port. The command prompts for a login user name and password. You can use either http or https.
--changeAccount Optional Changes the login account and password for the application manager.
--changeCertificate Optional Changes the SSL certificate of the application manager. This parameter only applies to application managers with a URL that starts with https.
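Because an incorrect URL can have serious consequences, it can help to check the value before you pass it to the --url parameter. The following Python sketch is an illustration only and is not part of the Serengeti CLI. The function name uses_https and the sample addresses are hypothetical, and treating the port as required is an assumption based on the http[s]://server:port format.

```python
from urllib.parse import urlsplit


def uses_https(url):
    """Validate an http[s]://host:port URL and report whether it uses https.

    Raises ValueError when the scheme, host, or port is missing or invalid.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("URL must start with http:// or https://")
    # parts.port is None when no port is given; it raises ValueError itself
    # when the port is not a valid number.
    if not parts.hostname or parts.port is None:
        raise ValueError("expected the form http[s]://host:port")
    return parts.scheme == "https"


# Hypothetical application manager addresses.
for candidate in ("https://cm.example.com:7183", "http://10.0.0.5:8080"):
    if uses_https(candidate):
        print(candidate, "uses https, so the certificate parameters apply")
    else:
        print(candidate, "uses http")
```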

appmanager list Command

The appmanager list command returns a list of all available application managers including the default application manager.
Parameter Mandatory/Optional Description
--name application_manager_name Optional The application manager name.
--distro distribution_name Optional The name of a specific distribution. If you do not include the distribution_name variable, the command returns all Hadoop distributions that are supported by the application manager.
--configurations | --roles Optional The Hadoop configurations or roles for a specific application manager and distribution. You should not use unsupported roles to create a cluster.

cluster Commands

The cluster {*} commands let you connect to clusters, create and delete clusters, stop and start clusters, and perform cluster management operations.

cluster config Command

The cluster config command lets you modify the configuration of an existing Hadoop or HBase cluster, whether the cluster is configured according to the Serengeti defaults or you have customized the cluster.
NOTE The cluster config command can only be used with clusters that were created with the default application manager. For those clusters that were created with either Ambari or Cloudera Manager, any cluster configuration changes should be made from the application manager. Also, new services and configurations changed in the external application manager cannot be synced from Big Data Extensions.
You can use the cluster config command with the cluster export command to return cluster services and the original Hadoop configuration to normal in the following situations:
A service such as NameNode, JobTracker, DataNode, or TaskTracker goes down.
You manually changed the Hadoop configuration of one or more of the nodes in a cluster.
Run the cluster export command, and then run the cluster config command. Include the new cluster specification file that you just exported.
If the external HDFS cluster was created by Big Data Extensions, use the cluster config command to add the HBase cluster topology to the HDFS cluster.
The following example depicts the specification file to add the topology:
"configuration" : { "hadoop" : { "topology.data": { "text": "10.1.1.1 /rack4,10.2.2.2 /rack4" } } }
Parameter Mandatory/Optional Description
--name cluster_name_in_Serengeti Mandatory Name of Hadoop cluster to configure.
--specFile spec_file_path Optional File name of Hadoop cluster specification.
--yes Optional Answer Y to Y/N confirmation. If not specified, manually type y or n.
--skipConfigValidation Optional Skip cluster configuration validation.
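The topology example in this section stores the mapping as a single text value that contains comma-separated ip rack pairs. The following Python sketch, which is not part of Serengeti, builds that fragment from a dictionary of addresses. The function name topology_fragment is hypothetical, and the sketch assumes that the fragment layout is exactly the one shown in the example.

```python
import json


def topology_fragment(ip_to_rack):
    """Build the "configuration" fragment that carries a topology.data entry.

    `ip_to_rack` maps each node IP address to its rack, for example
    {"10.1.1.1": "/rack4"}. Pairs are joined in insertion order.
    """
    text = ",".join(f"{ip} {rack}" for ip, rack in ip_to_rack.items())
    return {"configuration": {"hadoop": {"topology.data": {"text": text}}}}


# The addresses and rack names repeat the values from the example above.
fragment = topology_fragment({"10.1.1.1": "/rack4", "10.2.2.2": "/rack4"})
print(json.dumps(fragment))
```

You would still merge the printed fragment into the cluster specification file by hand before you run the cluster config command.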

cluster create Command

You use the cluster create command to create a Hadoop or HBase cluster.
If the cluster specification does not include the required nodes, for example a master node, the Serengeti Management Server creates the cluster according to the default cluster configuration that Serengeti Management Server deploys.
Parameter Mandatory or Optional Description
--name cluster_name_in_Serengeti Mandatory Cluster name.
--networkName management_network_name Mandatory Network to use for management traffic in Hadoop clusters. If you omit any of the optional network parameters, the traffic associated with that parameter is routed on the management network that you specify with the --networkName parameter.
--adminGroupName admin_group_name Optional Administrative group to use for this cluster as defined in Active Directory or LDAP.
--userGroupName user_group_name Optional User group to use for this cluster as defined in Active Directory or LDAP.
--appmanager appmanager_name Optional Name of an application manager other than the default to manage your clusters.
--type cluster_type Optional Cluster type: Hadoop (default) or HBase.
--password Optional Custom password for all the nodes in the cluster. Do not use if you use the --resume parameter. Passwords must be from 8 to 20 characters, use only visible lower-ASCII characters (no spaces), and must contain at least one uppercase alphabetic character (A - Z), at least one lowercase alphabetic character (a - z), at least one digit (0 - 9), and at least one of the following special characters: _, @, #, $, %, ^, &, *
--specFile spec_file_path Optional Cluster specification filename. For compute-only clusters, you must revise the spec file to point to an external HDFS.
--distro Hadoop_distro_name Optional Hadoop distribution for the cluster.
--dsNames datastore_names Optional Datastore to use to deploy Hadoop cluster in Serengeti. Multiple datastores can be used, separated by comma. By default, all available datastores are used. When you specify the --dsNames parameter, the cluster can use only those datastores that you provide in this command.
--hdfsNetworkName hdfs_network_name Optional Network to use for HDFS traffic in Hadoop clusters.
--mapredNetworkName mapred_network_name Optional Network to use for MapReduce traffic in Hadoop clusters.
--rpNames resource_pool_name Optional Resource pool to use for Hadoop clusters. Multiple resource pools can be used, separated by comma.
--resume Optional Recover from a failed deployment process. Do not use if you use the --password parameter.
--topology topology_type Optional Topology type for rack awareness: HVE, RACK_AS_RACK, or HOST_AS_RACK.
--yes Optional Confirmation whether to proceed following an error message. If the responses are not specified, you can type y or n. If you specify y, the cluster creation continues. If you do not specify y, the CLI presents the following prompt after displaying the warning message:
Are you sure you want to continue (Y/N)?
--skipConfigValidation Optional Skip cluster configuration validation.
--skipVcRefresh true Optional When performing cluster operations in a large vCenter Server environment, refreshing the inventory list may take considerable time. You can improve cluster creation or resumption performance using this parameter.
NOTE If Serengeti Management Server shares the vCenter Server environment with other workloads, do not use this parameter. Serengeti Management Server cannot track the resource usage of other products' workloads, and must refresh the inventory list in such circumstances.
--localRepoURL Optional Option to create a local yum repository.
--externalMapReduce FQDN_of_Jobtracker/ResourceManager:port Optional The port number is optional.
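The password rules for the --password parameter can be checked before you run the cluster create command. The following Python sketch is one possible reading of those rules and is not the validation that Serengeti performs. The function name password_problems and the sample passwords are hypothetical, and the sketch assumes that "visible lower-ASCII characters" means the characters from ! through ~ (code points 0x21 through 0x7E).

```python
SPECIAL_CHARACTERS = set("_@#$%^&*")


def password_problems(password):
    """Return a list of rule violations; an empty list means the password passes."""
    problems = []
    if not 8 <= len(password) <= 20:
        problems.append("length must be 8 to 20 characters")
    # Assumption: visible lower-ASCII is the range 0x21 ('!') through 0x7E ('~').
    if any(not ("!" <= ch <= "~") for ch in password):
        problems.append("only visible lower-ASCII characters without spaces are allowed")
    if not any("A" <= ch <= "Z" for ch in password):
        problems.append("at least one uppercase character (A-Z) is required")
    if not any("a" <= ch <= "z" for ch in password):
        problems.append("at least one lowercase character (a-z) is required")
    if not any("0" <= ch <= "9" for ch in password):
        problems.append("at least one digit (0-9) is required")
    if not any(ch in SPECIAL_CHARACTERS for ch in password):
        problems.append("at least one of _ @ # $ % ^ & * is required")
    return problems


# Hypothetical sample passwords.
for candidate in ("Serengeti_2015", "short"):
    issues = password_problems(candidate)
    print(candidate, "->", "OK" if not issues else "; ".join(issues))
```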

cluster delete Command

The cluster delete command lets you delete a cluster in Serengeti. When a cluster is deleted, all its virtual machines and resource pools are destroyed.
Parameter Mandatory/Optional Description
--name cluster_name Mandatory Name of cluster to delete
--template template_name Optional The template to use for clusters. If there is more than one template virtual machine, you must specify this parameter.

cluster expand Command

The cluster expand command lets you expand and update the nodes in a Big Data cluster.
You can expand an existing Big Data cluster with the cluster expand command. Edit the cluster's specification file to include additional nodes and other available resources, and use the cluster expand command to apply the configuration to the existing cluster.
Parameter Mandatory/Optional Description
--name cluster_name Mandatory Name of cluster to expand.
--specFile spec_file_path Mandatory Cluster specification filename.

cluster export Command

The cluster export command lets you export cluster data. Depending on the options and parameters that you specify, you can export the cluster data to a specific location, format the delimiter of the export file, specify the type of data to export, and indicate the value for the topology.
You can use either of the following commands to export the cluster specification file.
cluster export --name cluster_name --specFile path_to_file
The use of the specfile parameter with the cluster export command is deprecated in Big Data Extensions 2.1.
cluster export --name cluster_name --type SPEC --output path_to_file
You can use the cluster export command to print the IP to RACK mapping table. Each line of the output has the format ip rack. The external HDFS cluster can use the cluster export command to implement the data location of the HBase and MapReduce cluster.
You can use the cluster export command to print the IP for the management network of all nodes in a cluster.
You can use the cluster export command to print the IP to FQDN mapping table for all nodes in a cluster. You can choose to display the mapping table on the terminal, or export it to a file.
cluster export --name cluster_name --type IP2FQDN
cluster export --name cluster_name --type IP2FQDN --output path_to_file
Parameter Mandatory or Optional Description
--name cluster_name Mandatory Name of cluster to export
--type SPEC|RACK|IP|FQDN|IP2FQDN Optional Type of data to export. The value can be one of the following items: SPEC, the default, to export a spec file; RACK to export the rack topology of all nodes; IP to export the IP of all nodes; FQDN to export a cluster FQDN IP mapping of all nodes; IP2FQDN to export the IP to FQDN mapping table for all nodes in a cluster.
--output path_to_output_file Optional Output file in which to save the exported data
--specfile path_to_spec_file Optional Output file in which to save the cluster specification.
--topology [HOST_AS_RACK|RACK_AS_RACK|HVE|NONE] Optional Value for the topology. The default value is the topology that you specified when you created the cluster.
--delimiter Optional Symbol or string to separate each line in the result. The default value is \n, line by line.
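If you save the rack mapping to a file, you might want to read it back into a lookup table. The following Python sketch parses records of the form ip rack that are separated by the delimiter that you chose. It is an illustration only. The function name parse_rack_mapping and the sample data are hypothetical, and the sketch assumes that the exported text contains nothing except those records.

```python
def parse_rack_mapping(exported, delimiter="\n"):
    """Parse 'ip rack' records separated by `delimiter` into a dictionary.

    Empty records are ignored. A record without both an IP address and a
    rack raises ValueError.
    """
    mapping = {}
    for record in exported.split(delimiter):
        record = record.strip()
        if not record:
            continue
        fields = record.split(None, 1)  # split on the first run of whitespace
        if len(fields) != 2:
            raise ValueError(f"expected 'ip rack', got: {record!r}")
        ip, rack = fields
        mapping[ip] = rack.strip()
    return mapping


# Hypothetical export result that uses the default line-by-line delimiter.
sample_output = "10.1.1.1 /rack4\n10.2.2.2 /rack4\n"
for ip, rack in parse_rack_mapping(sample_output).items():
    print(ip, "is in", rack)
```

Pass the same string that you gave to the --delimiter parameter as the delimiter argument when the export does not use the default.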

cluster fix Command

The cluster fix command lets you recover from a failed disk.
IMPORTANT Even if you changed the user password on the nodes, the changed password is not used for the new nodes that are created by the disk recovery operation. If you set the initial administrator password when you created the cluster, that initial administrator password is used for the new nodes. If you did not set the initial administrator password when you created the cluster, new random passwords are used for the new nodes.
Table 91.
Parameter Mandatory/Optional Description
--name cluster_name Mandatory Name of cluster that has a failed disk.
--disk Mandatory Recover node disks.
--nodeGroup nodegroup_name Optional Perform scan and recovery only on the specified node group, not on all the management nodes in the cluster.

cluster list Command

The cluster list command lets you view a list of provisioned clusters in Serengeti. You can see the following information: name, distribution, status, and information about each node group. The node group information consists of the instance count, CPU, memory, type, and size.
The application managers monitor the services and functions of your Big Data Extensions environment. Big Data Extensions syncs up the status from the application managers periodically. You can use the cluster list command to get the latest status of your environment. If there are any warnings displayed, you can check the details from the application manager console.
Table 92.
Parameter Mandatory/Optional Description
--name cluster_name_in_Serengeti Optional Name of cluster to list.
--detail Optional List cluster details, including name in Serengeti, distribution, deploy status, and each node’s information in different roles. If you specify this option, Serengeti queries the vCenter Server to get the latest node status.

cluster resetParam Command

The cluster resetParam command lets you reset the ioShares level for a cluster to default values.
Table 93.
Parameter Mandatory/Optional Description
--name cluster_name Mandatory Name of cluster for which to reset scaling parameters.
--ioShares Optional Reset to NORMAL.

cluster resize Command

The cluster resize command lets you change the number of nodes in a node group or scale up or down the CPU or RAM of the virtual machines in a node group. When you add nodes, the newly created nodes have the same services and configurations as the original nodes. When you delete nodes, Serengeti Management Server only allows tasktracker and nodemanager roles to be deleted. You must specify at least one optional parameter.
If you specify the --instanceNum parameter, you cannot specify either the --cpuNumPerNode parameter or the --memCapacityMbPerNode parameter.
You can specify the --cpuNumPerNode and the --memCapacityMbPerNode parameters at the same time to scale the CPU and RAM with a single command.
IMPORTANT Even if you changed the user password on the nodes, the changed password is not used for the new nodes that are created by the cluster resize operation. If you set the initial administrator password when you created the cluster, that initial administrator password is used for the new nodes. If you did not set the initial administrator password when you created the cluster, new random passwords are used for the new nodes.
Parameter Mandatory/Optional Description
--name cluster_name Mandatory Target the Hadoop cluster that was deployed by Serengeti Management Server.
--nodeGroup name_of_the_node_group Mandatory Target node group to scale out in the cluster that was deployed by Serengeti Management Server.
--instanceNum instance_number Optional New instance number to scale to. If it is greater than the original count, Serengeti Management Server creates new nodes in the target node group. If it is less than the original count, Serengeti Management Server deletes nodes in the target node group. If the cluster resize operation fails, you can use the target instance number again to retry the cluster resize operation.
--cpuNumPerNode num_of_vCPUs Optional Number of vCPUs in a virtual machine in a target node group.
--force Optional When scaling out a cluster, you can overcome hardware or software failures using the --force parameter. Applying this parameter allows the cluster resize operation to proceed without being blocked by limited virtual machine failures.
--memCapacityMbPerNode size_in_MB Optional Memory size, in MB, of each virtual machine in a target node group.
--skipVcRefresh true Optional When performing cluster operations in a large vCenter Server environment, refreshing the inventory list may take considerable time. You can improve cluster resize performance using this parameter.
NOTE If Serengeti Management Server shares the vCenter Server environment with other workloads, do not use this parameter. Serengeti Management Server cannot track the resource usage of other products' workloads, and must refresh the inventory list in such circumstances.
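The rule that --instanceNum cannot be combined with --cpuNumPerNode or --memCapacityMbPerNode is easy to overlook when you script resize operations. The following Python sketch assembles a cluster resize command line and rejects the invalid combinations. It is an illustration only and is not part of the Serengeti CLI. The function name build_resize_command and the cluster and node group names are hypothetical. The sketch covers only the three scaling parameters and requires one of them, which is stricter than the documented rule that any one optional parameter is sufficient.

```python
def build_resize_command(name, node_group, instance_num=None,
                         cpu_num_per_node=None, mem_capacity_mb_per_node=None):
    """Return a cluster resize command line, or raise ValueError.

    Either instance_num, or cpu_num_per_node and/or mem_capacity_mb_per_node,
    must be given, but not both kinds at once.
    """
    scaling_vm = (cpu_num_per_node is not None
                  or mem_capacity_mb_per_node is not None)
    if instance_num is None and not scaling_vm:
        raise ValueError(
            "specify --instanceNum, --cpuNumPerNode, or --memCapacityMbPerNode")
    if instance_num is not None and scaling_vm:
        raise ValueError(
            "--instanceNum cannot be combined with --cpuNumPerNode "
            "or --memCapacityMbPerNode")
    parts = ["cluster resize", "--name", name, "--nodeGroup", node_group]
    if instance_num is not None:
        parts += ["--instanceNum", str(instance_num)]
    if cpu_num_per_node is not None:
        parts += ["--cpuNumPerNode", str(cpu_num_per_node)]
    if mem_capacity_mb_per_node is not None:
        parts += ["--memCapacityMbPerNode", str(mem_capacity_mb_per_node)]
    return " ".join(parts)


# Hypothetical cluster and node group names.
print(build_resize_command("myHadoop", "worker", instance_num=5))
print(build_resize_command("myHadoop", "worker",
                           cpu_num_per_node=4, mem_capacity_mb_per_node=8192))
```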