
VMware vSphere Big Data Extensions
Command-Line Interface Guide
vSphere Big Data Extensions 2.0
This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions of this document, see http://www.vmware.com/support/pubs.
EN-001513-00
You can find the most up-to-date technical documentation on the VMware Web site at:
http://www.vmware.com/support/
The VMware Web site also provides the latest product updates.
If you have comments about this documentation, submit your feedback to:
docfeedback@vmware.com
Copyright © 2013, 2014 VMware, Inc. All rights reserved. Copyright and trademark information. This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 United States License
(http://creativecommons.org/licenses/by-nd/3.0/us/legalcode).
VMware, Inc.
3401 Hillview Ave. Palo Alto, CA 94304 www.vmware.com

Contents

About This Book

1 Using the Serengeti Remote Command-Line Interface Client
  Access the Serengeti CLI By Using the Remote Command-Line Interface Client

2 Managing vSphere Resources for Hadoop and HBase Clusters
  Add a Resource Pool with the Serengeti Command-Line Interface
  Remove a Resource Pool with the Serengeti Command-Line Interface
  Add a Datastore with the Serengeti Command-Line Interface
  Remove a Datastore with the Serengeti Command-Line Interface
  Add a Network with the Serengeti Command-Line Interface
  Reconfigure a Static IP Network with the Serengeti Command-Line Interface
  Remove a Network with the Serengeti Command-Line Interface

3 Creating Hadoop and HBase Clusters
  About Hadoop and HBase Cluster Deployment Types
  Serengeti’s Default Hadoop Cluster Configuration
  Create a Default Serengeti Hadoop Cluster with the Serengeti Command-Line Interface
  Create a Cluster with a Custom Administrator Password with the Serengeti Command-Line Interface
  Create a Cluster with an Available Distribution with the Serengeti Command-Line Interface
  Create a Hadoop Cluster with Assigned Resources with the Serengeti Command-Line Interface
  Create a Cluster with Multiple Networks with the Serengeti Command-Line Interface
  Create a MapReduce v2 (YARN) Cluster with the Serengeti Command-Line Interface
  Create a Customized Hadoop or HBase Cluster with the Serengeti Command-Line Interface
  Create a Hadoop Cluster with Any Number of Master, Worker, and Client Nodes
  Create a Data-Compute Separated Cluster with No Node Placement Constraints
  Create a Data-Compute Separated Cluster with Placement Policy Constraints
  Create a Compute-Only Cluster with the Serengeti Command-Line Interface
  Create a Basic Cluster with the Serengeti Command-Line Interface
  About Cluster Topology
  Create a Cluster with Topology Awareness with the Serengeti Command-Line Interface
  Create a Data-Compute Separated Cluster with Topology Awareness and Placement Constraints
  Serengeti’s Default HBase Cluster Configuration
  Create a Default HBase Cluster with the Serengeti Command-Line Interface
  Create an HBase Cluster with vSphere HA Protection with the Serengeti Command-Line Interface

4 Managing Hadoop and HBase Clusters
  Stop and Start a Hadoop or HBase Cluster with the Serengeti Command-Line Interface
  Scale Out a Hadoop or HBase Cluster with the Serengeti Command-Line Interface
  Scale CPU and RAM with the Serengeti Command-Line Interface
  Reconfigure a Hadoop or HBase Cluster with the Serengeti Command-Line Interface
  About Resource Usage and Elastic Scaling
  Delete a Hadoop or HBase Cluster with the Serengeti Command-Line Interface
  About vSphere High Availability and vSphere Fault Tolerance
  Reconfigure a Node Group with the Serengeti Command-Line Interface
  Recover from Disk Failure with the Serengeti Command-Line Interface Client

5 Monitoring the Big Data Extensions Environment
  View Available Hadoop Distributions with the Serengeti Command-Line Interface
  View Provisioned Hadoop and HBase Clusters with the Serengeti Command-Line Interface
  View Datastores with the Serengeti Command-Line Interface
  View Networks with the Serengeti Command-Line Interface
  View Resource Pools with the Serengeti Command-Line Interface

6 Using Hadoop Clusters from the Serengeti Command-Line Interface
  Run HDFS Commands with the Serengeti Command-Line Interface
  Run MapReduce Jobs with the Serengeti Command-Line Interface
  Run Pig and PigLatin Scripts with the Serengeti Command-Line Interface
  Run Hive and Hive Query Language Scripts with the Serengeti Command-Line Interface

7 Cluster Specification Reference
  Cluster Specification File Requirements
  Cluster Definition Requirements
  Annotated Cluster Specification File
  Cluster Specification Attribute Definitions
  White Listed and Black Listed Hadoop Attributes
  Convert Hadoop XML Files to Serengeti JSON Files

8 Serengeti CLI Command Reference
  cfg Commands
  cluster Commands
  connect Command
  datastore Commands
  disconnect Command
  distro list Command
  fs Commands
  hive script Command
  mr Commands
  network Commands
  pig script Command
  resourcepool Commands
  topology Commands

Index

About This Book

VMware vSphere Big Data Extensions Command-Line Interface Guide describes how to use the Serengeti Command-Line Interface (CLI) to manage the vSphere resources that you use to create Hadoop and HBase clusters, and how to create, manage, and monitor Hadoop and HBase clusters with the Serengeti CLI.
VMware vSphere Big Data Extensions Command-Line Interface Guide also describes how to perform Hadoop and HBase operations with the Serengeti CLI, and provides cluster specification and Serengeti CLI command references.
Intended Audience
This guide is for system administrators and developers who want to use Serengeti to deploy and manage Hadoop clusters. To successfully work with Serengeti, you should be familiar with Hadoop and VMware vSphere®.
VMware Technical Publications Glossary
VMware Technical Publications provides a glossary of terms that might be unfamiliar to you. For definitions of terms as they are used in VMware technical documentation, go to http://www.vmware.com/support/pubs.
1 Using the Serengeti Remote Command-Line Interface Client
The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management Server to deploy, manage, and use Hadoop.

Access the Serengeti CLI By Using the Remote Command-Line Interface Client

You can access the Serengeti Command-Line Interface (CLI) to perform Serengeti administrative tasks with the Serengeti Remote CLI Client.
IMPORTANT You can only run Hadoop commands from the Serengeti CLI on a cluster running the Apache Hadoop 1.2.1 distribution. To run Hadoop administrative commands from the command line, such as cfg, fs, mr, pig, and hive, for clusters running other Hadoop distributions, use a Hadoop client node.
Prerequisites
- Use the vSphere Web Client to log in to the vCenter Server on which you deployed the Serengeti vApp.
- Verify that the Serengeti vApp deployment was successful and that the Management Server is running.
- Verify that you have the correct password to log in to the Serengeti CLI. See the VMware vSphere Big Data Extensions Administrator's and User's Guide. The Serengeti CLI uses its vCenter Server credentials.
- Verify that the Java Runtime Environment (JRE) is installed in your environment and that its location is in your PATH environment variable.
Procedure
1 Open a Web browser to connect to the Serengeti Management Server cli directory.
http://ip_address/cli
2 Download the ZIP file for your version and build.
The filename is in the format VMware-Serengeti-cli-version_number-build_number.ZIP.
3 Unzip the download.
The download includes the following components.
- The serengeti-cli-version_number JAR file, which includes the Serengeti Remote CLI Client.
- The samples directory, which includes sample cluster configurations.
- Libraries in the lib directory.
4 Open a command shell, and change to the directory where you unzipped the package.
5 Change to the cli directory, and run the following command to enter the Serengeti CLI.
- For any language other than French or German, run the following command.
java -jar serengeti-cli-version_number.jar
- For French or German, which use code page 850 (CP 850) language encoding when running the Serengeti CLI from a Windows command console, run the following command.
java -Dfile.encoding=cp850 -jar serengeti-cli-version_number.jar
6 Connect to the Serengeti service.
You must run the connect host command every time you begin a CLI session, and again after the 30 minute session timeout. If you do not run this command, you cannot run any other commands.
a Run the connect command.
connect --host xx.xx.xx.xx:8443
b At the prompt, type your user name, which might be different from your login credentials for the Serengeti Management Server.
NOTE If you do not create a user name and password for the Serengeti Command-Line Interface Client, you can use the default vCenter Server administrator credentials. The Serengeti Command-Line Interface Client uses the vCenter Server login credentials with read permissions on the Serengeti Management Server.
c At the prompt, type your password.
A command shell opens, and the Serengeti CLI prompt appears. You can use the help command to get help with Serengeti commands and command syntax.
- To display a list of available commands, type help.
- To get help for a specific command, append the name of the command to the help command. For example:
help cluster create
- Press Tab to complete a command.
2 Managing vSphere Resources for Hadoop and HBase Clusters
Big Data Extensions lets you manage the resource pools, datastores, and networks that you use in the Hadoop and HBase clusters that you create.
- Add a Resource Pool with the Serengeti Command-Line Interface
You add resource pools to make them available for use by Hadoop clusters. Resource pools must be located at the top level of a cluster. Nested resource pools are not supported.
- Remove a Resource Pool with the Serengeti Command-Line Interface
You can remove resource pools from Serengeti that are not in use by a Hadoop cluster. You remove resource pools when you do not need them or if you want the Hadoop clusters you create in the Serengeti Management Server to be deployed under a different resource pool. Removing a resource pool removes its reference in vSphere. The resource pool is not deleted.
- Add a Datastore with the Serengeti Command-Line Interface
You can add shared and local datastores to the Serengeti server to make them available to Hadoop clusters.
- Remove a Datastore with the Serengeti Command-Line Interface
You can remove any datastore from Serengeti that is not referenced by any Hadoop clusters. Removing a datastore removes only the reference to the vCenter Server datastore. The datastore itself is not deleted.
- Add a Network with the Serengeti Command-Line Interface
You add networks to Serengeti to make their IP addresses available to Hadoop clusters. A network is a port group, as well as a means of accessing the port group through an IP address.
- Reconfigure a Static IP Network with the Serengeti Command-Line Interface
You can reconfigure a Serengeti static IP network by adding IP address segments to it. You might need to add IP address segments so that there is enough capacity for a cluster that you want to create.
- Remove a Network with the Serengeti Command-Line Interface
You can remove networks from Serengeti that are not referenced by any Hadoop clusters. Removing an unused network frees the IP addresses for reuse.

Add a Resource Pool with the Serengeti Command-Line Interface

You add resource pools to make them available for use by Hadoop clusters. Resource pools must be located at the top level of a cluster. Nested resource pools are not supported.
When you add a resource pool to Big Data Extensions it symbolically represents the actual vSphere resource pool as recognized by vCenter Server. This symbolic representation lets you use the Big Data Extensions resource pool name, instead of the full path of the resource pool in vCenter Server, in cluster specification files.
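For example, if you register a vSphere resource pool under the short name myRP, later cluster creation commands can refer to it by that name rather than its full vCenter Server path, as in this command that appears again later in this guide:
cluster create --name myHadoop --rpNames myRP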
Prerequisites
Deploy Big Data Extensions.
Procedure
1 Access the Serengeti Command-Line Interface client.
2 Run the resourcepool add command.
The --vcrp parameter is optional.
This example adds a Serengeti resource pool named myRP to the vSphere rp1 resource pool that is contained by the cluster1 vSphere cluster.
resourcepool add --name myRP --vccluster cluster1 --vcrp rp1
What to do next
After you add a resource pool to Big Data Extensions, do not rename the resource pool in vSphere. If you rename it, you cannot perform Serengeti operations on clusters that use that resource pool.

Remove a Resource Pool with the Serengeti Command-Line Interface

You can remove resource pools from Serengeti that are not in use by a Hadoop cluster. You remove resource pools when you do not need them or if you want the Hadoop clusters you create in the Serengeti Management Server to be deployed under a different resource pool. Removing a resource pool removes its reference in vSphere. The resource pool is not deleted.
Procedure
1 Access the Serengeti Command-Line Interface client.
2 Run the resourcepool delete command.
If the command fails because the resource pool is referenced by a Hadoop cluster, you can use the resourcepool list command to see which cluster is referencing the resource pool.
This example deletes the resource pool named myRP.
resourcepool delete --name myRP

Add a Datastore with the Serengeti Command-Line Interface

You can add shared and local datastores to the Serengeti server to make them available to Hadoop clusters.
Procedure
1 Access the Serengeti CLI.
2 Run the datastore add command.
This example adds a new, local storage datastore named myLocalDS. The --spec parameter’s value, local*, is a wildcard specifying a set of vSphere datastores. All vSphere datastores whose names begin with “local” are added and managed as a whole by Serengeti.
datastore add --name myLocalDS --spec local* --type LOCAL
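A shared datastore is added with the same command by specifying the SHARED type. In this sketch, the datastore name mySharedDS and the share* wildcard are illustrative:
datastore add --name mySharedDS --spec share* --type SHARED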
What to do next
After you add a datastore to Big Data Extensions, do not rename the datastore in vSphere. If you rename it, you cannot perform Serengeti operations on clusters that use that datastore.

Remove a Datastore with the Serengeti Command-Line Interface

You can remove any datastore from Serengeti that is not referenced by any Hadoop clusters. Removing a datastore removes only the reference to the vCenter Server datastore. The datastore itself is not deleted.
You remove datastores if you do not need them or if you want to deploy the Hadoop clusters that you create in the Serengeti Management Server under a different datastore.
Procedure
1 Access the Serengeti CLI.
2 Run the datastore delete command.
If the command fails because the datastore is referenced by a Hadoop cluster, you can use the datastore list command to see which cluster is referencing the datastore.
This example deletes the myDS datastore.
datastore delete --name myDS

Add a Network with the Serengeti Command-Line Interface

You add networks to Serengeti to make their IP addresses available to Hadoop clusters. A network is a port group, as well as a means of accessing the port group through an IP address.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add the network.
Procedure
1 Access the Serengeti CLI.
2 Run the network add command.
This example adds a network named myNW to the 10PG vSphere port group. Virtual machines that use this network use DHCP to obtain the IP addresses.
network add --name myNW --portGroup 10PG --dhcp
This example adds a network named myNW to the 10PG vSphere port group. Hadoop nodes use addresses in the 192.168.1.2-100 IP address range, the DNS server IP address is 10.111.90.2, the gateway address is 192.168.1.1, and the subnet mask is 255.255.255.0.
network add --name myNW --portGroup 10PG --ip 192.168.1.2-100 --dns 10.111.90.2 --gateway 192.168.1.1 --mask 255.255.255.0
To specify multiple IP address segments, use multiple strings to express the IP address range in the format xx.xx.xx.xx-xx[,xx]*. For example:
xx.xx.xx.xx-xx, xx.xx.xx.xx-xx, single_ip, single_ip
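A concrete version of that format, with illustrative address ranges, might look like this:
network add --name myNW --portGroup 10PG --ip 192.168.1.2-100,192.168.1.120-150,192.168.1.201 --dns 10.111.90.2 --gateway 192.168.1.1 --mask 255.255.255.0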
What to do next
After you add a network to Big Data Extensions, do not rename it in vSphere. If you rename the network, you cannot perform Serengeti operations on clusters that use that network.

Reconfigure a Static IP Network with the Serengeti Command-Line Interface

You can reconfigure a Serengeti static IP network by adding IP address segments to it. You might need to add IP address segments so that there is enough capacity for a cluster that you want to create.
If the IP range that you specify includes IP addresses that are already in the network, Serengeti ignores the duplicated addresses. The remaining addresses in the specified range are added to the network. If the network is already used by a cluster, the cluster can use the new IP addresses after you add them to the network. If only part of the IP range is used by a cluster, the unused IP addresses can be used when you create a new cluster.
Prerequisites
If your network uses static IP addresses, be sure that the addresses are not occupied before you add them to the network.
Procedure
1 Access the Serengeti CLI.
2 Run the network modify command.
This example adds IP addresses from 192.168.1.2 to 192.168.1.100 to a network named myNetwork.
network modify --name myNetwork --addIP 192.168.1.2-100

Remove a Network with the Serengeti Command-Line Interface

You can remove networks from Serengeti that are not referenced by any Hadoop clusters. Removing an unused network frees the IP addresses for reuse.
Procedure
1 Access the Serengeti CLI.
2 Run the network delete command.
network delete --name network_name
If the command fails because the network is referenced by a Hadoop cluster, you can use the network list --detail command to see which cluster is referencing the network.

3 Creating Hadoop and HBase Clusters

Big Data Extensions lets you create and deploy Hadoop and HBase clusters. A Hadoop or HBase cluster is a special type of computational cluster designed specifically for storing and analyzing large amounts of unstructured data in a distributed computing environment.
The resource requirements are different for clusters created with the Serengeti Command-Line Interface and the Big Data Extensions plug-in for the vSphere Web Client because the clusters use different default templates. The default clusters created through the Serengeti Command-Line Interface are targeted for Project Serengeti users and proof-of-concept applications, and are smaller than the Big Data Extensions plug-in templates, which are targeted for larger deployments for commercial use.
Additionally, some deployment configurations require more resources than other configurations. For example, if you create a Greenplum HD 1.2 cluster, you cannot use the SMALL size virtual machine. If you create a default MapR or Greenplum HD cluster through the Serengeti Command-Line Interface, at least 550GB of storage and 55GB of memory are recommended. For other Hadoop distributions, at least 350GB of storage and 35GB of memory are recommended.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the cluster's virtual machine automatic migration. Although this prevents vSphere from automatically migrating the virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing such management functions outside of the Big Data Extensions environment can make it impossible for you to perform some Big Data Extensions operations, such as disk failure recovery.
- About Hadoop and HBase Cluster Deployment Types
Big Data Extensions lets you deploy several types of Hadoop and HBase clusters. You need to know about the types of clusters that you can create.
- Serengeti’s Default Hadoop Cluster Configuration
For basic Hadoop deployments, such as proof of concept projects, you can use Serengeti’s default Hadoop cluster configuration for clusters that are created with the Command-Line Interface.
- Create a Default Serengeti Hadoop Cluster with the Serengeti Command-Line Interface
You can create as many clusters as you want in your Serengeti environment, but your environment must meet all prerequisites.
- Create a Cluster with a Custom Administrator Password with the Serengeti Command-Line Interface
When you create a cluster, you can assign a custom administrator password to all the nodes in the cluster. Custom administrator passwords let you directly log in to the cluster's nodes instead of having to first log in to the Serengeti Management server.
- Create a Cluster with an Available Distribution with the Serengeti Command-Line Interface
You can choose which Hadoop distribution to use when you deploy a cluster. If you do not specify a Hadoop distribution, the resulting cluster includes the default distribution, Apache Hadoop.
- Create a Hadoop Cluster with Assigned Resources with the Serengeti Command-Line Interface
By default, when you use Serengeti to deploy a Hadoop cluster, the cluster might contain any or all available resources: vCenter Server resource pool for the virtual machine's CPU and memory, datastores for the virtual machine's storage, and a network. You can assign which resources the cluster uses by specifying specific resource pools, datastores, and/or a network when you create the Hadoop cluster.
- Create a Cluster with Multiple Networks with the Serengeti Command-Line Interface
When you create a cluster, you can distribute the management, HDFS, and MapReduce traffic to separate networks. You might want to use separate networks to improve performance or to isolate traffic for security reasons.
- Create a MapReduce v2 (YARN) Cluster with the Serengeti Command-Line Interface
You can create a MapReduce v2 (YARN) cluster with the Serengeti Command-Line Interface.
- Create a Customized Hadoop or HBase Cluster with the Serengeti Command-Line Interface
You can create clusters that are customized for your requirements, including the number of nodes, virtual machine RAM and disk size, the number of CPUs, and so on.
- Create a Hadoop Cluster with Any Number of Master, Worker, and Client Nodes
You can create a Hadoop cluster with any number of master, worker, and client nodes.
- Create a Data-Compute Separated Cluster with No Node Placement Constraints
You can create a cluster with separate data and compute nodes, without node placement constraints.
- Create a Data-Compute Separated Cluster with Placement Policy Constraints
You can create a cluster with separate data and compute nodes, and define placement policy constraints to distribute the nodes among the virtual machines as you want.
- Create a Compute-Only Cluster with the Serengeti Command-Line Interface
You can create compute-only clusters to run MapReduce jobs on existing HDFS clusters, including storage solutions that serve as an external HDFS.
- Create a Basic Cluster with the Serengeti Command-Line Interface
You can create a basic cluster in your Serengeti environment. A basic cluster is a group of virtual machines provisioned and managed by Serengeti. Serengeti helps you to plan and provision the virtual machines to your specifications. You can use the basic cluster's virtual machines to install Big Data applications.
- About Cluster Topology
You can improve workload balance across your cluster nodes, and improve performance and throughput, by specifying how Hadoop virtual machines are placed using topology awareness. For example, you can have separate data and compute nodes, and improve performance and throughput by placing the nodes on the same set of physical hosts.
- Create a Cluster with Topology Awareness with the Serengeti Command-Line Interface
To achieve a balanced workload or to improve performance and throughput, you can control how Hadoop virtual machines are placed by adding topology awareness to the Hadoop clusters. For example, you can have separate data and compute nodes, and improve performance and throughput by placing the nodes on the same set of physical hosts.
- Create a Data-Compute Separated Cluster with Topology Awareness and Placement Constraints
You can create clusters with separate data and compute nodes, and define topology and placement policy constraints to distribute the nodes among the physical racks and the virtual machines.
- Serengeti’s Default HBase Cluster Configuration
HBase clusters are required for you to build big table applications. To run HBase MapReduce jobs, configure the HBase cluster to include JobTracker nodes or TaskTracker nodes.
- Create a Default HBase Cluster with the Serengeti Command-Line Interface
Serengeti supports deploying HBase clusters on HDFS.
- Create an HBase Cluster with vSphere HA Protection with the Serengeti Command-Line Interface
You can create HBase clusters with separated Hadoop NameNode and HBase Master roles, and configure vSphere HA protection for the Masters.

About Hadoop and HBase Cluster Deployment Types

Big Data Extensions lets you deploy several types of Hadoop and HBase clusters. You need to know about the types of clusters that you can create.
You can create the following types of clusters.
Basic Hadoop Cluster
You can create a simple Hadoop deployment for proof of concept projects and other small scale data processing tasks using the basic Hadoop cluster.
HBase Cluster
You can create an HBase cluster. To run HBase MapReduce jobs, configure the HBase cluster to include JobTracker or TaskTracker nodes.
Data-Compute Separated Hadoop Cluster
You can separate the data and compute nodes in a Hadoop cluster, and you can control how nodes are placed on your environment's vSphere ESXi hosts.
Compute-Only Hadoop Cluster
You can create a compute-only cluster to run MapReduce jobs. Compute-only clusters run only MapReduce services that read data from external HDFS clusters and that do not need to store data.
Customized Cluster
You can use an existing cluster specification file to create clusters using the same configuration as your previously created clusters. You can also edit the file to customize the cluster configuration.
Hadoop Distributions Supporting MapReduce v1 and MapReduce v2 (YARN)
If the Hadoop distribution you use supports both MapReduce v1 and MapReduce v2 (YARN), the default Hadoop cluster configuration creates a MapReduce v2 cluster.
In addition, if you are using two different versions of a vendor's Hadoop distribution, and both versions support MapReduce v1 and MapReduce v2, the cluster you create using the latest version with the default Hadoop cluster will use MapReduce v2. Clusters you create with the earlier Hadoop version will use MapReduce v1. For example, if you have both Cloudera CDH 5 and CDH 4 installed within Big Data Extensions, clusters you create with CDH 5 will use MapReduce v2, and clusters you create with CDH 4 will use MapReduce v1.

Serengeti’s Default Hadoop Cluster Configuration

For basic Hadoop deployments, such as proof of concept projects, you can use Serengeti’s default Hadoop cluster configuration for clusters that are created with the Command-Line Interface.
The resulting cluster deployment consists of the following nodes and virtual machines:
- One master node virtual machine with NameNode and JobTracker services.
- Three worker node virtual machines, each with DataNode and TaskTracker services.
- One client node virtual machine containing the Hadoop client environment: the Hadoop client shell, Pig, and Hive.
Hadoop Distributions Supporting MapReduce v1 and MapReduce v2 (YARN)
If the Hadoop distribution you use supports both MapReduce v1 and MapReduce v2 (YARN), the default Hadoop cluster configuration creates a MapReduce v2 cluster.
In addition, if you are using two different versions of a vendor's Hadoop distribution, and both versions support MapReduce v1 and MapReduce v2, the cluster you create using the latest version with the default Hadoop cluster will use MapReduce v2. Clusters you create with the earlier Hadoop version will use MapReduce v1. For example, if you have both Cloudera CDH 5 and CDH 4 installed within Big Data Extensions, clusters you create with CDH 5 will use MapReduce v2, and clusters you create with CDH 4 will use MapReduce v1.

Create a Default Serengeti Hadoop Cluster with the Serengeti Command-Line Interface

You can create as many clusters as you want in your Serengeti environment, but your environment must meet all prerequisites.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Deploy a default Serengeti Hadoop cluster on vSphere.
cluster create --name cluster_name
The only valid characters for cluster names are alphanumeric and underscores. When you choose the cluster name, also consider the applicable vApp name. Together, the vApp and cluster names must be < 80 characters.
During the deployment process, real-time progress updates appear on the command-line.
What to do next
After the deployment finishes, you can run Hadoop commands and view the IP addresses of the Hadoop node virtual machines from the Serengeti CLI.

Create a Cluster with a Custom Administrator Password with the Serengeti Command-Line Interface

When you create a cluster, you can assign a custom administrator password to all the nodes in the cluster. Custom administrator passwords let you directly log in to the cluster's nodes instead of having to first log in to the Serengeti Management server.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command and include the --password parameter.
cluster create --name cluster_name --password
3 Enter your custom password, and enter it again.
Passwords are from 8 to 128 characters, and include only alphanumeric characters ([0-9, a-z, A-Z]) and the following special characters: _ @ # $ % ^ & *
Your custom password is assigned to all the nodes in the cluster.

Create a Cluster with an Available Distribution with the Serengeti Command-Line Interface

You can choose which Hadoop distribution to use when you deploy a cluster. If you do not specify a Hadoop distribution, the resulting cluster includes the default distribution, Apache Hadoop.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command, and include the --distro parameter.
The --distro parameter’s value must match a distribution name displayed by the distro list command.
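For example, run the distro list command first to see the names of the distributions available in your environment:
distro list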
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS and FQDN settings, the cluster creation process might fail or the cluster is created but does not function.
This example deploys a cluster with the Cloudera CDH distribution:
cluster create --name clusterName --distro cdh
This example creates a customized cluster named mycdh that uses the CDH4 Hadoop distribution, and is configured according to the /opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json sample cluster specification file. In this sample file, nameservice0 and nameservice1 are federated. That is, nameservice0 and nameservice1 are independent and do not require coordination with each other. The NameNode nodes in the nameservice0 node group are HDFS2 HA enabled. In Serengeti, name node group names are used as service names for HDFS2.
cluster create --name mycdh --distro cdh4 --specFile /opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json

Create a Hadoop Cluster with Assigned Resources with the Serengeti Command-Line Interface

By default, when you use Serengeti to deploy a Hadoop cluster, the cluster might contain any or all available resources: vCenter Server resource pool for the virtual machine's CPU and memory, datastores for the virtual machine's storage, and a network. You can assign which resources the cluster uses by specifying specific resource pools, datastores, and/or a network when you create the Hadoop cluster.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command, and specify any or all of the command’s resource parameters.
This example deploys a cluster named myHadoop on the myDS datastore, under the myRP resource pool, and uses the myNW network for virtual machine communications.
cluster create --name myHadoop --rpNames myRP --dsNames myDS --networkName myNW
Create a Cluster with Multiple Networks with the Serengeti Command-Line Interface
When you create a cluster, you can distribute the management, HDFS, and MapReduce traffic to separate networks. You might want to use separate networks to improve performance or to isolate traffic for security reasons.
For optimal performance, use the same network for HDFS and MapReduce traffic in Hadoop and Hadoop+HBase clusters. HBase clusters use the HDFS network for traffic related to the HBase Master and HBase RegionServer services.
IMPORTANT You cannot configure multiple networks for clusters that use the MapR Hadoop distribution.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create command and include the --networkName, --hdfsNetworkName, and --mapredNetworkName parameters.
cluster create --name cluster_name --networkName management_network [--hdfsNetworkName hdfs_network] [--mapredNetworkName mapred_network]
If you omit an optional network parameter, the traffic associated with that network parameter is routed on the management network that you specify by the --networkName parameter.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS and FQDN settings, the cluster creation process might fail or the cluster is created but does not function.
The cluster's management, HDFS, and MapReduce traffic is distributed among the specified networks.
Create a MapReduce v2 (YARN) Cluster with the Serengeti Command-Line Interface
You can create a MapReduce v2 (YARN) cluster with the Serengeti Command-Line Interface.
When you create a Hadoop cluster with the Serengeti Command-Line Interface, by default you create a MapReduce v1 cluster. To create a MapReduce v2 (YARN) cluster, create a cluster specification file modeled after the /opt/serengeti/samples/default_hadoop_yarn_cluster.json file, and specify the --specFile parameter and your cluster specification file in the cluster create ... command.
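The sketch below suggests what such a specification file might contain, modeled on the node group layout of the other samples in this chapter. The YARN role names hadoop_resourcemanager and hadoop_nodemanager are assumptions here; check the /opt/serengeti/samples/default_hadoop_yarn_cluster.json sample on your Management Server for the authoritative role names.
{
  "nodeGroups": [
    {
      "name": "master",
      "roles": [ "hadoop_namenode", "hadoop_resourcemanager" ],
      "instanceNum": 1,
      "instanceType": "MEDIUM"
    },
    {
      "name": "worker",
      "roles": [ "hadoop_datanode", "hadoop_nodemanager" ],
      "instanceNum": 3,
      "instanceType": "SMALL"
    }
  ]
}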
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Access the Serengeti CLI.
2 Run the cluster create ... command.
This example creates a customized MapReduce v2 cluster according to the sample cluster specification file, default_hadoop_yarn_cluster.json.
cluster create --name cluster_name --distro cdh4 --specFile /opt/serengeti/samples/default_hadoop_yarn_cluster.json

Create a Customized Hadoop or HBase Cluster with the Serengeti Command-Line Interface

You can create clusters that are customized for your requirements, including the number of nodes, virtual machine RAM and disk size, the number of CPUs, and so on.
The Serengeti package includes several annotated sample cluster specification files that you can use as models when you create your custom specification files.
- In the Serengeti Management Server, the sample cluster specification files are in /opt/serengeti/samples.
- If you use the Serengeti Remote CLI client, the sample specification files are in the client directory.
Changing a node group role might cause the cluster creation process to fail. For example, workable clusters require a NameNode, so if there are no NameNode nodes after you change node group roles, you cannot create a cluster.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics such as the node groups.
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the cluster specification file.
Use the full path to specify the file.
cluster create --name cluster_name --specFile full_path/spec_filename
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS and FQDN settings, the cluster creation process might fail or the cluster is created but does not function.

Create a Hadoop Cluster with Any Number of Master, Worker, and Client Nodes

You can create a Hadoop cluster with any number of master, worker, and client nodes.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics, including the node groups.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS and FQDN settings, the cluster creation process might fail or the cluster is created but does not function.
In this example, the cluster has one master MEDIUM size virtual machine, five worker SMALL size virtual machines, and one client SMALL size virtual machine. The instanceNum attribute configures the number of virtual machines in a node group.
{ "nodeGroups" : [ { "name": "master", "roles": [ "hadoop_namenode", "hadoop_jobtracker" ], "instanceNum": 1, "instanceType": "MEDIUM" }, { "name": "worker", "roles": [ "hadoop_datanode", "hadoop_tasktracker" ], "instanceNum": 5, "instanceType": "SMALL" }, { "name": "client", "roles": [ "hadoop_client", "hive", "hive_server", "pig" ], "instanceNum": 1,
VMware, Inc. 23
"instanceType": "SMALL" } ] }
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the cluster specification file.
cluster create --name cluster_name --specFile full_path/spec_filename

Create a Data-Compute Separated Cluster with No Node Placement Constraints

You can create a cluster with separate data and compute nodes, without node placement constraints.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS and FQDN settings, the cluster creation process might fail or the cluster is created but does not function.
In this example, the cluster has separate data and compute nodes, without node placement constraints. Four data nodes and eight compute nodes are created and put into individual virtual machines. The number of nodes is configured by the instanceNum attribute.
{ "nodeGroups":[ { "name": "master", "roles": [ "hadoop_namenode", "hadoop_jobtracker" ], "instanceNum": 1, "cpuNum": 2, "memCapacityMB": 7500, }, { "name": "data", "roles": [ "hadoop_datanode" ], "instanceNum": 4, "cpuNum": 1, "memCapacityMB": 3748, "storage": {
24 VMware, Inc.
"type": "LOCAL", "sizeGB": 50 } }, { "name": "compute", "roles": [ "hadoop_tasktracker" ], "instanceNum": 8, "cpuNum": 2, "memCapacityMB": 7500, "storage": { "type": "LOCAL", "sizeGB": 20 } }, { "name": "client", "roles": [ "hadoop_client", "hive", "pig" ], "instanceNum": 1, "cpuNum": 1, "storage": { "type": "LOCAL", "sizeGB": 50 } } ], "configuration": { } }
2 Access the Serengeti CLI.
3 Run the cluster create command and specify the cluster specification file.
cluster create --name cluster_name --specFile full_path/spec_filename

Create a Data-Compute Separated Cluster with Placement Policy Constraints

You can create a cluster with separate data and compute nodes, and define placement policy constraints to distribute the nodes among the virtual machines as you want.
CAUTION When you create a cluster with Big Data Extensions, Big Data Extensions disables the cluster's virtual machine automatic migration. Although this prevents vSphere from migrating the virtual machines, it does not prevent you from inadvertently migrating cluster nodes to other hosts by using the vCenter Server user interface. Do not use the vCenter Server user interface to migrate clusters. Performing such management functions outside of the Big Data Extensions environment might break the cluster's placement policy, such as the number of instances per host and the group associations. Even if you do not specify a placement policy, using vCenter Server to migrate clusters can break the default ROUNDROBIN placement policy constraints.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file to define the cluster's characteristics, including the node groups and placement policy constraints.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS and FQDN settings, the cluster creation process might fail or the cluster is created but does not function.
In this example, the cluster has data-compute separated nodes, and each node group has a placementPolicy constraint. After a successful provisioning, four data nodes and eight compute nodes are created and put into individual virtual machines. With the instancePerHost=1 constraint, the four data nodes are placed on four ESXi hosts. The eight compute nodes are put onto four ESXi hosts: two nodes on each ESXi host.
This cluster specification requires that you configure datastores and resource pools for at least four hosts, and that there is sufficient disk space for Serengeti to perform the necessary placements during deployment.
{ "nodeGroups":[ { "name": "master", "roles": [ "hadoop_namenode", "hadoop_jobtracker" ], "instanceNum": 1, "cpuNum": 2, "memCapacityMB": 7500, }, { "name": "data", "roles": [ "hadoop_datanode" ], "instanceNum": 4, "cpuNum": 1, "memCapacityMB": 3748, "storage": { "type": "LOCAL", "sizeGB": 50 }, "placementPolicies": { "instancePerHost": 1 } }, {
26 VMware, Inc.
"name": "compute", "roles": [ "hadoop_tasktracker" ], "instanceNum": 8, "cpuNum": 2, "memCapacityMB": 7500, "storage": { "type": "LOCAL", "sizeGB": 20 }, "placementPolicies": { "instancePerHost": 2 } }, { "name": "client", "roles": [ "hadoop_client", "hive", "pig" ], "instanceNum": 1, "cpuNum": 1, "storage": { "type": "LOCAL", "sizeGB": 50 } } ], "configuration": { } }
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the cluster specification file.
cluster create --name cluster_name --specFile full_path/spec_filename

Create a Compute-Only Cluster with the Serengeti Command-Line Interface

You can create compute-only clusters to run MapReduce jobs on existing HDFS clusters, including storage solutions that serve as an external HDFS.
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the Hadoop cluster.
- To use any Hadoop distribution other than the provided Apache Hadoop, add one or more Hadoop distributions. See the VMware vSphere Big Data Extensions Administrator's and User's Guide.
Procedure
1 Create a cluster specification file that is modeled on the Serengeti compute_only_cluster.json sample cluster specification file found in the Serengeti cli/samples directory.
2 Add the following code to your new cluster specification file.
For HDFS clusters, set port_num to 8020. For Hadoop 2.0 clusters, such as CDH4 and Pivotal HD distributions, set port_num to 9000.
NOTE To create an Apache Bigtop, Cloudera CDH4 and CDH5, or Pivotal PHD 1.1 cluster, you must configure a valid DNS and FQDN for the cluster's HDFS and MapReduce traffic. Without valid DNS and FQDN settings, the cluster creation process might fail or the cluster is created but does not function.
In this example, the externalHDFS field points to an HDFS. Assign the hadoop_jobtracker role to the master node group and the hadoop_tasktracker role to the worker node group.
The externalHDFS field conflicts with node groups that have hadoop_namenode and hadoop_datanode roles. This conflict might cause the cluster creation to fail or, if successfully created, the cluster might not work correctly. To avoid this problem, define only a single HDFS.
{ "externalHDFS": "hdfs://hostname-of-namenode:port_num", "nodeGroups": [ { "name": "master", "roles": [ "hadoop_jobtracker" ], "instanceNum": 1, "cpuNum": 2, "memCapacityMB": 7500, }, { "name": "worker", "roles": [ "hadoop_tasktracker", ], "instanceNum": 4, "cpuNum": 2, "memCapacityMB": 7500, "storage": { "type": "LOCAL", "sizeGB": 20 }, }, { "name": "client", "roles": [ "hadoop_client", "hive", "pig" ], "instanceNum": 1, "cpuNum": 1, "storage": { "type": "LOCAL", "sizeGB": 50 }, }
28 VMware, Inc.
Chapter 3 Creating Hadoop and HBase Clusters
], “configuration” : { } }
3 Access the Serengeti CLI.
4 Run the cluster create command and include the cluster specification file parameter and associated filename.
cluster create --name name_computeOnlyCluster --specFile path/spec_file_name

Create a Basic Cluster with the Serengeti Command-Line Interface

You can create a basic cluster in your Serengeti environment. A basic cluster is a group of virtual machines provisioned and managed by Serengeti. Serengeti helps you to plan and provision the virtual machines to your specifications. You can use the basic cluster's virtual machines to install Big Data applications.
The basic cluster does not install the Big Data application packages used when creating a Hadoop or HBase cluster. Instead, you can install and manage Big Data applications with third party application management tools such as Apache Ambari or Cloudera Manager within your Big Data Extensions environment, and integrate them with your Hadoop software. The basic cluster does not deploy a Hadoop or HBase cluster. You must deploy software into the basic cluster's virtual machines using an external third party application management tool.
The Serengeti package includes an annotated sample cluster specification file that you can use as an example when you create your basic cluster specification file. In the Serengeti Management Server, the sample specification file is located at /opt/serengeti/samples/basic_cluster.json. You can modify the configuration values in the sample cluster specification file to meet your requirements. The only value you cannot change is the value assigned to the role for each node group, which must always be basic.
You can deploy a basic cluster with the Big Data Extension plug-in using a customized cluster specification file.
To deploy software within the basic cluster virtual machines, use the cluster list --detail command, or run serengeti-ssh.sh cluster_name to obtain the IP address of the virtual machine. You can then use the IP address with management applications such as Apache Ambari or Cloudera Manager to provision the virtual machine with software of your choosing. You can configure the management application to use the user name serengeti, and the password you specified when creating the basic cluster within Big Data Extensions when the management tool needs a user name and password to connect to the virtual machines.
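For example, assuming a basic cluster named myBasicCluster, either of the following returns the node IP addresses:
cluster list --name myBasicCluster --detail
serengeti-ssh.sh myBasicCluster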
Prerequisites
- Deploy the Serengeti vApp.
- Ensure that you have adequate resources allocated to run the cluster, as well as the Big Data software you intend to deploy.
Procedure
1 Create a specification file to define the basic cluster's characteristics.
You must use the basic role for each node group you define for the basic cluster.
{ "nodeGroups":[ { "name": "master", "roles": [ "basic" ], "instanceNum": 1,
VMware, Inc. 29
"cpuNum": 2, "memCapacityMB": 3768, "storage": { "type": "LOCAL", "sizeGB": 250 }, "haFlag": "on" }, { "name": "worker", "roles": [ "basic" ], "instanceNum": 1, "cpuNum": 2, "memCapacityMB": 3768, "storage": { "type": "LOCAL", "sizeGB": 250 }, "haFlag": "off" } ] }
2 Access the Serengeti CLI.
3 Run the cluster create command, and specify the basic cluster specification file.
cluster create --name cluster_name --specFile /opt/serengeti/samples/basic_cluster.json --password
NOTE When creating a basic cluster, you do not need to specify a Hadoop distribution type using the --distro option. The reason for this is that there is no Hadoop distribution being installed within the basic cluster to be managed by Serengeti.