Juniper HealthBot User Manual

HealthBot User Guide

Published

2021-02-17

Juniper Networks, Inc. 1133 Innovation Way Sunnyvale, California 94089 USA 408-745-2000 www.juniper.net

Juniper Networks, the Juniper Networks logo, Juniper, and Junos are registered trademarks of Juniper Networks, Inc. in the United States and other countries. All other trademarks, service marks, registered marks, or registered service marks are the property of their respective owners.

Juniper Networks assumes no responsibility for any inaccuracies in this document. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this publication without notice.

The information in this document is current as of the date on the title page.

YEAR 2000 NOTICE

Juniper Networks hardware and software products are Year 2000 compliant. Junos OS has no known time-related limitations through the year 2038. However, the NTP application is known to have some difficulty in the year 2036.

END USER LICENSE AGREEMENT

The Juniper Networks product that is the subject of this technical documentation consists of (or is intended for use with) Juniper Networks software. Use of such software is subject to the terms and conditions of the End User License Agreement (“EULA”) posted at https://support.juniper.net/support/eula/. By downloading, installing or using such software, you agree to the terms and conditions of that EULA.

About the Documentation | ix

Documentation and Release Notes | ix

Documentation Conventions | ix

Documentation Feedback | xii

Requesting Technical Support | xii

Self-Help Online Tools and Resources | xiii

Creating a Service Request with JTAC | xiii

Introduction to HealthBot

HealthBot Overview | 15

Benefits of HealthBot | 15

Closed-Loop Automation | 16

Main Components of HealthBot | 17

HealthBot Health Monitoring | 17

HealthBot Root Cause Analysis | 18

HealthBot Log File Analysis | 19

HealthBot Concepts | 20

HealthBot Data Collection Methods | 21

Data Collection - ’Push’ Model | 21

Data Collection - ’Pull’ Model | 22

HealthBot Topics | 22

HealthBot Rules - Basics | 23

HealthBot Rules - Deep Dive | 25

Rules | 25

Sensors | 28

Fields | 28

Vectors | 30

Variables | 30

Functions | 31

Triggers | 31

Tagging | 34

Rule Properties | 35

HealthBot Playbooks | 35

HealthBot Tagging | 36

Overview | 36

HealthBot Tagging Terminology | 37

How It Works | 41

Examples | 42

HealthBot Time Series Database (TSDB) | 48

Historical Context | 48

TSDB Improvements | 49

Database Sharding | 50

Database Replication | 51

Database Reads and Writes | 52

Manage TSDB Options in the HealthBot GUI | 53

HealthBot CLI Configuration Options | 55

HealthBot Machine Learning (ML) | 56

HealthBot Machine Learning Overview | 56

Understanding HealthBot Anomaly Detection | 57

Field | 57

Algorithm | 58

Learning period | 58

Pattern periodicity | 59

Understanding HealthBot Outlier Detection | 59

Dataset | 60

Algorithm | 60

Sigma coefficient (k-fold-3sigma only) | 62

Sensitivity | 62

Learning period | 62

Understanding HealthBot Predict | 62

Field | 63

Algorithm | 63

Learning period | 63

Pattern periodicity | 63

Prediction offset | 63

HealthBot Rule Examples | 64

HealthBot Anomaly Detection Example | 64

HealthBot Outlier Detection Example | 73

Frequency Profiles and Offset Time | 78

Frequency Profiles | 78

Configuration Using HealthBot GUI | 79

Configuration Using HealthBot CLI | 81

Apply a Frequency Profile Using the HealthBot GUI | 81

Apply a Frequency Profile Using the HealthBot CLI | 83

Offset Time Unit | 83

Offset Used in Formulas | 84

Offset Used in References | 85

Offset Used in Vectors | 86

Offset Used in Triggers | 88

Offset Used in Trigger Reference | 89

HealthBot Licensing | 91

HealthBot Licensing Overview | 91

Managing HealthBot Licenses | 94

Add a License to HealthBot | 94

View Licensing Status in HealthBot | 94

Management and Monitoring

Manage HealthBot Users and Groups | 98

User Management | 98

Group Management | 99

Limitations | 104

Manage Devices, Device Groups, and Network Groups | 104

Adding a Device | 106

Editing a Device | 109

Adding a Device Group | 109

Editing a Device Group | 113

Configuring a Retention Policy for the Time Series Database | 113

Adding a Network Group | 114

Editing a Network Group | 117

HealthBot Rules and Playbooks | 118

Add a Pre-Defined Rule | 119

Create a New Rule Using the HealthBot GUI | 119

Rule Filtering | 121

Sensors | 123

Fields | 125

Vectors | 128

Variables | 130

Functions | 131

Triggers | 133

Rule Properties | 136

Edit a Rule | 136

Add a Pre-Defined Playbook | 137

Create a New Playbook Using the HealthBot GUI | 138

Edit a Playbook | 139

Manage Playbook Instances | 140

View Information About Playbook Instances | 141

Create a Playbook Instance | 143

Manually Pause or Play a Playbook Instance | 145

Create a Schedule to Automatically Play/Pause a Playbook Instance | 146

Monitor Device and Network Health | 148

Dashboard | 149

Health | 152

Network Health | 165

Graph Page | 165

Alarms and Notifications | 181

Generate Alarm Notifications | 181

Manage Alarms Using Alarm Manager | 190

Stream Sensor and Field Data from HealthBot | 195

Generate Reports | 200

Configure a Secure Data Connection for HealthBot Devices | 216

Configure Security Profiles for SSL and SSH Authentication | 217

Configure Security Authentication for a Specific Device or Device Group | 218

Configure Data Summarization | 219

Creating a Data Summarization Profile | 220

Applying Data Summarization Profiles to a Device Group | 221

Modify the UDA and UDF Engines | 222

Overview | 222

How it Works | 223

Usage Notes | 224

Configuration | 225

SIMULATE | 225

MODIFY | 226

ROLLBACK | 227

Logs for HealthBot Services | 227

Configure Service Log Levels for a Device Group or Network Group | 228

Download Logs for HealthBot Services | 229

Troubleshooting | 230

HealthBot Self Test | 230

Overview | 230

Other Uses for the Self Test Tool | 231

Usage Notes | 231

How to Use the Self Test Tool | 232

Device Reachability Test | 232

Overview | 232

Usage Notes | 233

How to Use the Device Reachability Tool | 233

Ingest Connectivity Test | 234

Overview | 234

Usage Notes | 235

How to Use the Ingest Connectivity Tool | 235

Debug No-Data | 236

Overview | 236

Usage Notes | 237

How to Use the Debug No-Data Tool | 238

HealthBot Configuration – Backup and Restore | 240

Back Up the Configuration | 240

Restore the Configuration | 240

Backup or Restore the Time Series Database (TSDB) | 241

About the Documentation

IN THIS SECTION

Documentation and Release Notes | ix

Documentation Conventions | ix

Documentation Feedback | xii

Requesting Technical Support | xii

Use this guide to understand the features you can configure and the tasks you can perform from the HealthBot web UI.

Documentation and Release Notes

To obtain the most current version of all Juniper Networks® technical documentation, see the product documentation page on the Juniper Networks website at https://www.juniper.net/documentation/.

If the information in the latest release notes differs from the information in the documentation, follow the product Release Notes.

Juniper Networks Books publishes books by Juniper Networks engineers and subject matter experts. These books go beyond the technical documentation to explore the nuances of network architecture, deployment, and administration. The current list can be viewed at https://www.juniper.net/books.

Documentation Conventions

Table 1 on page x defines notice icons used in this guide.

Table 1: Notice Icons

Informational note: Indicates important features or instructions.

Caution: Indicates a situation that might result in loss of data or hardware damage.

Warning: Alerts you to the risk of personal injury or death.

Laser warning: Alerts you to the risk of personal injury from a laser.

Tip: Indicates helpful information.

Best practice: Alerts you to a recommended use or implementation.

Table 2 on page x defines the text and syntax conventions used in this guide.

Table 2: Text and Syntax Conventions

Convention: Bold text like this
Description: Represents text that you type.
Example: To enter configuration mode, type the configure command:
user@host> configure

Convention: Fixed-width text like this
Description: Represents output that appears on the terminal screen.
Example:
user@host> show chassis alarms
No alarms currently active

Convention: Italic text like this
Description: Introduces or emphasizes important new terms. Identifies guide names. Identifies RFC and Internet draft titles.
Examples:
• A policy term is a named structure that defines match conditions and actions.
• Junos OS CLI User Guide
• RFC 1997, BGP Communities Attribute

Convention: Italic text like this
Description: Represents variables (options for which you substitute a value) in commands or configuration statements.
Example: Configure the machine’s domain name:
[edit]
root@# set system domain-name domain-name

Convention: Text like this
Description: Represents names of configuration statements, commands, files, and directories; configuration hierarchy levels; or labels on routing platform components.
Examples:
• To configure a stub area, include the stub statement at the [edit protocols ospf area area-id] hierarchy level.
• The console port is labeled CONSOLE.

Convention: < > (angle brackets)
Description: Encloses optional keywords or variables.
Example: stub <default-metric metric>;

Convention: | (pipe symbol)
Description: Indicates a choice between the mutually exclusive keywords or variables on either side of the symbol. The set of choices is often enclosed in parentheses for clarity.
Examples:
broadcast | multicast
(string1 | string2 | string3)

Convention: # (pound sign)
Description: Indicates a comment specified on the same line as the configuration statement to which it applies.
Example: rsvp { # Required for dynamic MPLS only

Convention: [ ] (square brackets)
Description: Encloses a variable for which you can substitute one or more values.
Example: community name members [ community-ids ]

Convention: Indention and braces ( { } )
Description: Identifies a level in the configuration hierarchy.

Convention: ; (semicolon)
Description: Identifies a leaf statement at a configuration hierarchy level.

Example (indention, braces, and semicolon):
[edit]
routing-options {
    static {
        route default {
            nexthop address;
            retain;
        }
    }
}

GUI Conventions

Convention: Bold text like this
Description: Represents graphical user interface (GUI) items you click or select.
Examples:
• In the Logical Interfaces box, select All Interfaces.
• To cancel the configuration, click Cancel.

Convention: > (bold right angle bracket)
Description: Separates levels in a hierarchy of menu selections.
Example: In the configuration editor hierarchy, select Protocols>Ospf.

Documentation Feedback

We encourage you to provide feedback so that we can improve our documentation. You can use either of the following methods:

• Online feedback system: Click TechLibrary Feedback, on the lower right of any page on the Juniper Networks TechLibrary site, and do one of the following:
  • Click the thumbs-up icon if the information on the page was helpful to you.
  • Click the thumbs-down icon if the information on the page was not helpful to you or if you have suggestions for improvement, and use the pop-up form to provide feedback.
• E-mail: Send your comments to techpubs-comments@juniper.net. Include the document or topic name, URL or page number, and software version (if applicable).

Requesting Technical Support

Technical product support is available through the Juniper Networks Technical Assistance Center (JTAC). If you are a customer with an active Juniper Care or Partner Support Services support contract, or are covered under warranty, and need post-sales technical support, you can access our tools and resources online or open a case with JTAC.

• JTAC policies: For a complete understanding of our JTAC procedures and policies, review the JTAC User Guide located at https://www.juniper.net/us/en/local/pdf/resource-guides/7100059-en.pdf.
• Product warranties: For product warranty information, visit https://www.juniper.net/support/warranty/.
• JTAC hours of operation: The JTAC centers have resources available 24 hours a day, 7 days a week, 365 days a year.

Self-Help Online Tools and Resources

For quick and easy problem resolution, Juniper Networks has designed an online self-service portal called the Customer Support Center (CSC) that provides you with the following features:

• Find CSC offerings: https://www.juniper.net/customers/support/
• Search for known bugs: https://prsearch.juniper.net/
• Find product documentation: https://www.juniper.net/documentation/
• Find solutions and answer questions using our Knowledge Base: https://kb.juniper.net/
• Download the latest versions of software and review release notes: https://www.juniper.net/customers/csc/software/
• Search technical bulletins for relevant hardware and software notifications: https://kb.juniper.net/InfoCenter/
• Join and participate in the Juniper Networks Community Forum: https://www.juniper.net/company/communities/
• Create a service request online: https://myjuniper.juniper.net

To verify service entitlement by product serial number, use our Serial Number Entitlement (SNE) Tool: https://entitlementsearch.juniper.net/entitlementsearch/

Creating a Service Request with JTAC

You can create a service request with JTAC on the Web or by telephone.

• Visit https://myjuniper.juniper.net.
• Call 1-888-314-JTAC (1-888-314-5822 toll-free in the USA, Canada, and Mexico).

For international or direct-dial options in countries without toll-free numbers, see https://support.juniper.net/support/requesting-support/.

CHAPTER

Introduction to HealthBot

HealthBot Overview | 15

HealthBot Concepts | 20

HealthBot Tagging | 36

HealthBot Time Series Database (TSDB) | 48

HealthBot Machine Learning (ML) | 56

Frequency Profiles and Offset Time | 78

HealthBot Licensing | 91

HealthBot Overview

IN THIS SECTION

Benefits of HealthBot | 15

Closed-Loop Automation | 16

Main Components of HealthBot | 17

HealthBot is a highly automated and programmable device-level diagnostics and network analytics tool that provides consistent and coherent operational intelligence across network deployments. Integrated with multiple data collection methods (such as Junos Telemetry Interface, NETCONF, syslog, and SNMP), HealthBot aggregates and correlates large volumes of time-sensitive telemetry data, providing a multidimensional and predictive view of the network. Additionally, HealthBot translates troubleshooting, maintenance, and real-time analytics into an intuitive user experience to give network operators actionable insights into the health of an individual device and the overall network.

Benefits of HealthBot

• Customization: Provides a framework to define and customize health profiles, allowing truly actionable insights for the specific device or network being monitored.
• Automation: Automates root cause analysis and log file analysis, streamlines diagnostic workflows, and provides self-healing and remediation capabilities.
• Greater network visibility: Provides advanced multidimensional analytics across network elements, giving you a clearer understanding of network behavior to establish operational benchmarks, improve resource planning, and minimize service downtime.
• Intuitive graphical user interface: Offers an intuitive web-based GUI for policy management and easy data consumption.
• Open integration: Lowers the barrier to entry for telemetry and analytics by providing open source data pipelines, notification capabilities, and third-party device support.
• Multiple data collection methods: Includes support for Junos Telemetry Interface (JTI), NETCONF, syslog, NetFlow, and SNMP.

Closed-Loop Automation

HealthBot offers closed-loop automation. The automation workflow can be divided into seven main steps (see Figure 1 on page 17):

1. Define: HealthBot provides tools for the user to define the health parameters of key network elements through customizable key performance indicators (KPIs), rules, and playbooks.

2. Collect: HealthBot collects rule-based telemetry data from multiple devices using various types of data transfer methods.

3. Store: HealthBot stores time-sensitive telemetry data in a time-series database (TSDB). This allows users to query, perform operations on, and write new data back to the database days or even weeks after initial storage.

4. Analyze: HealthBot analyzes telemetry data based on customizable KPIs, rules, and playbooks.

5. Visualize: HealthBot provides multiple ways for you to visualize the aggregated telemetry data through the HealthBot web UI to gain actionable and predictive insight into the health of your devices and overall network.

6. Notify: HealthBot notifies you through the HealthBot web UI and alarm notifications when problems in the network are detected.

7. Act: HealthBot performs user-defined actions to help resolve and proactively prevent network problems.
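The seven steps above can be pictured as one pass through a simple loop. The following Python sketch is illustrative only: the names Rule and run_cycle, the in-memory list standing in for the TSDB, and the "green" status are assumptions made for this example and are not part of any HealthBot API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Rule:
    """Define: a hypothetical KPI rule with a static threshold."""
    kpi: str
    threshold: float


def run_cycle(
    rule: Rule,
    collect: Callable[[], Dict[str, float]],
    tsdb: List[Tuple[str, float]],
    notify: Callable[[str], None],
    act: Callable[[str], None],
) -> str:
    """Run one pass of the loop and return the resulting health status."""
    sample = collect()                                      # Collect
    value = sample[rule.kpi]
    tsdb.append((rule.kpi, value))                          # Store
    status = "red" if value > rule.threshold else "green"   # Analyze
    if status == "red":
        notify(f"{rule.kpi}={value} exceeds {rule.threshold}")  # Notify
        act(rule.kpi)                                       # Act
    return status                                           # Visualize: the caller renders it


if __name__ == "__main__":
    stored, messages, actions = [], [], []
    cpu_rule = Rule(kpi="cpu_util", threshold=80.0)
    print(run_cycle(cpu_rule, lambda: {"cpu_util": 91.5}, stored, messages.append, actions.append))
    print(messages)
```

Passing the collect, notify, and act steps in as callables mirrors the idea that each stage is user-definable; a real deployment configures these stages through rules and playbooks rather than code like this.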

Figure 1: HealthBot Closed-loop Automation Workflow

[Figure not reproduced. Labels: Define, Collect, Store, Analyze, Visualize, Notify, Act; Rule Engine; Playbook; KPI Health Monitoring; Root Cause Analysis; Log File Analysis; Time Series Database; API Server; MGD; Programmable Access (NETCONF, REST API); GUI Access; Third Party Provisioning / NMS; Data Collection Layer (NETCONF, OpenConfig, JTI, CLI, Syslog, SNMP, NetFlow); Telemetry Infrastructure; Python User Defined Functions/Actions; Notifications: Slack, Webhook, . . .]

Main Components of HealthBot

HealthBot consists of three main components:

• Health Monitoring: View an abstracted, hierarchical representation of device and network-level health, and define the health parameters of key network elements through customizable key performance indicators (KPIs), rules, and playbooks.
• Root Cause Analysis: Find the root cause of a device or network-level issue when HealthBot detects a problem with a network element.
• Log File Analysis: Analyze relevant system log messages by filtering out noise.

HealthBot Health Monitoring

The Challenge

With increasing data traffic generated by cloud-native applications and emerging technologies, service providers and enterprises need a network analytics solution to analyze volumes of telemetry data, offer insights into overall network health, and produce actionable intelligence. While telemetry-based techniques have existed for years, the growing number of protocols, data formats, and key performance indicators (KPIs) from diverse networking devices has made data analysis complex and costly. Traditional CLI-based interfaces require specialized skills to extract business value from telemetry data, creating a barrier to entry for network analytics.

The HealthBot Health Monitoring Solution

By aggregating and correlating raw telemetry data from multiple sources, the HealthBot health monitoring feature provides a multidimensional view of network health that reports current status, as well as projected threats to the infrastructure and its workloads.

Health status determination is tightly integrated with the HealthBot root cause analysis (RCA) application, which can make use of syslog log data received from the network and its devices. HealthBot health monitoring provides status indicators that alert you when, for example, a network resource is currently operating outside a user-defined performance policy, as well as risk analysis using historical trends to predict whether a resource may be unhealthy in the future. HealthBot health monitoring not only offers a fully customizable view of the current health of network elements, it also automatically initiates remedial actions based on predefined service level agreements (SLAs).

Defining the health of a network element, such as broadband network gateway (BNG), provider edge (PE), core, and leaf-spine, is highly contextual. Each element plays a different role in a network, with unique key performance indicators (KPIs) to monitor. Given that there is no single definition for network health across all use cases, HealthBot provides a highly customizable framework to allow you to define your own health profiles.

HealthBot Root Cause Analysis

The Challenge

In some cases, it can be challenging for a network operator to figure out what causes a Junos OS networking device to stop working properly. When this happens, the typical workflow to find the root cause of the network problem involves contacting a specialist from Juniper Networks, who would then troubleshoot and triage the unhealthy component based on knowledge built from years of experience. After completing this time-intensive assessment, the problem would then be reassigned to the relevant engineering team.

The HealthBot RCA Solution

The purpose of the HealthBot root cause analysis (RCA) application is to simplify the process of finding the root cause of a network issue. HealthBot RCA captures the troubleshooting knowledge of, for example, the Juniper Networks specialists as part of a knowledge base in the form of HealthBot rules. These rules are evaluated either on demand by a specific trigger or periodically in the background to ascertain the health of a networking component, such as routing protocol, system, interface, or chassis, on the device.

To illustrate the benefits of HealthBot RCA, let us consider the problem of OSPF flapping.

Figure 2 on page 19 highlights the workflow sequence involved in debugging OSPF flapping. If a network operator or Juniper Networks specialist were assigned this troubleshooting task, he or she would need to perform manual debugging steps for each tile of the workflow sequence in order to find the root cause of the OSPF flapping. The HealthBot RCA application, on the other hand, delivers this expert service to you automatically as a bot. The RCA bot tracks all of the telemetry data collected by HealthBot and translates the information into graphical status indicators (displayed in the HealthBot web UI) that correlate to different parts of the workflow sequence shown in Figure 2 on page 19.

Figure 2: High-level workflow to debug OSPF-flapping

[Figure not reproduced. Section labels: Control Plane, Host Path, Data Plane. Tiles: RPD – OSPF, RPD – Infra, Kernel – System, Chassis, Interfaces, PFE – Interface, PPM – OSPF, PFE – Ukern, PFE – Host Path, RE – Host Path, XM – ASIC, LU – ASIC, PFE – System, PFE – jnh.]

When configuring HealthBot, each tile of the workflow sequence shown in Figure 2 on page 19 can be defined by one or more rules. For example, the RPD-OSPF tile could be defined as two rule conditions: one to check if "hello-transmitted" counters are incrementing and the other to check if "hello-received" counters are incrementing. Based on these user-defined rules, HealthBot provides status indicators, alarm notifications, and an alarm management tool through the web UI to inform and alert you of specific network conditions that could, for example, lead to OSPF flapping. By isolating a problem area in the workflow, HealthBot RCA proactively guides you in determining the appropriate corrective action to take to fix a pending issue or avoid a potential one.
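As a rough illustration of the two rule conditions described for the RPD-OSPF tile, the Python sketch below compares two successive counter snapshots. It is not HealthBot rule syntax: the function names, the dictionary layout, and the "green" status are assumptions made for this example.

```python
from typing import Dict


def counter_incrementing(previous: int, current: int) -> bool:
    """Return True when a counter grew between two snapshots."""
    return current > previous


def rpd_ospf_tile_status(prev: Dict[str, int], curr: Dict[str, int]) -> str:
    """Combine the two conditions from the text into one tile status.

    Both "hello-transmitted" and "hello-received" must be incrementing
    for the tile to be considered healthy in this sketch.
    """
    tx_ok = counter_incrementing(prev["hello-transmitted"], curr["hello-transmitted"])
    rx_ok = counter_incrementing(prev["hello-received"], curr["hello-received"])
    return "green" if tx_ok and rx_ok else "red"


if __name__ == "__main__":
    earlier = {"hello-transmitted": 100, "hello-received": 98}
    later = {"hello-transmitted": 110, "hello-received": 98}  # no hellos received
    print(rpd_ospf_tile_status(earlier, later))
```

In this example the device keeps sending hellos but stops receiving them, so the tile turns red, which points the operator toward the receive path rather than the whole protocol stack.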

HealthBot Log File Analysis

The Challenge

Networking devices can generate a large number of log messages. Some of these messages are arcane, and others create noise and clutter that drown out the more significant, meaningful messages. Network operators need an easy way to sort through and organize all of these log messages, as well as make sense of the information in order to take action, if necessary.

The HealthBot Log File Analysis Solution

Fully integrated with the HealthBot health monitoring and RCA features, HealthBot log file analysis can be implemented with the use of log patterns and pattern sets within the syslog ingest settings. The pattern sets can be applied to rules to automatically filter out unnecessary log messages and help highlight only the relevant, actionable messages. HealthBot log file analysis consists of two main components:

1. An ingest engine that lets HealthBot receive syslog messages from networks and devices.

2. Pre-defined and customizable search patterns and pattern sets that can be applied to rules.
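The idea of applying a pattern set to incoming syslog messages can be sketched with regular expressions. The Python example below is only a conceptual illustration: the pattern names, the regular expressions, the sample messages, and the filter_messages helper are invented for this sketch and do not reflect how HealthBot stores or evaluates patterns.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical pattern set: pattern name -> compiled regular expression.
PATTERN_SET: Dict[str, Pattern[str]] = {
    "ospf-neighbor-down": re.compile(r"RPD_OSPF_NBRDOWN"),
    "link-down": re.compile(r"SNMP_TRAP_LINK_DOWN"),
}


def filter_messages(
    messages: List[str], pattern_set: Dict[str, Pattern[str]]
) -> List[Tuple[str, str]]:
    """Keep only messages that match a pattern; drop the rest as noise."""
    hits = []
    for message in messages:
        for name, pattern in pattern_set.items():
            if pattern.search(message):
                hits.append((name, message))
                break  # first matching pattern wins in this sketch
    return hits


if __name__ == "__main__":
    sample = [
        "sshd[99]: Accepted password for user",
        "rpd[1234]: RPD_OSPF_NBRDOWN: OSPF neighbor 10.0.0.2 state changed from Full to Down",
        "mib2d[55]: SNMP_TRAP_LINK_DOWN: ifIndex 512, ifName ge-0/0/0",
    ]
    for name, message in filter_messages(sample, PATTERN_SET):
        print(name, "->", message)
```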

See Syslog Ingest for more information about syslog ingest.

HealthBot Concepts

IN THIS SECTION

HealthBot Data Collection Methods | 21

HealthBot Topics | 22

HealthBot Rules - Basics | 23

HealthBot Rules - Deep Dive | 25

HealthBot Playbooks | 35

HealthBot is a highly programmable telemetry-based analytics application. With it, you can diagnose and root cause network issues, detect network anomalies, predict potential network issues, and create real-time remedies for any issues that come up.

To accomplish this, network devices and HealthBot have to be configured to send and receive large amounts of data, respectively. Device configuration is covered throughout this and other sections of the guide.

Configuring HealthBot, or any application, to read and react to incoming telemetry data requires a language that describes several elements that are specific to the systems and data under analysis. This type of language is called a Domain Specific Language (DSL), i.e., a language that is specific to one domain. Any DSL is built to help answer questions. For HealthBot, these questions are:

• Q: What components make up the systems that are sending data?
  A: Network devices are made up of memory, CPU, interfaces, protocols, and so on. In HealthBot, these are called topics (see “HealthBot Topics” on page 22).
• Q: How do we gather, filter, process, and analyze all of this incoming telemetry data?
  A: HealthBot uses rules (see “HealthBot Rules - Basics” on page 23) that consist of information blocks called sensors, fields, variables, triggers, and more.
• Q: How do we determine what to look for?
  A: It depends on the problem you want to solve or the question you want to answer. HealthBot uses playbooks (see “HealthBot Playbooks” on page 35) to create collections of specific rules and apply them to specific groups of devices in order to accomplish specific goals. For example, part of the system-kpis-playbook can alert a user when system memory usage crosses a user-defined threshold.

This section covers these key concepts and more, which you need to understand before using HealthBot.

HealthBot Data Collection Methods

In order to provide visibility into the state of your network devices, HealthBot first needs to collect their telemetry data and other status information. It does this using sensors.

HealthBot supports sensors that “push” data from the device to HealthBot and sensors that require HealthBot to “pull” data from the device using periodic polling.

Data Collection - ’Push’ Model

As the number of objects in the network, and the metrics they generate, have grown, gathering operational statistics for monitoring the health of a network has become an ever-increasing challenge. Traditional ’pull’ data-gathering models, like SNMP and the CLI, require additional processing to periodically poll the network element, and can directly limit scaling.

The ’push’ model overcomes these limits by delivering data asynchronously, which eliminates polling. With this model, the HealthBot server can make a single request to a network device to stream periodic updates. As a result, the ’push’ model is highly scalable and can support the monitoring of thousands of objects in a network. Junos devices support this model in the form of the Junos Telemetry Interface (JTI).

HealthBot currently supports four ‘push’ ingest types:

• Native GPB
• NetFlow
• OpenConfig
• Syslog

These push-model data collection—or ingest—methods are explained in detail in the HealthBot Data Ingest Guide.

Data Collection - ’Pull’ Model

While the ’push’ model is the preferred approach for its efficiency and scalability, there are still cases where the ’pull’ data collection model is appropriate. With the ’pull’ model, HealthBot requests data from network devices at periodic intervals.

HealthBot currently supports two ‘pull’ ingest types:

• iAgent (CLI/NETCONF)
• SNMP

These pull-model data collection—or ingest—methods are explained in detail in the HealthBot Data Ingest Guide.
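The difference between the two models comes down to who initiates each transfer. The Python sketch below contrasts them using a stand-in device object. FakeDevice, poll, and subscribe are hypothetical names invented for this illustration; they do not correspond to any HealthBot, JTI, or SNMP API, and real ingest involves timers and network transport that are omitted here.

```python
from typing import Dict, Iterator, List


class FakeDevice:
    """Stand-in for a network device that counts the requests it receives."""

    def __init__(self) -> None:
        self.requests = 0
        self._value = 0.0

    def poll(self) -> Dict[str, float]:
        """Pull model: each sample costs the collector one request."""
        self.requests += 1
        self._value += 1.0
        return {"cpu": self._value}

    def subscribe(self, updates: int) -> Iterator[Dict[str, float]]:
        """Push model: one request, then the device streams the updates."""
        self.requests += 1
        for _ in range(updates):
            self._value += 1.0
            yield {"cpu": self._value}


def collect_by_pull(device: FakeDevice, samples: int) -> List[Dict[str, float]]:
    return [device.poll() for _ in range(samples)]


def collect_by_push(device: FakeDevice, samples: int) -> List[Dict[str, float]]:
    return list(device.subscribe(samples))


if __name__ == "__main__":
    pulled_from, pushed_from = FakeDevice(), FakeDevice()
    collect_by_pull(pulled_from, 5)
    collect_by_push(pushed_from, 5)
    print("pull requests:", pulled_from.requests)
    print("push requests:", pushed_from.requests)
```

Both collectors end up with the same samples, but the pull collector issues one request per sample while the push collector issues a single subscription, which is why the push model scales better as the number of monitored objects grows.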

HealthBot Topics

Network devices are made up of a number of components and systems from CPUs and memory to interfaces and protocol stacks and more. In HealthBot, a topic is the construct used to address those different device components. The Topic block is used to create name spaces that define what needs to be modeled. Each Topic block is made up of one or more Rule blocks which, in turn, consist of the Field blocks, Function blocks, Trigger blocks, etc. See “HealthBot Rules - Deep Dive” on page 25 for details. Each rule created in HealthBot must be part of a topic. Juniper has curated a number of these system components into a list of Topics such as:

• chassis
• class-of-service
• external
• firewall
• interfaces
• kernel
• linecard
• logical-systems
• protocol
• routing-options
• security
• service
• system

You can create sub-topics underneath any of the Juniper topic names by appending .<sub-topic> to the topic name. For example, kernel.tcpip or system.cpu.

Any pre-defined rules provided by Juniper fit within one of the Juniper topics, with the exception of external. The external topic is reserved for user-created rules. In the HealthBot web GUI, when you create a new rule, the Topics field is automatically populated with the external topic name.
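The topic.sub-topic naming convention can be illustrated with a small helper that splits a name into its parts. This Python sketch is an assumption-laden illustration: split_topic is a hypothetical function, and treating the listed topic names as the complete set of valid base topics is a simplification for the example, not a documented HealthBot validation rule.

```python
from typing import Optional, Tuple

# Topic names listed in this guide; assumed complete for this sketch only.
JUNIPER_TOPICS = (
    "chassis", "class-of-service", "external", "firewall", "interfaces",
    "kernel", "linecard", "logical-systems", "protocol", "routing-options",
    "security", "service", "system",
)


def split_topic(name: str) -> Tuple[str, Optional[str]]:
    """Split 'topic.sub-topic' into (topic, sub-topic or None)."""
    base, _, sub = name.partition(".")
    if base not in JUNIPER_TOPICS:
        raise ValueError(f"unknown topic: {base!r}")
    return base, (sub or None)


if __name__ == "__main__":
    print(split_topic("kernel.tcpip"))
    print(split_topic("system.cpu"))
    print(split_topic("external"))
```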

HealthBot Rules - Basics

HealthBot’s primary function is collecting and reacting to telemetry data from network devices. Defining how to collect the data, and how to react to it, is the role of a rule.

HealthBot ships with a set of default rules, which can be seen on the Configuration > Rules page of the HealthBot GUI, as well as in GitHub in the healthbot-rules repository. You can also create your own rules.

The structure of a HealthBot rule looks like this:

To keep rules organized, HealthBot organizes them into topics. Topics can be very general, like system, or they can be more granular, like protocol.bgp. Each topic contains one or more rules.

As described above, a rule contains all the details and instructions to define how to collect and handle the data. Each rule contains the following required elements:

The sensor defines the parameters for collecting the data. This typically includes which data collection

•

method to use (as discussed above in “HealthBot Data Collection Methods” on page 21), some guidance on which data to ingest, and how often to push or pull the data. In any given rule, a sensor can be defined directly within the rule or it can be referenced from another rule.

Example: Using the SNMP sensor, poll the network device every 60 seconds to collect all the device

•

data in the Juniper SNMP MIB table jnxOperatingTable.

The sensor typically ingests a large set of data, so fields provide a way to filter or manipulate that data,

•

allowing you to identify and isolate the specific pieces of information you care about. Fields can also act as placeholder values, like a static threshold value, to help the system perform data analysis.

Example: Extract, isolate, and store the jnxOperating15MinLoadAvg (CPU 15-minute average utilization)

•

value from the SNMP table specified above in the sensor.

Triggers periodically bring together the fields with other elements to compare data and determine current

•

device status. A trigger includes one or more ’when-then’ statements, which include the parameters that define how device status is visualized on the health pages.

Example: Every 90 seconds, check the CPU 15min average utilization value, and if it goes above a

•

defined threshold, set the device’s status to red on the device health page and display a message showing the current value.

The rule can also contain the following optional elements:

• Vectors allow you to leverage existing elements to avoid the need to repeatedly configure the same elements across multiple rules.

  • Examples: A rule with a configured sensor, plus a vector to a second sensor from another rule; a rule with no sensors, and vectors to fields from other rules.

• Variables can be used to provide additional supporting parameters needed by the required elements above.

  • Examples: The string “ge-0/0/0”, used within a field collecting status for all interfaces, to filter the data down to just the one interface; an integer, such as “80”, referenced in a field to use as a static threshold value.

• Functions allow you to provide instructions (in the form of a Python script) on how to further interact with data, and how to react to certain events.

  • Examples: A rule that monitors input and output packet counts, using a function to compare the count values; a rule that monitors system storage, invoking a function to clean up temp and log files if storage utilization goes above a defined threshold.
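As a rough illustration of the first Functions example, the Python sketch below compares input and output packet counts. The function name, its arguments, and the 1 percent tolerance are hypothetical; how HealthBot passes field values to a function is not shown in this section and is not modeled here.

```python
def packet_count_mismatch(in_packets: int, out_packets: int,
                          tolerance: float = 0.01) -> bool:
    """Return True when the two counts differ by more than `tolerance`,
    expressed as a fraction of the larger count."""
    largest = max(in_packets, out_packets)
    if largest == 0:
        # No traffic in either direction, so there is nothing to compare.
        return False
    return abs(in_packets - out_packets) / largest > tolerance


print(packet_count_mismatch(1000, 900))   # a 10 percent difference
print(packet_count_mismatch(1000, 995))   # a 0.5 percent difference
```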

NOTE: Rules, on their own, don’t actually do anything. To make use of rules, you need to add them to “HealthBot Playbooks” on page 35.

HealthBot Rules - Deep Dive

IN THIS SECTION

Rules | 25

Sensors | 28

Fields | 28

Vectors | 30

Variables | 30

Functions | 31

Triggers | 31

Tagging | 34

Rule Properties | 35

A rule is a package of components, or blocks, needed to extract specific information from the network or from a Junos device. Rules conform to a domain-specific language (DSL) tailored for analytics applications. The DSL is designed to allow rules to capture:

• The minimum set of input data that the rule needs to be able to operate

• The minimum set of telemetry sensors that need to be configured on the device(s)

• The fields of interest from the configured sensors

• The reporting or polling frequency

• The set of triggers that operate on the collected data

• The conditions or evaluations needed for triggers to kick in

• The actions or notifications that need to be performed when a trigger kicks in

The details around rules, topics and playbooks are presented in the following sections.

Rules

Rules are meant to be free of any hard coding. Consider threshold values: if a threshold is hard coded, there is no easy way to customize it for a different customer or device that has different requirements. Therefore, rules are defined using parameterization to set the default values. This allows the parameters to be left at default or be customized by the operator at the time of deployment. Customization can be done at the device group or individual device level while applying the HealthBot Playbooks on page 35 in which the individual rules are contained.
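The idea of parameterized defaults with per-deployment overrides can be sketched in Python as a layered lookup. This is a minimal sketch, not HealthBot's implementation: the variable names are hypothetical, and the assumption that a device-level value takes precedence over a device-group value is not stated in the text.

```python
from collections import ChainMap

# Defaults set in the rule itself (hypothetical variable names).
rule_defaults = {"cpu-threshold": 80, "interface-name": "ge-0/0/0"}

# Values supplied by the operator when the playbook is applied.
device_group_overrides = {"cpu-threshold": 70}
device_overrides = {"interface-name": "ge-0/0/1"}


def resolve_variables(defaults, group_overrides=None, device_level=None):
    """Merge the layers; earlier maps in the ChainMap win.
    Assumed precedence: device, then device group, then rule default."""
    return dict(ChainMap(device_level or {}, group_overrides or {}, defaults))


effective = resolve_variables(rule_defaults, device_group_overrides, device_overrides)
print(effective)
```

Leaving both override layers empty returns the rule defaults unchanged, which matches the case where the operator deploys the rule as shipped.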

Rules that are device-centric are called device rules. Device components such as chassis, system, linecards, and interfaces are all addressed as HealthBot Topics on page 22 in the rule definition. Generally, device rules make use of sensors on the devices.

Rules that span multiple devices are called network rules. Network rules:

• must have a rule-frequency configured

• must not contain sensors

• cannot be mixed with device rules in a playbook
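The three network rule constraints above can be expressed as simple checks. The Python sketch below assumes a hypothetical dictionary representation of a rule, with the keys kind, rule-frequency, and sensors; it does not reflect how HealthBot itself stores or validates rules.

```python
from typing import Dict, List


def network_rule_errors(rule: Dict) -> List[str]:
    """Return the constraints that a network rule violates (empty if none)."""
    errors = []
    if rule.get("rule-frequency") is None:
        errors.append("network rule must have a rule-frequency configured")
    if rule.get("sensors"):
        errors.append("network rule must not contain sensors")
    return errors


def playbook_mixes_rule_types(rules: List[Dict]) -> bool:
    """True when a playbook contains both device rules and network rules."""
    kinds = {rule["kind"] for rule in rules}
    return {"device", "network"} <= kinds


good = {"kind": "network", "rule-frequency": 60, "sensors": []}
bad = {"kind": "network", "sensors": ["interfaces"]}
device = {"kind": "device", "sensors": ["interfaces"]}

print(network_rule_errors(good))
print(network_rule_errors(bad))
print(playbook_mixes_rule_types([good, device]))
```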

To deploy either type of rule, include the rule in a playbook and then apply the playbook to a device group or network group.

NOTE: HealthBot comes with a set of pre-defined rules.

Not all of the blocks that make up a rule are required for every rule. Whether or not a specific block is required in a rule definition depends on what sort of information you are trying to get to. Additionally, some rule components are not valid for network rules. Table 3 on page 26 lists the components of a rule and provides a brief description of each one.

Table 3: Rule Components

Block: Sensors (see “Sensors” on page 28)
What it Does: The Sensors block is like the access method for getting at the data. There are multiple types of sensors available in HealthBot: OpenConfig, Native GPB, iAgent, SNMP, and syslog. It defines what sensors need to be active on the device in order to get to the data fields on which the triggers eventually operate. Sensor names are referenced by the Fields. OpenConfig and iAgent sensors require that a frequency be set for push interval or polling interval respectively. SNMP sensors also require you to set a frequency.
Required in Device Rules? No. Rules can be created that only use a field reference from another rule or a vector with references from another rule. In these cases, rule-frequency must be explicitly defined.
Valid for Network Rules? No

Block: Fields (see Fields on page 28)
What it Does: The source for the Fields block can be a pointer to a sensor, a reference to a field defined in another rule, a constant, or a formula. The field can be a string, integer or floating point. The default field type is string.
Required in Device Rules? Yes. Fields contain the data on which the triggers operate. Starting in HealthBot release 3.1.0, regular fields and key-fields can be added to rules based on conditional tagging profiles. See the “Tagging” on page 34 section below.
Valid for Network Rules? Yes

Block: Vectors (see “Vectors” on page 30)
What it Does: The Vectors block allows handling of lists, creating sets, and comparing elements amongst different sets. A vector is used to hold multiple values from one or more fields.
Required in Device Rules? No
Valid for Network Rules? Yes

Block: Variables (see “Variables” on page 30)
What it Does: The Variables block allows you to pass values into rules. Invariant rule definitions are achieved through mustache-style templating like {{<placeholder-variable>}}. The placeholder-variable value is set in the rule by default or can be user-defined at deployment time.
Required in Device Rules? No
Valid for Network Rules? No

Block: Functions (see “Functions” on page 31)
What it Does: The Functions block allows you to extend fields, triggers, and actions by creating prototype methods in external files written in languages like Python. The Functions block includes details on the file path, method to be accessed, and any arguments, including argument description and whether it is mandatory.
Required in Device Rules? No
Valid for Network Rules? No

Block: Triggers (see “Triggers” on page 31)
What it Does: The Triggers block operates on fields and is defined by one or more Terms. When the conditions of a Term are met, then the action defined in the Term is taken. By default, triggers are evaluated every 10 seconds, unless explicitly configured for a different frequency. By default, all triggers defined in a rule are evaluated in parallel.
Required in Device Rules? Yes. Triggers enable rules to take action.
Valid for Network Rules? Yes

Block: Rule Properties (see “Rule Properties” on page 35)
What it Does: The Rule Properties block allows you to specify metadata for a HealthBot rule, such as hardware dependencies, software dependencies, and version history.
Required in Device Rules? No
Valid for Network Rules? Yes
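Table 3 describes the Variables block as relying on mustache-style templating of the form {{<placeholder-variable>}}. The Python sketch below shows the general idea of that kind of substitution. It is not HealthBot's template engine, and the template string and variable names are hypothetical.

```python
import re

# Matches {{name}} and tolerates spaces inside the braces, as in {{ name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w-]+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Replace each {{placeholder}} with its value; fail loudly if one is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for variable {name!r}")
        return str(values[name])
    return PLACEHOLDER.sub(substitute, template)


condition = "cpu-load greater-than {{cpu-threshold}} on {{ interface-name }}"
print(render(condition, {"cpu-threshold": 80, "interface-name": "ge-0/0/0"}))
```

Because the rule text only ever contains the placeholder, the same rule definition can be deployed unchanged with different values, which is the point the table makes about invariant rule definitions.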

Sensors

When defining a sensor, you must specify information such as sensor name, sensor type and data collection frequency. As mentioned in Table 3 on page 26, sensors can be one of the following:

• OpenConfig: For information on OpenConfig JTI sensors, see the Junos Telemetry Interface User Guide.

• Native GPB: For information on Native GPB JTI sensors, see the Junos Telemetry Interface User Guide.

• iAgent: The iAgent sensors use NETCONF and YAML-based PyEZ tables and views to fetch the necessary data. Both structured (XML) and unstructured (VTY commands and CLI output) data are supported. For information on Junos PyEZ, see the Junos PyEZ Documentation.

• SNMP: Simple Network Management Protocol.

• syslog: System log.

• BYOI: Bring your own ingest. Allows you to define your own ingest types.

• Flow: NetFlow traffic flow analysis protocol.

• sFlow: sFlow packet sampling protocol.

When different rules have the same sensor defined, only one subscription is made per sensor. A key is used to identify the associated rule: the sensor-path for OpenConfig and Native GPB sensors, and the tuple of file and table for iAgent sensors.

When multiple sensors with the same sensor-path key have different frequencies defined, the lowest frequency is chosen for the sensor subscription.
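The subscription behavior described above can be sketched as grouping sensors by key and keeping one frequency per key. This Python sketch makes two assumptions that the text does not confirm: the frequency is a number of seconds, and “lowest frequency” means the smallest configured value. The dictionary keys and sensor type strings are hypothetical.

```python
from typing import Dict, Hashable, Iterable


def plan_subscriptions(sensors: Iterable[dict]) -> Dict[Hashable, int]:
    """Return one entry per sensor key, mapped to the frequency to subscribe with."""
    plan: Dict[Hashable, int] = {}
    for sensor in sensors:
        if sensor["type"] in ("open-config", "native-gpb"):
            key = sensor["sensor-path"]
        elif sensor["type"] == "iagent":
            key = (sensor["file"], sensor["table"])
        else:
            continue  # the text gives no key for other sensor types
        frequency = sensor["frequency"]
        # Assumption: keep the smallest configured value for a shared key.
        plan[key] = min(plan.get(key, frequency), frequency)
    return plan


sensors = [
    {"type": "open-config", "sensor-path": "/interfaces", "frequency": 30},
    {"type": "open-config", "sensor-path": "/interfaces", "frequency": 10},
    {"type": "iagent", "file": "system.yml", "table": "CpuTable", "frequency": 60},
]
print(plan_subscriptions(sensors))
```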

Fields

There are four types of field sources, as listed in Table 3 on page 26. Table 4 on page 29 describes the four field ingest types in more detail.

Table 4: Field Ingest Type Details

Field Type: Sensor
Details: Subscribing to a sensor typically provides access to multiple columns of data. For instance, subscribing to the OpenConfig interface sensor provides access to a range of information, including counter-related information such as:
/interfaces/counters/tx-bytes,
/interfaces/counters/rx-bytes,
/interfaces/counters/tx-packets,
/interfaces/counters/rx-packets,
/interfaces/counters/oper-state, etc.
Given the rather long names of paths in OpenConfig sensors, the Sensor definition within Fields allows for aliasing and filtering. For single-sensor rules, the required set of Sensors for the Fields table is programmatically auto-imported from the raw table based on the triggers defined in the rule.

Field Type: Reference
Details: Triggers can only operate on Fields defined within that rule. In some cases, a Field might need to reference another Field or Trigger output defined in another Rule. This is achieved by referencing the other field or trigger and applying additional filters. The referenced field or trigger is treated as a stream notification to the referencing field. References aren’t supported within the same rule.
References can also take a time-range option which picks the value, if available, from the time-range provided. Field references must always be unambiguous, so proper attention must be given to filtering the result to get just one value. If a reference receives multiple data points, or values, only the latest one is used. For example, if you are referencing the values contained in a field over the last 3 minutes, you might end up with 6 values in that field over that time-range. HealthBot only uses the latest value in a situation like this.

Field Type: Constant
Details: A field defined as a constant is a fixed value which cannot be altered during the course of execution. HealthBot Constant types can be strings, integers, and doubles.

Field Type: Formula
Details: Raw sensor fields are the starting point for defining triggers. However, Triggers often work on derived fields defined through formulas by applying mathematical transformations.
Formulas can be pre-defined or user-defined (UDF). Pre-defined formulas include: Min, Max, Mean, Sum, Count, Rate of Change, Elapsed Time, Standard Deviation, Microburst, Dynamic Threshold, Anomaly Detection, Outlier Detection, and Predict.
Some pre-defined formulas can operate on time ranges in order to work with historical data. If a time range is not specified, then the formula works on current data, specified as now.
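The rule that a reference with a time-range uses only the latest value can be sketched in Python as follows. This is a minimal illustration of the selection logic described for Reference fields; the function name and the (timestamp, value) representation are hypothetical.

```python
from typing import List, Optional, Tuple


def resolve_reference(points: List[Tuple[float, float]],
                      now: float,
                      time_range_s: float) -> Optional[float]:
    """points are (timestamp, value) pairs. Return the latest value inside
    the window [now - time_range_s, now], or None if the window is empty."""
    in_window = [(ts, value) for ts, value in points
                 if now - time_range_s <= ts <= now]
    if not in_window:
        return None
    return max(in_window, key=lambda point: point[0])[1]


# Six values collected over the last 3 minutes, one every 30 seconds.
points = [(float(ts), float(i)) for i, ts in enumerate(range(0, 180, 30))]
latest = resolve_reference(points, now=180.0, time_range_s=180.0)
print(latest)
```

In this example all six values fall inside the 3-minute window, and only the one with the most recent timestamp is returned.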

Vectors

Vectors are useful in helping to gather multiple elements into a single rule. For example, using a vector you could gather all of the interface error fields. The syntax for Vector is:

    vector <vector-name> {
        path [$field-1 $field-2 .. $field-n];
        filter <list of specific element(s) to filter out from vector>;
        append <list of specific element(s) to be added to vector>;
    }

$field-n can be a field of type reference.

The fields used in defining vectors can be direct references to fields defined in other rules:

    vector <vector-name> {
        path [/device-group[device-group-name=<device-group>]\
              /device[device-name=<device>]/topic[topic-name=<topic>]\
              /rule[rule-name=<rule>]/field[<field-name>=<field-value>\
              AND|OR ...]/<field-name> ...];
        filter <list of specific element(s) to filter out from vector>;
        append <list of specific element(s) to be added to vector>;
    }

This syntax allows for optional filtering through the <field-name>=<field-value> portion of the construct. Vectors can also take a time-range option that picks the values from the time-range provided. When multiple values are returned over the given time-range, they are all selected as an array.

The following pre-defined formulas are supported on vectors:

• unique @vector1: Returns the unique set of elements from vector1.

• @vector1 and @vector2: Returns the intersection of unique elements in vector1 and vector2.

• @vector1 or @vector2: Returns the total set of unique elements in the two vectors.

• @vector1 unless @vector2: Returns the unique set of elements in vector1, but not in vector2.
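The four vector formulas correspond to familiar set operations. The Python sketch below shows equivalent logic on plain lists; it is an analogy, not HealthBot syntax. Preserving first-seen order is an assumption, since the text does not say how results are ordered.

```python
def unique(vector):
    """unique @vector1: drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(vector))


def vector_and(v1, v2):
    """@vector1 and @vector2: unique elements present in both vectors."""
    members = set(v2)
    return [item for item in unique(v1) if item in members]


def vector_or(v1, v2):
    """@vector1 or @vector2: unique elements present in either vector."""
    return unique(list(v1) + list(v2))


def vector_unless(v1, v2):
    """@vector1 unless @vector2: unique elements of vector1 not in vector2."""
    excluded = set(v2)
    return [item for item in unique(v1) if item not in excluded]


vector1 = ["ge-0/0/0", "ge-0/0/1", "ge-0/0/0"]
vector2 = ["ge-0/0/1", "ge-0/0/2"]
print(unique(vector1))
print(vector_and(vector1, vector2))
print(vector_or(vector1, vector2))
print(vector_unless(vector1, vector2))
```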

Variables

Variables are defined during rule creation on the Variables page. This part of variable definition creates the default value that gets used if no specific value is set in the device group or on the device during
