The Insecure Wire

a network engineer's perspective

Using Python Paramiko to automate commands on Palo Alto PAN-OS

So Palo Alto TAC recently confirmed to me that PAN-OS 9.0.3 has a bug wherein the proxydnsd service will max out the management CPU even if you're not using proxy DNS.

A fix is slated for November's 9.0.5 release, but I would rather not have the management CPU constantly maxed in the meantime. So to work around this problem I created a Python script that uses the Paramiko library for SSH connectivity, which lets you automate CLI commands from Python. Since the command to restart the proxydnsd service is a debug command, you can't use the PAN-OS API; it has to be done from the CLI.

I run this script using Python 2.7 on an Ubuntu Linux VM. You will need to pip install paramiko (the time module is part of the standard library, so nothing extra is needed for it).

#
# Log into PAN firewall via SSH and restart the DNS proxy process,
# which is causing the mgmt CPU spike on PAN-OS 9.0.3
# Requires Python 2.x
# Set up a crontab schedule to execute it automatically
#
import paramiko
import time

USERNAME = 'username'
PASSWORD = 'password'
HOSTNAME = '192.168.0.1'  # Firewall's management IP
PORT = 22

def ssh_command(username=USERNAME, password=PASSWORD, hostname=HOSTNAME, port=PORT):
    ssh_client = paramiko.SSHClient()
    ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh_client.load_system_host_keys()
    ssh_client.connect(hostname, port, username, password)
    remote_conn = ssh_client.invoke_shell()
    # print("Interactive SSH session established")
    remote_conn.send("debug software restart process dnsproxy\n")
    time.sleep(8)  # give the firewall time to restart the process
    remote_conn.send("exit\n")
    ssh_client.close()

if __name__ == '__main__':
    ssh_command()

You can download the script from my GitHub page. I set up crontab to run the script at 20-minute intervals:
1. Make sure your script is working first (you have filled out the username, password, and hostname fields and it executes correctly with Python 2.x).
2. Run crontab -e and select 1 for nano when prompted for an editor.
3. Add a new line like so:
*/20 * * * * python /path/to/PADebugCmd.py
Where /path/to is your directory path to the script file.
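Optionally, if you want a record of each run, the output can be redirected to a log file from the same crontab entry (the log path here is just an example):

*/20 * * * * python /path/to/PADebugCmd.py >> /var/log/padebugcmd.log 2>&1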

Unisphere Management service won’t start on EMC VNX Array

OK, so we had a scheduled power outage at one of our secondary DCs. The SAN was shut down following the VNX shutdown procedure via Unisphere.

After the power was back on, everything on the SAN came up except the Unisphere web management. After some Google-fu, the culprit turned out to be that the NAS slots had not come up properly; in fact, they were powered off.

[root@vnx-cs0 ~]# /nasmcd/getreason
10 - slot_0 primary control station
11 - slot_1 secondary control station
0 - slot_2 off
0 - slot_3 off

As you can see, the slot 2 and 3 NAS blades are off. This array uses only block storage, not file, but that doesn't matter: the blades still need to be running for the virtual IP and the Unisphere management service. To power them on, use these commands:

/nasmcd/sbin/t2reset pwron -s 2
/nasmcd/sbin/t2reset pwron -s 3

Once they power on, the state changes in the getreason output:

5 - slot_2 contacted
5 - slot_3 contacted

This saved me a full round trip to this DC, very handy. If this still doesn't fix it, something is wrong with the physical cabling between the NAS blades and the Control Station, or there is a fault.

Troubleshooting VMware lost access to volume errors

Recently we have been seeing heaps of these errors filling up our VMware ESXi logs:

2015-07-02T02:00:11.282Z cpu10:36273)HBX: 2832: Waiting for timed out [HB state abcdef02 offset 3444736 gen 549 stampUS 115704005679 uuid 5592d754-21d7d8a7-0a7e-a0369f519998 jrnl drv 14.60] on vol 'example datastore'

And in the vCenter tasks/events for the host:

Lost access to volume 5592d754-21d7d8a7-0a7e-a0369f519998 (LUN 211) following connectivity issues. Recovery attempt in progress and outcome will be reported shortly.
Successfully restored access to volume 5592d754-21d7d8a7-0a7e-a0369f519998 (LUN 211) following connectivity issues.

When ESXi consistently disconnects and reconnects storage, there is a problem (obviously). Even outages of milliseconds will impact the performance of your workloads. I tried to follow the support steps in VMware KB 2136081, to no avail.
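Before blaming the fabric, it helped to get a rough count of how often each volume was affected. Here is a minimal Python sketch that tallies the 'Lost access to volume' messages from a saved copy of the host events; the file name and the exact message format are assumptions based on the entries above:

import re
from collections import Counter

# Tally "Lost access to volume <uuid> (<LUN label>)" events per volume from a
# saved log/events export (the file name is an example).
LOG_FILE = 'host-events.log'
PATTERN = re.compile(r'Lost access to volume (\S+) \(([^)]+)\)')

counts = Counter()
with open(LOG_FILE) as fh:
    for line in fh:
        match = PATTERN.search(line)
        if match:
            counts[match.groups()] += 1

for (uuid, label), hits in counts.most_common():
    print('%s (%s): %d lost-access events' % (uuid, label, hits))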

So to break this issue down logically, it can be one of three things:
1. The host (drivers, firmware, hardware, etc.)
2. The SAN (an EMC VNX array)
3. The switching fabric (we use Nexus switches with FCoE to the UCS fabric interconnects; storage and data share the same switching)

Hosts
I ruled out the UCS hosts by installing the newest storage drivers for ESXi 6.x. I also set up a new LUN and made sure no VMs were on it, meaning no I/O, and still the driver reported the dreaded timeout error message.
Storage
Next was the SAN. Since it is managed with EMC Support, I was able to log an SR for storage processor log analysis. This came back with the following:

[RSCN : Registered State Change Notification:]
SPB Interface:1 RSCN count:4
SPB Interface:0 RSCN count:4

fffffd : well known address of the Management server on the switch:

B 07/07/19 14:56:46 364afc0 FCOEDMQL 0 (FE4) DVM RSCN Recd - Id: 0xfffffd, Device State: 0x3
B 07/07/19 14:56:46 36bbfc0 FCOEDMQL 1 (FE5) DVM RSCN Recd - Id: 0xfffffd, Device State: 0x3
B 07/07/19 14:58:47 36bbfc0 FCOEDMQL 1 (FE5) DVM RSCN Recd - Id: 0xfffffd, Device State: 0x3
B 07/07/19 14:58:47 364afc0 FCOEDMQL 0 (FE4) DVM RSCN Recd - Id: 0xfffffd, Device State: 0x3

For further information:
RSCN and aborts due to DEVICE_RESET for reason KILLING_RSCN_AFFECTED_DEVICE.
When an RSCN (Registered State Change Notification) is received, the array compares the nodes that are logged in against the list sent from the name server (the switch).
The RSCN is a notification delivered from the switch to the array, telling the array that something happened on the fabric.
The array responds to this by deleting any connection affected by the RSCN notification; the host would then need to log in again.

From this I was able to deduce that the switching fabric was sending RSCN notifications to both SAN SPB ports. This causes the behavior seen in ESXi, where access to LUNs times out, reconnects, and repeats. At this point we knew there was something wrong with the switching fabric.

Switching Fabric
To find the issue on the fabric switching, I went through standard troubleshooting: check the physical layer first. Everything was connected and UP/UP.
Digging further through the logging on the Nexus switches, I saw that from cold boot some SFP+ transceivers were reported as not supported, even though the ports were UP and the command to allow unsupported transceivers was in the switch config.
As above, the RSCN logout errors were only happening on certain ports on the SAN side. I checked all four downlinks to the SAN from the Nexus side; they all had third-party SFPs. I decided to swap them all out for Cisco-branded SR SFP+ modules.

And with that the issue was resolved. Third-party optics had worked with this setup for years, and I suspect two of the four were faulty. Either way, it's a lesson: for critical links it's probably best to use branded optics where possible.

Setting up AD Auth with HashiCorp Vault

HashiCorp Vault is open source and can be used in DevOps processes for secure, automated retrieval of keys and secrets.
I recently set up Vault as a password/key store. Here is how to configure Vault for Active Directory LDAP authentication.

This setup assumes the following:
'sAMAccountName' is the username attribute within AD for the users you want to authenticate to Vault.
The user must be a member of a specific group to be granted access to the Vault secrets path.
Vault is installed and initialized, and you have the root token available.

1. Create a file named IT.hcl with the following Vault policy as its contents:

path "secret/data/IT" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

2. Write the policy into the Vault:

vault policy write IT IT.hcl

3. Enable LDAP Auth:
vault auth enable ldap
4. Write the LDAP auth config (edit the values for your binddn, groups and server name):

vault write auth/ldap/config \
    url="ldap://server.domain.name" \
    userattr="samaccountname" \
    userdn="ou=Users,dc=domain,dc=name" \
    groupdn="ou=Groups,dc=domain,dc=name" \
    groupfilter="(&(objectClass=group)(member:1.2.840.113556.1.4.1941:={{.UserDN}}))" \
    groupattr="samaccountname" \
    binddn="cn=vault,ou=users,dc=domain,dc=name" \
    bindpass='My$ecrt3tP4ss' \
    upndomain="domain.name"

5. Map the Vault IT policy to the IT AD group:

vault write auth/ldap/groups/IT policies=IT

Note that in AD the group should be named 'IT' for this example.
6. Test Vault AD Authentication:

vault login -method=ldap username='myUser'

7. Confirm your AD user has the permissions set in the IT Vault policy:

vault token capabilities secret/data/IT

In this example the AD user myUser is a member of the AD group 'IT', which has full permission to the secret/data/IT path in Vault. All done 🙂
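For automated retrieval from scripts (the DevOps use case mentioned above), the same LDAP login can be done programmatically. Here is a minimal sketch using the hvac Python client; the Vault URL and credentials are placeholders, and it assumes the KV v2 secrets engine is mounted at the default 'secret' path as in the policy above:

import hvac

# Connect to Vault (URL is a placeholder for your Vault server).
client = hvac.Client(url='https://vault.domain.name:8200')

# Authenticate with AD credentials via the LDAP auth method configured above.
client.auth.ldap.login(username='myUser', password='MyADPassword')

# Read a secret stored under secret/data/IT (KV v2).
secret = client.secrets.kv.v2.read_secret_version(path='IT')
print(secret['data']['data'])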

Palo Alto Networks Powershell Backup Script

As we recently rolled out a bunch of PA firewalls, I created a PowerShell script to back up the running configuration using the XML API.

You can grab the script from my github here:
panbackup

The instructions are as follows:
1. Create the folder c:\panbackup\ on your Windows Server.

2. Create a local administrator on the firewall as a member of superusers (read-only). This grants the rights to export the full configuration with the phash keys, which means you can restore the config to a new appliance easily. The PA documentation below details how to create a local firewall administrator: Create Firewall Administrator

3. Generate your API key by browsing to https://<firewall>/api/?type=keygen&user=<username>&password=<password>. You can also generate the API key via cURL as per the PA documentation below:
Generate your API key

4. Test your PowerShell script! You may need to set the correct save path, file names, etc. Add a scheduled task and voila! Peace of mind.
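If you would rather not use PowerShell, the same export can be sketched in Python with the requests library. This is not the script from the repo, just a rough equivalent to illustrate the XML API call; the firewall address, API key, and output path are placeholders:

import requests

# Placeholders: substitute your firewall address and the API key generated in step 3.
FIREWALL = 'https://192.0.2.1'
API_KEY = 'YOUR-API-KEY'

# Export the full running configuration via the PAN-OS XML API.
# verify=False is only acceptable for a self-signed management certificate.
resp = requests.get(
    FIREWALL + '/api/',
    params={'type': 'export', 'category': 'configuration', 'key': API_KEY},
    verify=False,
)
resp.raise_for_status()

with open(r'c:\panbackup\running-config.xml', 'wb') as fh:
    fh.write(resp.content)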

ELK Stack with Palo Alto Firewall – Using Curator to clear indexes

I recently deployed an ELK stack (Elasticsearch, Logstash, Kibana) VM as a logger for a Palo Alto Networks firewall. ELK is open source and allows you to create beautiful dashboards in Kibana.
I followed this guide for integrating the PAN firewall with ELK: palo-alto-elasticstack-viz.

Overview Dashboard
Threat Dashboard

The issue I was having is that the Elasticsearch indices would continue to grow and the VM would eventually run out of disk space. To solve this problem I did the following:

1. Change to daily indices, based on date stamp. Edit the Logstash config like so (this edit follows on from the PAN-OS.conf Logstash configuration file in the guide above):

output {
  if "PAN-OS_Traffic" in [tags] {
    elasticsearch {
      index => "panos-traffic-%{+YYYY.MM.dd}"
      hosts => ["localhost:9200"]
      user => "elastic"
      password => "yourpassword"
    }
  }
  else if "PAN-OS_Threat" in [tags] {
    elasticsearch {
      index => "panos-threat-%{+YYYY.MM.dd}"
      hosts => ["localhost:9200"]
      user => "elastic"
      password => "yourpassword"
    }
  }
  else if "PAN-OS_Config" in [tags] {
    elasticsearch {
      index => "panos-config-%{+YYYY.MM.dd}"
      hosts => ["localhost:9200"]
      user => "elastic"
      password => "yourpassword"
    }
  }
  else if "PAN-OS_System" in [tags] {
    elasticsearch {
      index => "panos-system-%{+YYYY.MM.dd}"
      hosts => ["localhost:9200"]
      user => "elastic"
      password => "yourpassword"
    }
  }
}

Logstash will now create a daily index, based on date stamp, for each of the firewall log inputs.
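To confirm the daily indices are actually being created, you can query the _cat/indices API. A quick sketch with the Python requests library, assuming the same localhost endpoint and elastic credentials used in the Logstash output above:

import requests

# List the panos-* indices so you can see the daily naming pattern and their sizes.
resp = requests.get(
    'http://localhost:9200/_cat/indices/panos-*',
    params={'v': 'true'},
    auth=('elastic', 'yourpassword'),
)
print(resp.text)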
2. Use the Elasticsearch Curator CLI tool in a shell script and run it with crontab:
Create /etc/curator/config.yml

client:
  hosts:
    - 127.0.0.1
  port: 9200
  url_prefix:
  use_ssl: False
  certificate:
  client_cert:
  client_key:
  ssl_no_validate: False
  http_auth: elastic:yourpassword
  timeout: 30
  master_only: False

logging:
  loglevel: INFO
  logfile:
  logformat: default
  blacklist: ['elasticsearch', 'urllib3']

Create /etc/curator/delete-after.yml
Set unit_count to the number of days of indices to keep. In my example, anything older than 60 days gets deleted.

actions:
  1:
    action: delete_indices
    description: >-
      Delete indices older than X days (based on index name), for panos-
      prefixed indices. Ignore the error if the filter does not result in an
      actionable list of indices (ignore_empty_list) and exit cleanly.
    options:
      ignore_empty_list: True
      disable_action: False
    filters:
    - filtertype: pattern
      kind: prefix
      value: panos-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 60

Create /etc/curator/cleanup.sh and paste in:

#!/bin/bash
# This df command grabs the used-space percentage of the root '/' filesystem (the % sign is stripped).
disk=$(df -H | grep -vE '^Mounted| /.' | awk '{ print $1 " " $5 " " $6 }' | awk 'NR == 2' | awk '{print $2}' | sed 's/%//')

# Delete indices older than 60 days.
curator --config /etc/curator/config.yml /etc/curator/delete-after.yml
echo $disk

Make the script executable (chmod +x /etc/curator/cleanup.sh), then add it to crontab to run 5 minutes past midnight every night:

sudo crontab -e
5 0 * * * /etc/curator/cleanup.sh

That's it! You can tweak the unit_count days if you want to keep, say, only 7 days' worth of logs, depending on your use case. You can also run Curator manually like so:

sudo curator --config /etc/curator/config.yml /etc/curator/delete-after.yml

This helps when debugging your script logic and checking that Elasticsearch is actually deleting indices.

Request a SAN certificate using MS CA Web enrollment Pages

1. Run these commands on the MS CA server to allow SAN attributes in web enrollment requests, then restart the certificate service:

certutil -setreg policy\EditFlags +EDITF_ATTRIBUTESUBJECTALTNAME2
net stop certsvc
net start certsvc

2. In the Attributes box, type the desired SAN attributes. SAN attributes take the following form:

san:dns=dns.name[&dns=dns.name]

For example, to add two DNS names to the SAN field, you can type:

san:dns=corpdc1.fabrikam.com&dns=ldap.fabrikam.com

Renaming a vSwitch in VMWare ESXi

At the ESXi console, log in and hit Alt-F1, then type unsupported and hit Enter. You won't see the word "unsupported" appear as you type it, but upon hitting Enter you'll be prompted for the root password. Type it in and hit Enter.

You will be presented with the ESXi command prompt. The default text editor in the ESXi CLI is vi:

cd /etc/vmware
vi esx.conf

Search for "name" using Esc, /name, Enter, and keep hitting n (next) until you find the incorrectly named vSwitch. Change the word by hitting Esc, then cw, typing the correct name, and pressing Esc again.

/net/vswitch/child[0001]/name = "SAN-A"

If you’re happy the name has been changed correctly in esx.conf, hit Esc, :wq! and hit Enter to write the changes back to disk and quit vi.
Back at the Linux prompt, type clear to clear the screen, then type exit and hit Enter to log out of the console. Alt-F2 will close the "Unsupported Console", returning you to the black and yellow ESXi console.

Esc to log out, then finally F11 to restart the host.

Note: I have tested this on ESXi 6.7u1. You must restart after the change to esx.conf; changes made via the UI do not work until ESXi is reloaded.

Configure iSCSI Port Binding using ESXCLI

1. Enable the SSH console from Management Settings > Troubleshooting Options on the ESXi console (DCUI).
2. Bring up the CLI console with Alt-F1.
3. Enter the following commands to bind the VMkernel port(s) to the software iSCSI adapter (ESXi 6.7):


~ # esxcli iscsi networkportal add --nic vmk1 --adapter vmhba64
~ # esxcli iscsi networkportal add --nic vmk2 --adapter vmhba64
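To double-check the result, the bound portals can be listed from the same esxcli namespace (this should show vmk1 and vmk2 against vmhba64; verify the sub-command against your ESXi build):

~ # esxcli iscsi networkportal list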

Deploying Dell vLT on OS10

I was recently asked to assist our training environment with the deployment of a mid-market Dell data center utilizing OS10 and Virtual Link Trunking (VLT). VLT is Dell's implementation of multi-chassis EtherChannel, similar to Cisco's Virtual Port Channel (vPC). OS10 is an open networking platform; it runs on Debian Linux and boots off the ONIE (Open Network Install Environment) boot loader.

For this environment / lab here is our bill of materials:
2 x Dell S4112F-ON
2 x Dell R640 1 RU servers
1 x Dell sc3000v2 SAN
1 x Palo Alto Networks PA-220

Dell Data Centre Design


Dell VLT allows multiple switches to appear as one logical unit. VLT has layer 3 routing features and as such can, out of the box, replace FHRP protocols such as VRRP. In this design there is a pair of S4112F-ON switches connected at 100 Gb/s. The left side is the PVST+ root bridge for all VLANs and is also the primary VLT peer. One thing to note regarding the Dell VLT peer-routing feature: both L3 SVI IP addresses will perform IP routing for a subnet, so if the primary switch in a VLT pair goes down, the secondary will transparently respond to IP routing requests. You cannot ping the primary gateway IP while this failover scenario is in effect. Note that VLT port-channels require a mirrored configuration on both switch members, regardless of whether the port-channel is in access or trunk mode. There is an example below.

Here is the OS10 (10.4.2.2.265) configuration to achieve the above VLT design:
Core-1

vlt-domain 1
 backup destination 192.168.x.2
 discovery-interface ethernet 1/1/14-1/1/15
 peer-routing
 primary-priority 1
 vlt-mac 00:11:22:33:44:56

interface port-channel3
 description PA-Internet-Uplink
 no shutdown
 switchport access vlan 550
 vlt-port-channel 3

interface ethernet1/1/7
 description Link-1-to-PA
 no shutdown
 channel-group 3

Core-2

vlt-domain 1
 backup destination 192.168.x.1
 discovery-interface ethernet 1/1/14-1/1/15
 peer-routing
 vlt-mac 00:11:22:33:44:56

interface port-channel3
 description PA-Internet-Uplink
 no shutdown
 switchport access vlan 550
 vlt-port-channel 3

interface ethernet1/1/7
 description Link-1-to-PA
 no shutdown
 channel-group 3
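Once both peers are configured, the VLT domain status can be checked from either switch; from memory the OS10 show command for this is the one below, but double-check it against your OS10 release documentation:

Core-1# show vlt 1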