Use Case 1: vSwitch Deployment
The figure below shows a deployment of a PA-VM on an
ESXi host where the data ports “Port 1” and “Port 2” are linked
to eth1 and eth2 of the PA-VM. Each port hosts two queue pairs (for
example, Tx0/Rx0 and Tx1/Rx1), that is, it has multiqueue enabled.
Enabling multiqueue and RSS to load balance packets across multiple
queues improves processing performance. The PA-VM uses internal logic
to map vCPUs to ports and queues. In this case, packets received on
and sent from P1/Q0 and P2/Q0 are processed by data plane task T1,
which runs on (that is, is pinned to) vCPU1. Data plane task T2
follows a similar association, as shown in the vSwitch deployment
diagram.
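The port/queue to task association described above can be sketched in
a few lines of Python. This is only an illustration: the actual PA-VM
mapping logic is internal, the function name map_queues_to_tasks is
hypothetical, and the assumption that T2 handles P1/Q1 and P2/Q1 on
vCPU2 is inferred from the "similar association" wording rather than
stated outright.

```python
# Hypothetical sketch: queue i of every port is served by data plane
# task T(i+1), pinned to its own vCPU. The real PA-VM logic is internal
# and may differ.

def map_queues_to_tasks(ports, queues_per_port, first_dp_vcpu=1):
    """Return {(port, queue): (task, vcpu)} for all port/queue pairs."""
    mapping = {}
    for port in ports:
        for queue in range(queues_per_port):
            task = f"T{queue + 1}"           # one task per queue index
            vcpu = first_dp_vcpu + queue     # each task pinned to one vCPU
            mapping[(port, queue)] = (task, vcpu)
    return mapping


mapping = map_queues_to_tasks(["P1", "P2"], queues_per_port=2)
for (port, queue), (task, vcpu) in sorted(mapping.items()):
    print(f"{port}/Q{queue} -> {task} on vCPU{vcpu}")
```

With two ports and two queues per port, P1/Q0 and P2/Q0 land on T1
(vCPU1), matching the association described in the text.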
The two data plane tasks run on vCPU1 and vCPU2, which are
non-sibling CPUs (that is, they do not share a physical core when
hyperthreading is enabled). As a result, even with hyperthreading
enabled, the tasks can be pinned to different cores for high
performance. The data plane task vCPUs also all belong to the same
NUMA node (or socket) to avoid NUMA-related performance issues.
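The two placement rules above (non-sibling CPUs, same NUMA node) can
be expressed as a simple check. The sketch below is illustrative only:
the TOPOLOGY table is made-up data, and check_placement is a
hypothetical helper, not a PA-VM or ESXi function. On a real host the
core and NUMA node of each logical CPU would come from the hypervisor
or operating system.

```python
# Hypothetical topology: logical CPU -> (physical core id, NUMA node).
# These values are invented for illustration.
TOPOLOGY = {
    0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0),
    4: (0, 0), 5: (1, 0), 6: (2, 0), 7: (3, 0),  # hyperthread siblings of 0-3
    8: (4, 1), 9: (5, 1),                        # cores on a second NUMA node
}


def check_placement(cpus, topology):
    """Return (non_sibling, same_numa) for the given logical CPUs."""
    cores = [topology[cpu][0] for cpu in cpus]
    nodes = {topology[cpu][1] for cpu in cpus}
    non_sibling = len(set(cores)) == len(cores)  # no two CPUs share a core
    same_numa = len(nodes) == 1                  # all CPUs on one node
    return non_sibling, same_numa


print(check_placement([1, 2], TOPOLOGY))  # (True, True): good placement
print(check_placement([1, 5], TOPOLOGY))  # (False, True): siblings on core 1
print(check_placement([1, 8], TOPOLOGY))  # (True, False): crosses NUMA nodes
```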
Two other performance bottlenecks can be addressed by increasing
the queue sizes and by dedicating a vCPU or thread to each port to
schedule traffic to and from that port. Increasing the queue sizes
(Qsize) accommodates large, sudden bursts of traffic and prevents
packet drops under bursty traffic. Adding a dedicated CPU thread
(ethernetX.ctxPerDev = 1) for port-level packet processing allows
traffic to be processed at a higher rate, which increases
throughput toward line rate.
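To see why a larger queue absorbs bursts, consider a toy model of a
bounded receive queue that is drained at a fixed rate. The sketch
below is an assumption-laden illustration: simulate_drops is a
hypothetical helper, and the queue sizes, drain rate, and traffic
pattern are invented numbers, not PA-VM or ESXi defaults.

```python
def simulate_drops(arrivals, queue_size, drain_per_tick):
    """Count packets dropped by a bounded FIFO receive queue.

    arrivals:       number of packets arriving in each time tick
    queue_size:     maximum packets the queue can hold
    drain_per_tick: packets the consumer removes per tick
    """
    depth = 0
    dropped = 0
    for arriving in arrivals:
        # The consumer drains the queue before new packets are admitted.
        depth = max(0, depth - drain_per_tick)
        accepted = min(arriving, queue_size - depth)
        dropped += arriving - accepted   # overflow is dropped
        depth += accepted
    return dropped


# Illustrative traffic: steady trickle with one burst of 600 packets.
traffic = [10, 10, 600, 10, 10, 10]
for size in (256, 512, 1024):
    drops = simulate_drops(traffic, size, drain_per_tick=100)
    print(f"queue size {size}: {drops} packets dropped")
```

In this model the burst overflows the 256-entry and 512-entry queues
(344 and 88 drops respectively), while the 1024-entry queue absorbs it
without loss.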
The PA-VM packet processing mode also determines performance. It
can be set to either DPDK or PacketMMAP. DPDK uses a poll mode
driver (which one depends on the driver type) to constantly poll for
packets received in the queues, which leads to higher throughput.
The latency that packets experience depends on the poll period.
If polling is continuous (that is, busy-poll, a setting in the
PAN-OS CLI), vCPU utilization for the data plane tasks will be 100%,
but this yields the best performance. Otherwise, the software
internally uses a millisecond-level polling time to prevent
unnecessary use of CPU resources.
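The relationship between poll period and latency can be illustrated
with a small calculation: a packet that arrives between two polls
waits in the queue until the next poll. The sketch below is a
simplified model, not PA-VM code; poll_wait_us is a hypothetical
helper, the arrival times and poll periods are invented, and busy-poll
is modeled as a period of 0 (immediate pickup).

```python
def poll_wait_us(arrival_us, poll_period_us):
    """Microseconds a packet waits in the queue until the next poll.

    Polls are assumed to occur at multiples of poll_period_us.
    A period of 0 models busy-poll: the packet is picked up at once.
    """
    if poll_period_us == 0:
        return 0
    remainder = arrival_us % poll_period_us
    return 0 if remainder == 0 else poll_period_us - remainder


# Illustrative packet arrival times, in microseconds.
arrivals = [0, 250, 500, 999, 1500]
for period in (0, 100, 1000):
    waits = [poll_wait_us(t, period) for t in arrivals]
    print(f"poll period {period} us: waits={waits}, worst={max(waits)} us")
```

With a 1000 us (millisecond-level) period, the worst wait in this
sample is 750 us; with a 100 us period it is 50 us; with busy-poll it
is 0, at the cost of keeping the vCPU fully occupied.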
PacketMMAP, on the other hand, has lower throughput than DPDK
but works with any network driver. For DPDK, the vSwitch
driver must support DPDK. PacketMMAP works with interrupts
that are raised when a packet is received by the port and placed
in the receive queue. This means that for every packet, or group
of packets, an interrupt is raised and packets are drained from the
receive queue for processing. This results in lower packet
processing latency but reduced throughput, because every interrupt
must be serviced, which causes higher CPU overhead. In general,
PacketMMAP has lower packet processing latency than DPDK (without
the busy-poll modification).
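The CPU overhead of interrupt-driven processing can be approximated
with a simple cost model: each packet costs some processing time, and
each interrupt adds a fixed servicing cost on top. The sketch below is
a rough illustration under assumed numbers; cpu_time_us is a
hypothetical helper, and the per-packet and per-interrupt costs are
invented, not measured PA-VM values.

```python
import math


def cpu_time_us(packets, per_packet_us, interrupt_us, packets_per_interrupt):
    """Total CPU time when one interrupt is raised per group of packets."""
    interrupts = math.ceil(packets / packets_per_interrupt)
    return packets * per_packet_us + interrupts * interrupt_us


# Assumed costs: 1 us to process a packet, 5 us to service an interrupt.
one_per_packet = cpu_time_us(1000, per_packet_us=1, interrupt_us=5,
                             packets_per_interrupt=1)
one_per_group = cpu_time_us(1000, per_packet_us=1, interrupt_us=5,
                            packets_per_interrupt=10)
print(f"interrupt per packet:     {one_per_packet} us")
print(f"interrupt per 10 packets: {one_per_group} us")
```

Under these assumed costs, servicing an interrupt for every packet
takes 6000 us of CPU time for 1000 packets, against 1500 us when one
interrupt covers a group of 10, which is why per-interrupt overhead
limits throughput.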