Azure Kubernetes Service (AKS) Networking Deep Dive Part 1: Pod Intra-Node/Inter-Node Communication

If you are like me, without much Docker foundation and having learned everything cloud native after K8s was introduced, you might have the same urge to figure out how networking works within the cluster.

In this series, I will try to touch on:

  • Container Communication
  • Pod Intra-Node/Inter-Node Communication
  • Pod Service Communication
  • Pod Ingress Communication
  • Azure Container Network Interface (CNI)
  • CoreDNS

New concepts and solutions come out in K8s development every year, but I believe these topics should cover most of K8s networking for now.

For the majority of this series, the environment will be AKS with kubenet; AKS with Azure CNI will be dissected at the end. Note that AKS is slightly different from a bare-metal K8s cluster built in any environment, on-premises or cloud.

Container Communication

Containers need a way to communicate with the external environment. The docker0 bridge on each hosting Node serves as the container gateway and is created for this purpose.

docker0 is a virtual bridge interface created by Docker. Docker assigns it an address and subnet from a predefined private range (172.17.0.0/16 by default). All Docker containers are connected to this bridge and use the NAT rules created by Docker to communicate with the outside world.

On the Azure portal, we can see the Docker bridge is assigned a completely separate address space.

Next, if we exec into one of the hosting Nodes and check the iptables rule chains, we should see information like the below.

In a K8s cluster, each Pod gets a private IP address from the Pod CIDR, so the iptables “MASQUERADE” rule from source range “172.17.0.0/16” to destination range “0.0.0.0/0” would not match Pod traffic anywhere. In a pure Docker environment without K8s, however, the rule would be applied.

# get the nodes' names
- kubectl get nodes
# get into a node's shell (requires the kubectl node-shell plugin)
- kubectl node-shell <node name>
# go through the Docker-related iptables rule chains
- iptables -t nat -L -n | column -t | grep DOCKER
- iptables -t nat -L POSTROUTING -n | column -t
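For readers without a cluster handy, here is a minimal sketch of how the Docker SNAT rule would be spotted. The sample POSTROUTING output below is an assumption modeled on a typical Docker host, not captured from this cluster:

```shell
# Assumed sample of `iptables -t nat -L POSTROUTING -n` output on a
# typical Docker host (columns and addresses are illustrative).
sample='Chain POSTROUTING (policy ACCEPT)
target      prot  opt  source          destination
MASQUERADE  all   --   172.17.0.0/16   0.0.0.0/0'

# On a real node: iptables -t nat -L POSTROUTING -n | grep MASQUERADE
echo "$sample" | grep MASQUERADE
```

The MASQUERADE target rewrites the source address of any packet leaving 172.17.0.0/16, which is exactly the NAT behavior described above.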

Pod Intra-Node Communication

Let’s go through the background information. On Azure portal, if we create the cluster and keep everything as default, the networking information would be similar to below.

  • Kubenet is a basic network plugin and the default when an AKS cluster is created with default settings. Its characteristic is keeping separate Docker, Pod and Service address spaces so that nothing overlaps.
  • In AKS with kubenet, the Docker, Pod and Service address spaces are NOT on the same layer as the Azure virtual network. We can almost think of the cluster as a nested virtual environment where resources use their own nested private IP addresses to communicate.

As we all know, Pods are essentially just processes running on the hosting Node with software-defined isolation. This isolation is referred to as namespaces. In this series, we will focus heavily on network namespaces. If you would like to know more about namespaces, this is a very good article.

Pod’s eth0 Linked with Node’s vethX

Next, let’s look at an image that explains in detail how a Pod uses the hosting Node’s network capability. In the next few demonstrations, we will focus on the yellow-highlighted part: the linkage between the Pod’s network interface and the Node’s virtual network interface.

To look for more details, we need to get into the hosting Node’s environment, which can be done with SSH or Node Shell.

The following demonstration would be using Node Shell.

# get nodes' name
- kubectl get nodes
# get into one of the node's shell
- kubectl node-shell <node's name>
# if you exit the shell and check the Pods, you will find node-shell actually runs a Pod with privileged permissions to execute commands on the hosting Node
- kubectl get pods

When listing containers, each Pod shows up as two containers. The one using the “pause” image creates the Pod’s network namespace and keeps it functional even when no other container is running in the Pod; the other container is what administrators create. For more information about pause containers, please check here.

# list containers
- docker ps
# get container ID by approximate name
- docker ps | grep k8s_<pod name>_<pod name>
# get process ID of the container
- docker inspect --format '{{.State.Pid}}' <container ID>
# enter the network namespace of the process
- nsenter -t <process ID> -n ip addr
Ref: More about nsenter
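As a sketch of the two-containers-per-Pod pattern described above, we can count a Pod's entries in `docker ps` output. The sample below is assumed (image tags and container IDs are illustrative, following the `k8s_<container>_<pod>_...` naming convention):

```shell
# Assumed sample of `docker ps` output on a node running a Pod named "pod1".
sample='CONTAINER ID  IMAGE                 NAMES
1a2b3c4d5e6f  nginx                 k8s_nginx_pod1_default_abc
6f5e4d3c2b1a  k8s.gcr.io/pause:3.6  k8s_POD_pod1_default_abc'

# Each Pod appears twice: the pause (sandbox) container that owns the
# network namespace, plus the workload container itself.
echo "$sample" | grep -c pod1
```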

After getting into the process’ network namespace, we can check the name and index of the network interface. In my lab environment, it is “eth0@if11”: “11” is the peer index, and we should be able to map it to a “vethxxx” interface on the hosting Node.

# list the ip links on the hosting Node
- ip link list

We now know that Pod1’s network interface “eth0@if11” is linked with the hosting Node’s “veth66c1e3b@if3”: eth0’s peer index (11) is the veth’s ifindex on the Node, and the veth’s peer index (3) is eth0’s ifindex inside the Pod.
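The pairing logic can be sketched in shell. The `eth0@if11` name comes from the lab output above; the `ip link list` sample is an assumed, abridged version showing only the two relevant lines:

```shell
# Inside the Pod's netns, `ip addr` showed an interface named like this:
pod_if='eth0@if11'

# "@if11" means the other end of the veth pair is ifindex 11 in the root netns.
peer_index=${pod_if#*@if}   # strips everything up to "@if", leaving "11"

# Assumed/abridged sample of `ip link list` on the hosting Node:
sample='10: cbr0: <BROADCAST,MULTICAST,UP> mtu 1500
11: veth66c1e3b@if3: <BROADCAST,MULTICAST,UP> mtu 1500 master cbr0'

# Print the interface whose ifindex equals the Pod's peer index:
echo "$sample" | awk -F': ' -v idx="$peer_index" '$1 == idx { print $2 }'
```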

Container Bridge (cbr0)

Next, we would look at how Pods within the same hosting Node find each other and communicate. We created pod1 and pod2 on the same node; pod3 on a different node.

The communication is done mostly by a container bridge interface (cbr0).

# in the hosting Node
- ifconfig -a

We can see “cbr0” uses the IP address 10.244.0.1, which is in the Pod address space shown previously. If we use Node Shell to get into another Node, we will see a different address in use, since each Node is allocated its own slice of the Pod CIDR.

Pod1 Connecting to Pod2

With Pod1’s eth0 linked to the hosting Node’s vethX, Pod2’s eth0 linked to the hosting Node’s vethY, and cbr0 in the Pod address space, Pods can communicate with each other without further configuration.

The network packet originates from Pod1’s eth0, crosses to the hosting Node’s vethX, and reaches cbr0, which makes an ARP request to find where Pod2 is located. Within the same Node, the ARP reply provides Pod2’s MAC address. The packet then goes through the hosting Node’s vethY into Pod2’s eth0.
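Once the ARP exchange completes, the Node's neighbor table records the mapping through cbr0. A minimal sketch, with an assumed `ip neigh` sample (the Pod IP sits in the 10.244.0.0 range seen earlier; the MAC is illustrative):

```shell
# Assumed sample of `ip neigh` on the hosting Node after Pod1 reached Pod2.
sample='10.244.0.5 dev cbr0 lladdr aa:bb:cc:dd:ee:02 REACHABLE'

# Pod2's IP now resolves to a MAC address reachable via the cbr0 bridge:
echo "$sample" | awk '{ print $3, $5 }'
```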

That’s it! That is how Pods communicate with each other in the same Node!

Pod Inter-Nodes Communication

All the concepts of Pod communication are the same as within a single Node. The only difference is that when Pod1 (on Node1) tries to contact Pod3 (on Node2), the ARP request on Node1’s container bridge (cbr0) fails because Pod3 is not local. Node1 therefore forwards the packet according to the configured user-defined routes, and Node1’s eth0 sends it to the right Node (Node2) where Pod3 is located. Node2’s cbr0 then makes its own ARP request to find Pod3, and the connection is made.

Pod1 (Node1) Connects to Pod3 (Node2)

Source: Understanding Kubernetes Networking — Part 2 | by Sumeet Kumar | Microsoft Azure | Medium

There is a root network namespace (netns) that contains the hosting Node’s actual network interface (eth0) as well as the virtual network interfaces paired with each Pod’s network interface in the Pod’s own netns.

  • Pod1 netns’ eth0 ← → Root netns’ veth0
  • Pod2 netns’ eth0 ← → Root netns’ veth1
  • Node1 and Node2 both have cbr0 as container bridge. Container bridge is in the same Pod address space.

User-Defined Route

Since this runs on Azure infrastructure, user-defined routes actually need to be added on the platform for requests to be sent to the right Node.

Node’s IP Forwarding

Within each Node, IP forwarding is enabled and the routing table is updated.

# get IP forwarding setting
- sysctl -a | grep -i net.ipv4.ip_forward
# get routing table information
- route -n
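Putting the two checks together, here is a sketch of the forwarding decision for inter-node traffic. The sysctl and `route -n` samples are assumed, with the 10.244.1.0/24 entry standing in for Node2's Pod range and 10.240.0.5 for Node2's IP:

```shell
# Assumed sample of `sysctl net.ipv4.ip_forward` on the Node:
fwd='net.ipv4.ip_forward = 1'
echo "$fwd" | awk -F' = ' '{ print $2 }'   # 1: Node forwards between cbr0 and eth0

# Assumed/abridged sample of `route -n` output:
routes='Destination   Gateway      Genmask         Iface
10.244.0.0    0.0.0.0      255.255.255.0   cbr0
10.244.1.0    10.240.0.5   255.255.255.0   eth0'

# Pod3 (say 10.244.1.7) is not on the local cbr0 subnet, so the packet
# follows the 10.244.1.0/24 route out eth0 toward Node2:
echo "$routes" | awk '$1 == "10.244.1.0" { print $2, $4 }'
```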

In the next part, we will look at Pod Service Communication and Pod Ingress Communication. The difference is the Pod interacting with a K8s-native L4 load balancer versus a solution acting as an L7 load balancer. Of course, a lot of details need to be covered.

I learn new things about Kubernetes every day. Hopefully these learning notes can help people on the same journey!