Azure Kubernetes Service (AKS) Networking Deep Dive Part 2 — Pod and Service Communication & Pod and Ingress Communication

Pod and Service Communication

To understand how external clients access services provided by Pods, we first create a simple NGINX Deployment with 3 replicas, then expose it with a native LoadBalancer Service.
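A minimal sketch of that setup (the Deployment name “nginx” is an example; any name works):

```shell
# Create an NGINX Deployment with 3 replicas and expose it through an
# Azure load balancer. The Service name follows the Deployment name.
kubectl create deployment nginx --image=nginx --replicas=3
kubectl expose deployment nginx --port=80 --type=LoadBalancer

# Once Azure provisions the load balancer, EXTERNAL-IP shows the public
# 20.x.x.x address referenced throughout this walkthrough.
kubectl get service nginx
```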

Inbound Communication

For NodePorts, we can execute the command below on a Node to see whether port 32643 is actually listening for external clients.
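A sketch of that check, run on a Node (e.g. over SSH); 32643 is the NodePort assigned in this walkthrough, so substitute your own:

```shell
# Check whether the NodePort is held open on the Node. kube-proxy opens
# the port to reserve it, so it shows up as a listening socket.
sudo ss -tlnp | grep 32643
# On older images without ss, netstat works the same way:
sudo netstat -tlnp | grep 32643
```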

To see how the traffic gets network address translated (NAT), use the command below to look at the NAT table in iptables. We only focus on the last 2 rules, as both IP addresses, 10.0.x.x and 20.x.x.x, appeared in the previous screenshots, meaning these 2 rules belong to the Service we just created.
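A sketch of how to pull out the relevant rules on a Node (10.0.x.x and 20.x.x.x are placeholders for the cluster IP and load balancer IP):

```shell
# List the NAT rules kube-proxy installed for all Services.
sudo iptables -t nat -L KUBE-SERVICES -n

# iptables-save output is often easier to read; grep for the hash in the
# chain name ("BMQ5UNGRIS2RIY35" here) to see every rule for one Service.
sudo iptables-save -t nat | grep BMQ5UNGRIS2RIY35
```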

Any external client visiting the load balancer IP address “20.x.x.x” first hits “KUBE-FW-BMQ5UNGRIS2RIY35”.

Once the network packets enter the “KUBE-FW-BMQ5UNGRIS2RIY35” chain, they are sent to “KUBE-MARK-MASQ” to be prepared for SNAT.

When the network packets go through “KUBE-MARK-MASQ”, they are marked for masquerading.

Masquerade is a form of dynamic SNAT in iptables: the source address is rewritten to the address of the outgoing interface, so replies can be routed back without disturbing the original connection.

Simply put, the network packets are tagged with the mark 0x4000. With this tag, the Linux system knows to source network address translate (SNAT) them to a private IP address. The actual masquerading happens later; to see it, you can also check out “KUBE-POSTROUTING”.
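A sketch of how to inspect both chains on a Node (the comments describe what kube-proxy typically installs there; dump your own to confirm):

```shell
# KUBE-MARK-MASQ contains a single MARK rule that sets 0x4000 on the packet.
sudo iptables -t nat -L KUBE-MARK-MASQ -n -v

# KUBE-POSTROUTING contains the MASQUERADE rule that only fires for packets
# carrying the 0x4000 mark, i.e. the ones tagged above.
sudo iptables -t nat -L KUBE-POSTROUTING -n -v
```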

The network packets would then go to “KUBE-SVC-BMQ5UNGRIS2RIY35”.

The network packets are then routed randomly to either “KUBE-SEP-RH2TKPYPHJIEMIWG” or “KUBE-SEP-36IDCPEH467QNWVS”. The number of “KUBE-SEP-*” rules shows how many actual endpoints sit behind the K8s LoadBalancer Service. In the end, traffic sent to the Service’s private IP address “10.0.x.x” must be translated to an actual Pod providing the service.

One interesting thing we can observe in this rule chain is that the load-balancing probabilities change as more Pods provide the service. With 2 Pods, the probability of going to either one is 50%, hence 0.5 is seen on the screen. With 3 Pods, the first rule shows 0.33, as the request could hit any of the 3 endpoints. The second rule then shows 0.5 because iptables rule chains are read top down: a request that did not match the first rule must be distributed among the remaining 2 endpoints. The same logic continues, so the last rule always matches unconditionally. In general, the n-th rule out of N carries probability 1/(N − n + 1).
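To sanity-check that arithmetic without a cluster, here is a small shell + awk sketch that reproduces the per-rule probabilities iptables uses in statistic mode for N endpoints:

```shell
# Rule i (read top down) out of N matches with probability 1/(N - i + 1).
N=3
for i in $(seq 1 "$N"); do
  awk -v n="$N" -v i="$i" \
    'BEGIN { printf "rule %d: probability %.5f\n", i, 1/(n-i+1) }'
done
# Output:
# rule 1: probability 0.33333
# rule 2: probability 0.50000
# rule 3: probability 1.00000
```

This matches the 0.33 and 0.5 figures seen in the iptables output for 3 Pods.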

We can pick either “KUBE-SEP-*” chain; it does not matter. In this rule, the network packets are DNATed to the endpoint of one of the Pods seen in the previous screenshot.

You might not get the full DNAT rule detail because of the Ubuntu server’s iptables version. If you are unable to upgrade the server to the next version, you can get into the kube-proxy container’s shell to read the details instead.
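A sketch of that workaround (the Pod name below is an example; list the kube-proxy Pods first to find the one on your Node):

```shell
# Find the kube-proxy Pod running on the Node you are inspecting.
kubectl get pods -n kube-system -o wide | grep kube-proxy

# Open a shell inside it (Pod name is a placeholder).
kubectl exec -it -n kube-system kube-proxy-abc12 -- sh

# Inside the container, dump the full rules for the Service chain,
# including the DNAT target details.
iptables -t nat -L KUBE-SVC-BMQ5UNGRIS2RIY35 -n -v
```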

The network packets are dropped if no DNAT endpoints can be found, for example when the K8s Service’s selector does not match the Deployment’s Pod labels.

To know more about iptables, please check this site.

Here is the flow of how external clients contact a K8s Service exposed with a LoadBalancer. Please ignore the green rectangle that is not connected to the orange one; it only represents a non-existent Pod endpoint.

Load Balancer External Traffic Policy

A K8s LoadBalancer Service defaults to the external traffic policy “Cluster”, meaning traffic landing on any Node may be forwarded across the whole cluster to reach the actual Pod endpoints. The upside is that the request is always routed to a service-providing Pod; the downside is the extra hops inside the cluster, consuming resources when it does not need to.

The iptables rule chains would be looking similar to below.

The other external traffic policy is “Local”, meaning traffic is only routed to Pods on the receiving Node itself instead of across the whole cluster. The upside is that there are no extra hops inside the cluster and the client source IP is preserved; the downside is that the Node simply drops the network request when it has no local Pod endpoint providing the service.
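To experiment with the two policies yourself, the Service can be switched in place; a sketch (Service name “nginx” is the example from earlier):

```shell
# Switch from the default "Cluster" policy to "Local".
kubectl patch service nginx \
  -p '{"spec":{"externalTrafficPolicy":"Local"}}'

# Verify the change. With "Local", the cloud load balancer health-checks
# each Node so it only sends traffic to Nodes with a ready local endpoint.
kubectl get service nginx -o jsonpath='{.spec.externalTrafficPolicy}'
```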

The iptables rule chains would be looking similar to below.

Outbound Communication

When Pods reach out to the external environment, the iptables rule chain is pretty similar to inbound. The Pod’s source IP is masqueraded into the Node’s private IP address, and on the way out of the cluster it is translated again to the cluster’s public IP address. The Linux system then reads the conntrack records to know how to translate the reply packets back to the original Pod.

Connection Tracking (conntrack)

By now, we know that whenever external clients contact services provided by K8s, both the source and the destination addresses need to be NATed. A netfilter facility in Linux called “conntrack” tracks and maintains the state of each connection along with its NAT information.

We can use several basic conntrack commands to retrieve the information we need. To get more information, please check this site.
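A sketch of basic conntrack usage on a Node (requires root and the conntrack CLI; the IP address is a placeholder for a Service cluster IP):

```shell
# Dump the whole connection tracking table.
conntrack -L

# Only TCP connections with destination port 80.
conntrack -L -p tcp --dport 80

# Only entries destined for a particular Service cluster IP.
conntrack -L -d 10.0.37.21

# Count how many entries are currently in the table.
conntrack -C
```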

conntrack has its downsides, and you can find a lot of real-world scenarios on the Net. Here is a paragraph from Tigera:

The conntrack table has a configurable maximum size and, if it fills up, connections will typically start getting rejected or dropped. For most workloads, there’s plenty of headroom in the table and this will never be an issue. However, there are a few scenarios where the conntrack table needs a bit more thought:

The most obvious case is if your server handles an extremely high number of simultaneously active connections. For example, if your conntrack table is configured to be 128k entries but you have >128k simultaneous connections, you’ll definitely hit issues!

The slightly less obvious case is if your server handles an extremely high number of connections per second. Even if the connections are short-lived, connections continue to be tracked by Linux for a short timeout period (120s by default). For example, if your conntrack table is configured to be 128k entries, and you are trying to handle 1,100 connections per second, that’s going to exceed the conntrack table size even if the connections are very short-lived (128k / 120s = 1092 connections/s).

Quoted from here
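The arithmetic in the quote, plus where the limits live on a real Node, can be sketched like this (the /proc paths are standard, but the values vary per machine):

```shell
# With a 128k-entry table and the default 120 s tracking timeout, the
# sustainable rate is table_size / timeout, even for short-lived connections.
awk 'BEGIN { printf "%d connections/s\n", (128*1024)/120 }'
# Output: 1092 connections/s

# On a live Node, read the configured maximum and the current entry count
# (suppressed if the files are absent, e.g. when the module is not loaded).
cat /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null || true
cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || true
```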

Pod and Ingress Communication

We first install the NGINX Ingress Controller with the following one-liner.
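A sketch of the one-liner from the ingress-nginx project (the version tag here is an assumption; pick the release matching your cluster):

```shell
# Install the NGINX Ingress Controller using the cloud provider manifest.
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.2/deploy/static/provider/cloud/deploy.yaml
```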

Then, we create 2 simple Services, each with 1 Pod as the backend, for the NGINX Ingress Controller to route to. Follow this section of the official documentation and we should have at least 2 services ready behind the NGINX Ingress. Please make sure to create the example services within the default NGINX Ingress namespace, “ingress-nginx”.

The NGINX Ingress should look similar to the one below, except that each service has only 1 Pod instead of 2.

After deployment, we can execute the following commands to gather all the endpoint information.
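A sketch of the commands, scoped to the “ingress-nginx” namespace used above:

```shell
# Service cluster IPs and the controller's external IP.
kubectl get service -n ingress-nginx -o wide

# The Pod endpoints backing each Service.
kubectl get endpoints -n ingress-nginx

# The Pods themselves, with their 10.244.x.x Pod IPs and host Nodes.
kubectl get pods -n ingress-nginx -o wide
```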

Test visiting one of the services from the external environment through the Ingress. Without specifying a path, it should go to aks-hello-world-one by default.
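A sketch of that test (substitute the EXTERNAL-IP of the ingress-nginx-controller Service; the second path assumes a path-based rule like the one in the official example):

```shell
# No path: routed to the default backend, aks-hello-world-one.
curl http://<INGRESS_EXTERNAL_IP>/

# With a path rule configured, the same IP reaches the second service.
curl http://<INGRESS_EXTERNAL_IP>/hello-world-two
```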

Let’s dive deep into the iptables rule chains like we did in Pod-to-Service communication.

If you do not see the DNAT rule at the end, you can again get into the kube-proxy Pod’s shell on each Node for more details.

If we check which Pod owns the DNAT target’s private IP address, we find it is the NGINX Ingress Controller.
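A sketch of that lookup (the IP is a placeholder for the DNAT target seen in the rule):

```shell
# Quick and simple: grep the cluster-wide Pod list for the IP.
kubectl get pods --all-namespaces -o wide | grep 10.244.5.2

# Or filter on status.podIP directly with jsonpath.
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[?(@.status.podIP=="10.244.5.2")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
```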

That gives us an idea of how NGINX Ingress actually routes incoming requests to the correct endpoint. Let’s look at conntrack when connecting to NGINX Ingress from a Pod, a Node, and the external environment.

Pod

From a Pod to the NGINX Ingress Service, since it is within the cluster, the service FQDN should resolve to 10.0.37.21 and then route to the correct endpoint. In this case, without specifying a path, it should go to “aks-hello-world-one”.

We find the source IP seen by the backend is not the client Pod’s (10.244.5.4) but the NGINX Ingress Controller Pod’s (10.244.5.2).
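A sketch of how to observe that (Pod and Service names are examples; the conntrack command runs on the Node hosting the backend Pod):

```shell
# From a client Pod, curl the Ingress by its in-cluster FQDN.
kubectl exec -it client-pod -- \
  curl http://ingress-nginx-controller.ingress-nginx.svc.cluster.local/

# On the backend Pod's Node, look for entries whose source is the
# Ingress Controller Pod's IP rather than the client Pod's.
conntrack -L | grep 10.244.5.2
```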

Node

Aside from the fact that a Node cannot resolve the K8s cluster’s service FQDNs, everything remains the same. In this case, administrators can curl the NGINX Ingress Controller’s Service cluster IP (10.0.37.21) instead.

External Environment

Aside from the fact that administrators/users need to use the NGINX Ingress public IP address to visit the service, everything else remains the same.

It is safe to say that, at the last mile, it is the NGINX Ingress Controller that decides where the network request is routed.

As part 1 and part 2 of this series contain a lot of information, it took me almost 2 whole weeks to gather everything needed to understand the working theory and to perform the associated tests. I hope this provides more detail and clarity for people who want to understand K8s networking with the kubenet CNI!

Reference

Kubernetes NodePort and iptables rules | Ronak Nathani

A Deep Dive into Kubernetes External Traffic Policies — Andrew Sy Kim (asykim.com)

Demystifying Kubernetes Services Packet Path | by Abhishek Mitra | Medium
