Azure Kubernetes Service (AKS) Networking Deep Dive Part 2 — Pod and Service Communication & Pod and Ingress Communication
Pod and Service Communication
To understand how external clients access services provided by Pods, we will create a simple NGINX Deployment with 2 replicas, then expose it with a native Load Balancer Service.
# create a NGINX Deployment with 2 replicas
- kubectl create deployment my-nginx --image=nginx --replicas=2
# check Pods to see each Pod's IP address and the Node it is scheduled on
- kubectl get pods -o wide
# expose the Deployment with Load Balancer Service
- kubectl expose deployment my-nginx --type=LoadBalancer --port=80 --name=my-nginx-service
# check service to see its cluster IP address, Node port and external IP address
- kubectl get services
# scale the deployment as you like. "x" and "y" are placeholders for any positive integers.
- kubectl scale --current-replicas=x --replicas=y deployment/my-nginx
For Node ports, we could execute the command below to see whether port 32643 on each Node is actually listening for external clients.
# enter each Node with Node Shell and execute
- netstat -tlp
Ref: For more information on "netstat", please check here.
# check service iptables rules on one of the nodes.
# this would be applied on all nodes in the cluster.
- iptables -t nat -L KUBE-NODEPORTS -n | column -t
To see the details of how the traffic gets network address translated (NAT), use the command below to look at the NAT table in iptables. We will focus only on the last 2 rules, as both IP addresses, 10.0.x.x and 20.x.x.x, appear in the previous screenshots, meaning these 2 rules are related to the service we just created.
Any external client visiting the load balancer IP address “20.x.x.x” would hit “KUBE-FW-BMQ5UNGRIS2RIY35”.
# get node shell Pod name
- kubectl get pods
# enter the node shell
- kubectl exec <node shell pod name> -it -- /bin/bash
# check iptables rule chains
- iptables -t nat -L KUBE-SERVICES -n | column -t
Once the network packets enter the “KUBE-FW-BMQ5UNGRIS2RIY35” chain, they are marked for SNAT by “KUBE-MARK-MASQ”.
- iptables -t nat -L KUBE-FW-BMQ5UNGRIS2RIY35 -n | column -t
When the network packets go to “KUBE-MARK-MASQ”, they are masqueraded.
Masquerade is a form of SNAT in iptables that rewrites a packet's source address to that of the outgoing interface, so the traffic can be routed back without disrupting the original connection.
Simply put, the network packets are tagged with the mark 0x4000. With this tag, the Linux system knows to source network address translate (SNAT) them to the Node's private IP address. If you would like to see more examples, you could also check out “KUBE-POSTROUTING”.
- iptables -t nat -L KUBE-MARK-MASQ -n | column -t
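As a rough illustration (not kube-proxy's actual code), the masquerade mark is just one bit in the packet's firewall mark; the 0x4000 constant matches kube-proxy's default, everything else here is illustrative only:

```python
# Sketch of the 0x4000 masquerade-mark logic used by kube-proxy.
KUBE_MARK_MASQ = 0x4000  # default masquerade bit

def mark_for_masquerade(fwmark: int) -> int:
    # equivalent of: -j MARK --set-xmark 0x4000/0x4000
    return fwmark | KUBE_MARK_MASQ

def needs_snat(fwmark: int) -> bool:
    # KUBE-POSTROUTING masquerades only packets carrying the bit
    return fwmark & KUBE_MARK_MASQ == KUBE_MARK_MASQ

print(needs_snat(mark_for_masquerade(0x0)))  # True
print(needs_snat(0x0))                       # False
```

Because the mark travels with the packet through the rest of the chains, the actual SNAT can be deferred until POSTROUTING.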
The network packets would then go to “KUBE-SVC-BMQ5UNGRIS2RIY35”.
- iptables -t nat -L KUBE-SVC-BMQ5UNGRIS2RIY35 -n | column -t
The network packets would be routed randomly to either “KUBE-SEP-RH2TKPYPHJIEMIWG” or “KUBE-SEP-36IDCPEH467QNWVS”. This shows how many actual endpoints sit behind the K8s Load Balancer service. In the end, the Load Balancer private IP address “10.0.x.x” needs to be translated to the endpoint of an actual Pod providing the service.
One interesting thing in this rule chain is that the load-balancing probabilities differ as more Pods provide the service. With 2 Pods, the probability of going to either one is 50%, hence the 0.5 on screen. With 3 Pods, the first rule shows 0.33, as the request could hit any of the 3 endpoints. The second rule then shows 0.5: iptables rule chains are read top down, so if the request did not match the first rule, the first endpoint is out of the picture and the probability is calculated over the remaining 2 Pods. The same logic continues down the chain, with the last rule always matching.
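The per-rule probabilities above can be reproduced with a few lines (a sketch of the arithmetic, not how iptables computes them internally):

```python
def statistic_probabilities(num_endpoints: int) -> list[float]:
    """Probability written on each KUBE-SEP rule so that, read top
    down, every endpoint ends up equally likely overall."""
    # rule i (0-based) picks among the remaining (num_endpoints - i) endpoints
    return [1 / (num_endpoints - i) for i in range(num_endpoints)]

print(statistic_probabilities(2))  # [0.5, 1.0]
print(statistic_probabilities(3))  # [0.3333333333333333, 0.5, 1.0]
```

Multiplying out the "fall-through" chances confirms each endpoint still gets an equal 1/n share of the traffic overall.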
We could pick either “KUBE-SEP-*” chain; it does not matter. In this rule, the network packets are DNATed to the endpoint of one of the Pods seen in the previous screenshot.
- iptables -t nat -L KUBE-SEP-RH2TKPYPHJIEMIWG -n | column -t
You might not get the full DNAT rule detail because of the Ubuntu server version. If you are not able to upgrade the server to the next version, we could get into the kube-proxy container's shell for the details.
# get the kube-proxy Pod's name
- kubectl get pods -n kube-system -o wide | grep kube-proxy
# get into the kube-proxy Pod's shell
- kubectl exec -it <Pod name> -n kube-system -- /bin/sh
# go through the iptables rules of the service's endpoint chain
- iptables-save | grep <KUBE-SEP-*>
# check again what are the Pod's endpoints.
- kubectl get pods -o wide | grep my-nginx
The network packets would be dropped if no DNAT endpoints could be found, for example, when the K8s Service selector does not match the Deployment's labels.
# drop the network packets if DNAT endpoints could not be found
- iptables -t nat -L KUBE-MARK-DROP -n | column -t
To know more about iptables, please check this site.
Here is the flow of how external clients reach a K8s service exposed with a Load Balancer. Please ignore the green rectangle that is not connected with the orange one, as it just represents a non-existent Pod endpoint.
Load Balancer External Traffic Policy
The default K8s Load Balancer uses the external traffic policy “Cluster”, meaning the traffic may travel through the whole cluster to find the actual Pod endpoints. The upside is that the request will definitely be routed to the service-providing Pods; the downside is the extra hops inside the cluster, consuming resources when it does not need to.
The iptables rule chains would be looking similar to below.
The other external traffic policy is “Local”, meaning the traffic is only routed within the node itself instead of across the whole cluster. The upside is that there are no extra hops inside the cluster, but the network request is simply dropped when the node hosts no Pod endpoint providing the service.
The iptables rule chains would be looking similar to below.
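For reference, the policy is set on the Service spec. A minimal manifest for the example service above might look like this (the name, port, and `app: my-nginx` selector are assumed from the earlier `kubectl create deployment` / `kubectl expose` commands):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-nginx-service
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # default is Cluster
  selector:
    app: my-nginx
  ports:
    - port: 80
      targetPort: 80
```

A side benefit of “Local” is that the client source IP is preserved, since no cluster-internal SNAT hop is needed.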
When Pods try to reach back out to the external environment, the iptables rule chain is quite similar to the inbound one. The Pod's endpoint is masqueraded to the Node's private IP address. The Linux system then reads the conntrack record to know which destination IP it should reach back to, and the source IP is further masqueraded to the cluster's public IP address.
Connection Tracking (conntrack)
By now we know that whenever external clients contact services provided by K8s, both source and destination need to be NATed. A facility called “conntrack” inside the Linux iptables netfilter framework tracks and maintains connection state and NAT information.
Below are several basic conntrack commands showing how to retrieve the information needed. To get more information, please check this site.
# list out all conntrack records
- conntrack -L
# try to connect to the service "my-nginx" from any external environment
# get the endpoints of the service once again
- kubectl get pods -o wide | grep my-nginx
# try to locate the established connection with one of the Pod's endpoint from conntrack
# since we do not know which endpoint the external client would be hitting, we could only test
- conntrack -L -d 10.244.0.9
conntrack has its downsides, and you can find many real-world scenarios on the Net. Here is a paragraph from Tigera:
The conntrack table has a configurable maximum size and, if it fills up, connections will typically start getting rejected or dropped. For most workloads, there’s plenty of headroom in the table and this will never be an issue. However, there are a few scenarios where the conntrack table needs a bit more thought:
The most obvious case is if your server handles an extremely high number of simultaneously active connections. For example, if your conntrack table is configured to be 128k entries but you have >128k simultaneous connections, you’ll definitely hit issues!
The slightly less obvious case is if your server handles an extremely high number of connections per second. Even if the connections are short-lived, connections continue to be tracked by Linux for a short timeout period (120s by default). For example, if your conntrack table is configured to be 128k entries, and you are trying to handle 1,100 connections per second, that’s going to exceed the conntrack table size even if the connections are very short-lived (128k / 120s = 1092 connections/s).
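Tigera's 128k / 120s figure can be checked directly with back-of-the-envelope arithmetic (a sketch; 120 s is the default post-close tracking window mentioned in the quote):

```python
def max_sustained_conn_rate(table_size: int, timeout_s: float) -> float:
    """Rough ceiling on new connections per second before the
    conntrack table fills, assuming every entry lives for timeout_s."""
    return table_size / timeout_s

# 128k entries with the 120 s tracking window from the quote:
# roughly 1092 connections per second
print(max_sustained_conn_rate(128 * 1024, 120))
```

So even short-lived connections at ~1,100/s would exhaust a 128k-entry table, matching the quoted numbers.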
Quoted from here
Pod and Ingress Communication
We would first install the NGINX Ingress Controller with the following one-liner.
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v0.45.0/deploy/static/provider/cloud/deploy.yaml
Reference
Then, we create 2 simple services, each with 1 Pod as backend, for the NGINX Ingress Controller to work with. Follow this section of the official documentation and we should have at least 2 services ready behind the NGINX Ingress. Please make sure to create the example services within the default NGINX Ingress namespace, “ingress-nginx”.
The NGINX Ingress should look similar to the diagram below, except each service has only 1 Pod instead of 2.
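A minimal Ingress rule for this setup might look like the sketch below; the resource name, paths, and backend service names are assumptions based on the walkthrough (adjust them to whatever the official documentation example actually creates):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-world-ingress
  namespace: ingress-nginx
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          # requests without a more specific path fall through to service one
          - path: /
            pathType: Prefix
            backend:
              service:
                name: aks-hello-world-one
                port:
                  number: 80
          - path: /hello-world-two
            pathType: Prefix
            backend:
              service:
                name: aks-hello-world-two
                port:
                  number: 80
```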
After deployment, we could execute the following commands to know all the endpoint information.
# get pods providing the 2 simple services
- kubectl get pods -n ingress-nginx
# get the 2 services associated with NGINX Ingress
- kubectl get svc -n ingress-nginx
# get the Ingress rule
- kubectl get ingress -n ingress-nginx
# describe the Ingress rule
- kubectl describe ingress <ingress rule name> -n ingress-nginx
Test visiting one of the services from the external environment through the Ingress. Without specifying a path, it should go to aks-hello-world-one by default.
Let’s deep dive into the iptables rule chains as we did for Pod and Service communication.
# get the node's name
- kubectl get nodes
# get into the node's shell. Reference for installation.
- kubectl node-shell <node name>
# go through the iptables rule chains of the newly created NGINX Ingress
- iptables -t nat -L KUBE-SERVICES/xxx/yyy -n | column -t
If you are not seeing the DNAT rule in the end, we could again try to get into the kube-proxy Pod's shell on each Node for more details.
# get kube-proxy Pod's name
- kubectl get pods -o wide -n kube-system | grep kube-proxy
# get into the Pod's shell to see more details
- kubectl exec -it <kube-proxy Pod name> -n kube-system -- /bin/sh
- iptables-save | grep < iptables rule chain name>
If we check which Pod owns the DNAT private IP address, we would find it is the NGINX Ingress Controller.
- kubectl get pods -n ingress-nginx -o wide | grep <DNAT Pod IP>
That gives us some idea of how NGINX Ingress actually routes incoming requests to the correct endpoints. Let’s look at conntrack when connecting to the NGINX Ingress from a Pod, a Node, and the external environment.
From a Pod to the NGINX Ingress Service, since it is within the cluster, the service FQDN should resolve to 10.0.37.21, and the request is then routed to the correct endpoint. In this case, without specifying a path, it should go to “aks-hello-world-one”.
# create a pod within the cluster and execute inside the shell
- kubectl exec -it <pod name> -- /bin/bash
# visit the service through its FQDN
- curl ingress-nginx-controller.ingress-nginx.svc.cluster.local
# enter the Node shell hosting the Pod and use conntrack to figure out the connection state
We find the source IP is not the Pod's (10.244.5.4) but the NGINX Ingress Controller Pod's (10.244.5.2).
From a Node, besides the fact that the Node cannot resolve the K8s cluster's service FQDN, everything remains the same. In this case, administrators could curl the NGINX Ingress Controller's service cluster IP (10.0.37.21) instead.
From the external environment, besides the fact that administrators/users need to use the NGINX Ingress public IP address to visit the service, everything else remains the same.
It is safe to say that at the last mile, it is the NGINX Ingress Controller that decides where the network request should be routed.
As part 1 and part 2 of this series contain a lot of information, it took me almost 2 whole weeks to gather everything needed to understand the working theory and to perform the associated tests. I hope this provides more detail and clarity to people who want to understand more about K8s networking with the Kubenet CNI!