from there, we could do `kubectl debug` into the container & look around for any huge directories. this app creates its temporary files in one flat directory:
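a minimal sketch of that inspection (pod, container & image names here are made up):

```
# attach an ephemeral debug container that shares the app container's
# pid namespace (all names below are placeholders)
kubectl -n prod debug -it my-app-7d9c5b6f4-x2k8p \
  --image=busybox:1.36 --target=my-app -- sh

# the target's root filesystem is reachable through /proc/1/root;
# look for directories with a huge size or entry count
du -xsh /proc/1/root/tmp/* 2>/dev/null | sort -h | tail
ls /proc/1/root/tmp | wc -l
```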
at this point, we made the mistake of rollout-restarting the deployment. the new pods started working fine right away, but the old pods stuck around in `Terminating` state: linux was busy removing the files in their overlayfs mounts, including the giant directories above. that caused ~3 hours of high write iops committing filesystem metadata on the nodes, which slowed down other, unrelated workloads. we had to cordon those nodes for the duration and move the more important pods elsewhere.
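the node-side firefighting looked roughly like this (node & pod names are made up):

```
# stop new pods from landing on the overloaded nodes
kubectl cordon node-17 node-18

# see what is still running there
kubectl get pods -A -o wide --field-selector spec.nodeName=node-17

# evict the latency-sensitive pods; with the node cordoned, their
# replacements get scheduled onto healthy nodes
kubectl -n prod delete pod important-app-5f6d8b9c7-abcde
```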
the strangest effect was that pods on other nodes also started failing readiness checks at random. it turned out some of their mysqlrouters were running on the heavily loaded nodes. the db clusters themselves were totally fine; they run on different hardware.
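a quick way to confirm that, assuming the routers run as pods with `mysqlrouter` in their names:

```
# every mysqlrouter pod together with the node it landed on
kubectl get pods -A -o wide | grep mysqlrouter

# or filter by one of the overloaded nodes directly
kubectl get pods -A -o wide --field-selector spec.nodeName=node-17
```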
i still don't understand how a network-heavy app can be disturbed this much by disk io. perhaps it checkpoints or logs something in the critical path? but that's now [water under the bridge](https://youtu.be/4G-YQA_bsOU)
after 10s, from `ExitedAt` :44:42 to `deadline exceeded` :44:52, containerd gave up on removing the task, and the orphaned shim stays around from then on.
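the leftovers are easy to spot on the node; something along these lines (the exact containerd namespace depends on the setup):

```
# shim processes that outlived their containers
ps -eo pid,ppid,etime,cmd | grep '[c]ontainerd-shim'

# compare against what containerd still tracks
ctr -n k8s.io tasks list
ctr -n k8s.io containers list
```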
again with ebpf in one hand and the code flow around this area in the other, we think that discarding/unmounting each container's overlay filesystem is io-intensive as well as io-sensitive.
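a rough bpftrace sketch of the kind of measurement we did: it times the umount syscall node-wide, which is close to the snapshot-removal path we cared about (assuming the syscall tracepoints are available on the node's kernel):

```
# latency histogram of umount(2) calls, in microseconds
bpftrace -e '
tracepoint:syscalls:sys_enter_umount { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_umount /@start[tid]/ {
  @umount_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
```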
cloudfoundry discussed [this general problem](https://www.cloudfoundry.org/blog/an-overlayfs-journey-with-the-garden-team/). coccoc is following newer containerd 1.6 LTS releases with great interest for fixes around overlay deletion handling. some recent releases improved things for short-lived, temporary mounts by marking them readonly. the containerd maintainers also made a [great reproduction](https://github.com/containerd/containerd/pull/9004/files#diff-1d0d1c3863f35bb86ef37975c4e1a2062e6ca71e6f6a94dc385f8a3556284ddcR117) with `strace`'s fault-injection feature:
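paraphrasing the idea rather than quoting their exact test: inject delays or errors into the filesystem syscalls that the cleanup path makes, and watch how containerd and the shim cope (`./cleanup-under-test` here is a stand-in for whatever exercises that path):

```
# make every unlinkat() hang for ~100ms, simulating a very slow disk
strace -f -e trace=unlinkat -e inject=unlinkat:delay_enter=100000 \
  ./cleanup-under-test

# or make umount2() fail outright with EIO
strace -f -e trace=umount2 -e inject=umount2:error=EIO \
  ./cleanup-under-test
```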
there isn't much else we can do to mitigate this problem, due to the nature of php/nodejs/python applications shipping many loose files in each container, and the way we [pass php files to nginx containers](https://medium.com/coccoc-engineering-blog/our-journey-to-kubernetes-container-design-principles-is-your-back-bag-9166fc4736d2#957e) through a shared `emptyDir` volume.
a more fundamental fix, using overlayfs `volatile` mode to ease the load on the whole system, is [still in the design phase](https://github.com/containerd/containerd/pull/4785).
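for context, `volatile` is a plain overlayfs mount option (kernel 5.10+) that skips all syncing of the upper layer, so teardown has much less to flush. a standalone sketch of such a mount, outside of containerd:

```
mkdir -p /tmp/ovl/{lower,upper,work,merged}
mount -t overlay overlay \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work,volatile \
  /tmp/ovl/merged
```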
onward to the main title. as part of migrating on-host applications to k8s, we mount some files into containers as `hostPath` volumes, and let host cronjobs write new data into them. for a time, this worked correctly:
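the setup, reduced to a sketch with made-up names: a single data file on the host, refreshed by a cronjob, mounted into the pod as a `hostPath` volume:

```
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dict-consumer
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: dict
      mountPath: /data/dict.txt
      readOnly: true
  volumes:
  - name: dict
    hostPath:
      path: /srv/cron-output/dict.txt   # file written by a host cronjob
      type: File
EOF
```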
but an update to the cronjob code introduced a new phenomenon: on the host, we can see the new data in the file, but k8s pods keep reading only the old data.
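the phenomenon is easy to reproduce with a plain bind mount, no kubernetes needed; the key detail (explained in the next paragraph) is that the updated cronjob writes a temp file and `mv`s it over the old path:

```
# a source file and a bind-mounted "view" of it, like hostPath does
echo old > /srv/dict.txt
touch /mnt/dict.txt
mount --bind /srv/dict.txt /mnt/dict.txt

# the "cronjob": write a new file, then rename it over the old path
echo new > /srv/dict.txt.tmp
mv /srv/dict.txt.tmp /srv/dict.txt

cat /srv/dict.txt   # -> new
cat /mnt/dict.txt   # -> old: the bind mount still points at the old inode
```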
`hostPath` is implemented as a bind mount, so the path is "translated" to a specific inode once, at pod setup time. after `mv` pointed the path at a different inode, the old inode `68812816` was kept alive only by the mount namespace. it's similar to a running process holding a deleted file open, which shows up as `DEL` in `lsof` listings. but this 0-link file is still reachable from the host, via the container's `root/` under `/proc`:
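on the node, that looks roughly like this (container id and paths are placeholders):

```
# grab a pid from inside the container
PID=$(crictl inspect --output go-template --template '{{.info.pid}}' <container-id>)

# the path inside the container's mount namespace still resolves to the
# old inode with 0 links, while the host path shows the new one
stat -c '%i links=%h size=%s' /proc/$PID/root/data/dict.txt
stat -c '%i links=%h size=%s' /srv/cron-output/dict.txt
```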
our mitigation for this one was moving the `hostPath` up a level, sharing the more stable parent directory as the volume instead. it would still break the same way if someone renamed the directory, but that's much less likely. going further, we'll work with the other teams to share these data updates in a more robust way.