Issues with logging feature for Cloud Scale
Useful commands for troubleshooting
To get the list of nodes:
$ kubectl get nodes
To view the information (such as Taints applied), describe the node:
$ kubectl describe node <node name>
To view which nodes a pod is assigned to:
$ kubectl get pods -A -o wide
The -o wide output includes additional columns, including the node each pod is assigned to.
To obtain information about the fluentbit DaemonSet, run the describe command on it:
$ kubectl describe ds nb-fluentbit-daemonset -n netbackup
This command displays how many DaemonSet pods are desired, scheduled, and ready.
If taints and tolerations are not configured properly, DaemonSet pods are not assigned to the tainted nodes, and the container and pod logs on those nodes are not collected. This happens when tolerations are missing or not set up correctly in the values.yaml file.
The values can be viewed using the following commands:
To list the DaemonSets in the NetBackup namespace:
$ kubectl get ds -n <netbackup namespace>
To view tolerations:
$ kubectl edit ds -n <netbackup namespace> nb-fluentbit-daemonset
The tolerations can be found in the editor that opens (vi by default). If no change is required, exit without saving.
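For reference, a toleration entry in values.yaml generally takes the shape below. The key and value are placeholders and must be matched to the actual taint reported by kubectl describe node:

```yaml
# Illustrative values.yaml fragment -- the key/value pair must match the
# taint shown by "kubectl describe node <node name>".
tolerations:
  - key: "<taint key>"
    operator: "Equal"
    value: "<taint value>"
    effect: "NoSchedule"
```

When only the taint key matters, a toleration with operator: "Exists" and no value field can be used instead.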
The following error messages appear when fluentbit scans a log location that has permission issues:
[error] [input:tail:tail.0] read error, check permissions: /mnt/nblogs/*/*/*.log
[ warn] [input:tail:tail.0] error scanning path: /mnt/nblogs/*/*/*.log
[error] [input:tail:tail.0] read error, check permissions: /mnt/nblogs/*/*/*/*.log
[ warn] [input:tail:tail.0] error scanning path: /mnt/nblogs/*/*/*/*.log
[error] [input:tail:tail.0] read error, check permissions: /mnt/nblogs/*/*/*.log
[ warn] [input:tail:tail.0] error scanning path: /mnt/nblogs/*/*/*.log
These error messages appear in the sidecar logs, which are picked up by the DaemonSet pods and stored in the collector pod under the pod in which the sidecar resides. If this error occurs, some application logs associated with the sidecar may be missing from the collector.
Workaround:
Exec into the sidecar and determine which folder has permission issues.
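The check inside the sidecar can be sketched as follows. The pod and container names are placeholders, and the second half is a local, self-contained demonstration of the same permission check on a throwaway directory tree:

```shell
# Sketch only: pod and container names are placeholders; substitute your own.
# Exec into the sidecar of the affected pod, for example:
#   kubectl exec -it <pod name> -c <sidecar container> -n netbackup -- sh
# Inside the container, list directories under the log mount that are not
# world-readable (a common cause of the "check permissions" errors above):
#   find /mnt/nblogs -mindepth 1 -type d ! -perm -004
#   ls -ld /mnt/nblogs/*

# Local demonstration of the same permission check on a throwaway tree:
LOGROOT="$(mktemp -d)"
mkdir -p "$LOGROOT/ok" "$LOGROOT/broken"
chmod 755 "$LOGROOT/ok"
chmod 700 "$LOGROOT/broken"   # group/others cannot read this directory
UNREADABLE="$(find "$LOGROOT" -mindepth 1 -type d ! -perm -004)"
echo "$UNREADABLE"            # reports only the "broken" directory
```

Once the offending folder is identified, correct its ownership or mode (for example with chmod or chown) so that the fluentbit process can traverse and read it.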
If incorrect labels are added to the .yaml file, no DaemonSet pod runs on the affected node, and logs are not collected for the pods on that node.
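The label mismatch shows up through the DaemonSet's node selector: pods are scheduled only on nodes whose labels match it. A minimal illustrative fragment, with the label key and value as placeholders:

```yaml
# Illustrative fragment -- a DaemonSet pod is scheduled only on nodes
# that carry this exact label.
nodeSelector:
  <label key>: <label value>
```

Compare the selector against the actual node labels with kubectl get nodes --show-labels.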
On Azure, if any Cloud Scale node is configured on the agent pool (system pool), the DaemonSet is not able to collect logs from that node.