Bug 1981946 - IP address and FQDN not displayed in ovirt-engine even though latest guest agent is installed
Summary: IP address and FQDN not displayed in ovirt-engine even though latest guest agent is installed
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.40.70.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.9
Assignee: Tomáš Golembiovský
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks: 2003432
 
Reported: 2021-07-13 18:56 UTC by GervaisdeM
Modified: 2021-11-04 19:28 UTC (History)
8 users

Fixed In Version: vdsm-4.40.90.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2003432
Environment:
Last Closed: 2021-10-25 15:47:42 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+
ahadas: blocker-


Attachments
engine.log, vdsm.log from a host, output from getStats commands (2.18 MB, application/zip)
2021-07-13 18:56 UTC, GervaisdeM
no flags Details
getStats.out vdsm.log vdsm.log.1.xz (1.47 MB, application/zip)
2021-08-09 14:50 UTC, GervaisdeM
no flags Details
vdsm-debug-log-2021-09-24 (1.47 MB, text/plain)
2021-09-24 19:08 UTC, GervaisdeM
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 116839 0 master POST virt: trace how guest info is passed around 2021-10-19 11:25:57 UTC
oVirt gerrit 116840 0 master NEW virt: add logger for guest agent information 2021-09-22 18:27:51 UTC
oVirt gerrit 117211 0 master MERGED qga: continue and process all VMs 2021-10-25 09:56:02 UTC
oVirt gerrit 117243 0 ovirt-4.4.z MERGED qga: continue and process all VMs 2021-10-25 14:20:52 UTC

Description GervaisdeM 2021-07-13 18:56:58 UTC
Created attachment 1801252 [details]
engine.log, vdsm.log from a host, output from getStats commands

Description of problem:
The IP address and FQDN are not getting to the ovirt-engine. The interface shows an "!" and says that the latest guest agent needs to be installed and running on the guest OS, even though it is installed.

The Vagrant plugin we use queries the API to work out which IP it needs to communicate with the VMs being brought up by our development team, and it fails because it cannot work out what IP the VM has.
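
For illustration, the query is roughly of this shape (engine host, credentials and VM UUID below are placeholders; endpoint names are per the oVirt 4.4 REST API):

```
# When the bug hits, the reported devices come back without IP addresses
# and the VM's <fqdn> element is empty.
curl -sk -u 'admin@internal:PASSWORD' \
  'https://engine.example.com/ovirt-engine/api/vms/<vm-uuid>/reporteddevices'
curl -sk -u 'admin@internal:PASSWORD' \
  'https://engine.example.com/ovirt-engine/api/vms/<vm-uuid>' | grep -i '<fqdn>'
```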

Some VMs show the correct information while others based on the same template do not. Migrating a VM from one host to another sometimes causes the IP to appear in the system.

Version-Release number of selected component (if applicable):


How reproducible:
In my case, it is very easy to reproduce, but the issue does not always happen. If the server has been rebooted or vdsm has been restarted recently then things seem to work OK for a while.

Steps to Reproduce:
Bring up a VM from a previously known-good template.

Actual results:
IP intermittently does not appear in ovirt-engine.

Expected results:
The IP and FQDN should be properly sent to ovirt-engine, since qemu-guest-agent is installed.

Additional info:
This affects VMs based on templates running Ubuntu 18.04/20.04 and Red Hat 7/8.

Comment 1 RHEL Program Management 2021-07-14 06:30:04 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Tomáš Golembiovský 2021-07-14 10:24:51 UTC
There is no obvious issue on the VDSM side. The VMs are queried regularly, but there is no guest info in the stats.
Once you notice this behavior on the VM again, could you please share the output of the following commands:

    virsh guestinfo noel-ovirt-biaa-11
    virsh domifaddr noel-ovirt-biaa-11 --source agent
    virsh qemu-agent-command noel-ovirt-biaa-11 '{"execute":"guest-network-get-interfaces"}'

Hopefully this will give us some clues.

Comment 3 GervaisdeM 2021-07-15 20:59:54 UTC
The VM (noel-ovirt-biaa-11) that was not displaying an IP is displaying one now.

I ran the requested commands on several VMs that were not displaying an IP in the web interface. The last VM was one I spun up with Vagrant; it is currently hanging and will eventually time out.
Vagrant run stage: ==> i8manage_nopuppet_test_ov: Waiting for VM to get an IP address... 

Hope this is not too much, but I ran this on multiple nodes with VMs running different OSes in case there is a clue in there somewhere.

Thanks,
Gervais

[root ~]# virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # guestinfo noel-ovirt-dhs-22
user.count          : 0
os.id               : ubuntu
os.name             : Ubuntu
os.pretty-name      : Ubuntu 20.04.2 LTS
os.version          : 20.04.2 LTS (Focal Fossa)
os.version-id       : 20.04
os.machine          : x86_64
os.kernel-release   : 5.4.0-66-generic
os.kernel-version   : #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
timezone.name       : UTC
timezone.offset     : 0
hostname            : noel-ovirt-dhs-22
fs.count            : 8
fs.0.name           : sda2
fs.0.mountpoint     : /
fs.0.fstype         : ext4
fs.0.total-bytes    : 500507766784
fs.0.used-bytes     : 28232536064
fs.0.disk.count     : 1
fs.0.disk.0.alias   : sda
fs.0.disk.0.serial  : 0QEMU_QEMU_HARDDISK_8a9e76e7-6827-4248-9dd4-29c3d1c9771c
fs.0.disk.0.device  : /dev/sda2
fs.1.name           : sda2
fs.1.mountpoint     : /sftp/sftp_user/ingest_data
fs.1.fstype         : ext4
fs.1.total-bytes    : 500507766784
fs.1.used-bytes     : 28232536064
fs.1.disk.count     : 1
fs.1.disk.0.alias   : sda
fs.1.disk.0.serial  : 0QEMU_QEMU_HARDDISK_8a9e76e7-6827-4248-9dd4-29c3d1c9771c
fs.1.disk.0.device  : /dev/sda2
fs.2.name           : loop0
fs.2.mountpoint     : /snap/core18/2074
fs.2.fstype         : squashfs
fs.2.total-bytes    : 58195968
fs.2.used-bytes     : 58195968
fs.2.disk.count     : 0
fs.3.name           : loop1
fs.3.mountpoint     : /snap/core18/1880
fs.3.fstype         : squashfs
fs.3.total-bytes    : 57671680
fs.3.used-bytes     : 57671680
fs.3.disk.count     : 0
fs.4.name           : loop2
fs.4.mountpoint     : /snap/lxd/16099
fs.4.fstype         : squashfs
fs.4.total-bytes    : 74842112
fs.4.used-bytes     : 74842112
fs.4.disk.count     : 0
fs.5.name           : loop3
fs.5.mountpoint     : /snap/lxd/20326
fs.5.fstype         : squashfs
fs.5.total-bytes    : 70909952
fs.5.used-bytes     : 70909952
fs.5.disk.count     : 0
fs.6.name           : loop5
fs.6.mountpoint     : /snap/snapd/12398
fs.6.fstype         : squashfs
fs.6.total-bytes    : 33947648
fs.6.used-bytes     : 33947648
fs.6.disk.count     : 0
fs.7.name           : loop4
fs.7.mountpoint     : /snap/snapd/8542
fs.7.fstype         : squashfs
fs.7.total-bytes    : 31457280
fs.7.used-bytes     : 31457280
fs.7.disk.count     : 0

virsh # domifaddr noel-ovirt-dhs-22 --source agent
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 lo         00:00:00:00:00:00    ipv4         127.0.0.1/8
 -          -                    ipv6         ::1/128
 enp1s0     56:6f:65:8b:00:09    ipv4         192.168.173.25/16
 -          -                    ipv6         fe80::546f:65ff:fe8b:9/64

virsh # qemu-agent-command noel-ovirt-dhs-22 '{"execute":"guest-network-get-interfaces"}'
{"return":[{"name":"lo","ip-addresses":[{"ip-address-type":"ipv4","ip-address":"127.0.0.1","prefix":8},{"ip-address-type":"ipv6","ip-address":"::1","prefix":128}],"statistics":{"tx-packets":853428,"tx-errs":0,"rx-bytes":427632008,"rx-dropped":0,"rx-packets":853428,"rx-errs":0,"tx-bytes":427632008,"tx-dropped":0},"hardware-address":"00:00:00:00:00:00"},{"name":"enp1s0","ip-addresses":[{"ip-address-type":"ipv4","ip-address":"192.168.173.25","prefix":16},{"ip-address-type":"ipv6","ip-address":"fe80::546f:65ff:fe8b:9","prefix":64}],"statistics":{"tx-packets":62180,"tx-errs":0,"rx-bytes":128789706,"rx-dropped":171968,"rx-packets":1166301,"rx-errs":0,"tx-bytes":20379261,"tx-dropped":0},"hardware-address":"56:6f:65:8b:00:09"}]}

virsh # ^C
[root ~]# exit
Connection to ovirt1 closed.



ssh ovirt2
Last login: Thu Jul 15 19:49:35 2021 from 192.168.168.1

  node status: OK
  See `nodectl check` for more information

Admin Console: https://10.10.1.2:9090/ or https://192.168.168.11:9090/ or https://10.10.2.2:9090/


  node status: OK
  See `nodectl check` for more information

Admin Console: https://10.10.1.2:9090/ or https://192.168.168.11:9090/ or https://10.10.2.2:9090/


Welcome to ovirt2.dgi 
[root ~]# virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # guestinfo alexander-2004
user.count          : 0
os.id               : ubuntu
os.name             : Ubuntu
os.pretty-name      : Ubuntu 20.04.2 LTS
os.version          : 20.04.2 LTS (Focal Fossa)
os.version-id       : 20.04
os.machine          : x86_64
os.kernel-release   : 5.4.0-66-generic
os.kernel-version   : #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
timezone.name       : UTC
timezone.offset     : 0
hostname            : vagrantbase-ubuntu2004
fs.count            : 7
fs.0.name           : sda2
fs.0.mountpoint     : /
fs.0.fstype         : ext4
fs.0.total-bytes    : 500507766784
fs.0.used-bytes     : 15814242304
fs.0.disk.count     : 1
fs.0.disk.0.alias   : sda
fs.0.disk.0.serial  : 0QEMU_QEMU_HARDDISK_c1bcd399-bbcb-4758-a8a1-4632d79e5d1e
fs.0.disk.0.device  : /dev/sda2
fs.1.name           : loop1
fs.1.mountpoint     : /snap/lxd/16099
fs.1.fstype         : squashfs
fs.1.total-bytes    : 74842112
fs.1.used-bytes     : 74842112
fs.1.disk.count     : 0
fs.2.name           : loop0
fs.2.mountpoint     : /snap/core18/1880
fs.2.fstype         : squashfs
fs.2.total-bytes    : 57671680
fs.2.used-bytes     : 57671680
fs.2.disk.count     : 0
fs.3.name           : loop2
fs.3.mountpoint     : /snap/snapd/8542
fs.3.fstype         : squashfs
fs.3.total-bytes    : 31457280
fs.3.used-bytes     : 31457280
fs.3.disk.count     : 0
fs.4.name           : loop3
fs.4.mountpoint     : /snap/snapd/12398
fs.4.fstype         : squashfs
fs.4.total-bytes    : 33947648
fs.4.used-bytes     : 33947648
fs.4.disk.count     : 0
fs.5.name           : loop4
fs.5.mountpoint     : /snap/core18/2074
fs.5.fstype         : squashfs
fs.5.total-bytes    : 58195968
fs.5.used-bytes     : 58195968
fs.5.disk.count     : 0
fs.6.name           : loop5
fs.6.mountpoint     : /snap/lxd/20326
fs.6.fstype         : squashfs
fs.6.total-bytes    : 70909952
fs.6.used-bytes     : 70909952
fs.6.disk.count     : 0

virsh # domifaddr alexander-2004 --source agent
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 lo         00:00:00:00:00:00    ipv4         127.0.0.1/8
 -          -                    ipv6         ::1/128
 enp1s0     56:6f:65:8b:00:0c    ipv4         192.168.173.28/16
 -          -                    ipv6         fe80::546f:65ff:fe8b:c/64

virsh # qemu-agent-command alexander-2004 '{"execute":"guest-network-get-interfaces"}'
{"return":[{"name":"lo","ip-addresses":[{"ip-address-type":"ipv4","ip-address":"127.0.0.1","prefix":8},{"ip-address-type":"ipv6","ip-address":"::1","prefix":128}],"statistics":{"tx-packets":637,"tx-errs":0,"rx-bytes":60896,"rx-dropped":0,"rx-packets":637,"rx-errs":0,"tx-bytes":60896,"tx-dropped":0},"hardware-address":"00:00:00:00:00:00"},{"name":"enp1s0","ip-addresses":[{"ip-address-type":"ipv4","ip-address":"192.168.173.28","prefix":16},{"ip-address-type":"ipv6","ip-address":"fe80::546f:65ff:fe8b:c","prefix":64}],"statistics":{"tx-packets":52758,"tx-errs":0,"rx-bytes":433033164,"rx-dropped":447839,"rx-packets":2944392,"rx-errs":0,"tx-bytes":6164774,"tx-dropped":0},"hardware-address":"56:6f:65:8b:00:0c"}]}

virsh # exit

[root ~]# virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # guestinfo alexandria2
user.count          : 0
os.id               : centos
os.name             : CentOS Stream
os.pretty-name      : CentOS Stream 8
os.version          : 8
os.version-id       : 8
os.machine          : x86_64
os.kernel-release   : 4.18.0-315.el8.x86_64
os.kernel-version   : #1 SMP Mon Jun 28 19:09:44 UTC 2021
timezone.name       : ADT
timezone.offset     : -10800
hostname            : alexandria.dgi
fs.count            : 2
fs.0.name           : dm-0
fs.0.mountpoint     : /
fs.0.fstype         : xfs
fs.0.total-bytes    : 29827903488
fs.0.used-bytes     : 3050864640
fs.0.disk.count     : 1
fs.0.disk.0.alias   : vda
fs.0.disk.0.serial  : d4b74174-3778-4fee-9
fs.0.disk.0.device  : /dev/vda2
fs.1.name           : vda1
fs.1.mountpoint     : /boot
fs.1.fstype         : ext4
fs.1.total-bytes    : 952840192
fs.1.used-bytes     : 299028480
fs.1.disk.count     : 1
fs.1.disk.0.alias   : vda
fs.1.disk.0.serial  : d4b74174-3778-4fee-9
fs.1.disk.0.device  : /dev/vda1
disk.count          : 6
disk.0.name         : /dev/vda1
disk.0.partition    : yes
disk.0.dependency.count: 1
disk.0.dependency.0.name: /dev/vda
disk.1.name         : /dev/vda2
disk.1.partition    : yes
disk.1.dependency.count: 1
disk.1.dependency.0.name: /dev/vda
disk.2.name         : /dev/vda
disk.2.partition    : no
disk.2.serial       : d4b74174-3778-4fee-9
disk.2.alias        : vda
disk.3.name         : /dev/dm-0
disk.3.partition    : no
disk.3.dependency.count: 1
disk.3.dependency.0.name: /dev/vda2
disk.3.guest_alias  : cl-root
disk.4.name         : /dev/sr0
disk.4.partition    : no
disk.4.serial       : QEMU_DVD-ROM_QM00005
disk.4.alias        : sdc
disk.5.name         : /dev/dm-1
disk.5.partition    : no
disk.5.dependency.count: 1
disk.5.dependency.0.name: /dev/vda2
disk.5.guest_alias  : cl-swap

virsh # domifaddr alexandria2 --source agent
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 lo         00:00:00:00:00:00    ipv4         127.0.0.1/8
 -          -                    ipv6         ::1/128
 enp1s0     00:50:56:bf:47:53    ipv4         192.168.180.135/16
 -          -                    ipv6         fe80::9b49:b62d:1800:49d3/64

virsh # qemu-agent-command alexandria2 '{"execute":"guest-network-get-interfaces"}'
{"return":[{"name":"lo","ip-addresses":[{"ip-address-type":"ipv4","ip-address":"127.0.0.1","prefix":8},{"ip-address-type":"ipv6","ip-address":"::1","prefix":128}],"statistics":{"tx-packets":652,"tx-errs":0,"rx-bytes":56859,"rx-dropped":0,"rx-packets":652,"rx-errs":0,"tx-bytes":56859,"tx-dropped":0},"hardware-address":"00:00:00:00:00:00"},{"name":"enp1s0","ip-addresses":[{"ip-address-type":"ipv4","ip-address":"192.168.180.135","prefix":16},{"ip-address-type":"ipv6","ip-address":"fe80::9b49:b62d:1800:49d3","prefix":64}],"statistics":{"tx-packets":755882,"tx-errs":0,"rx-bytes":2107436985,"rx-dropped":95063,"rx-packets":2266605,"rx-errs":0,"tx-bytes":97774500,"tx-dropped":0},"hardware-address":"00:50:56:bf:47:53"}]}

ssh ovirt2          
Last login: Thu Jul 15 20:19:19 2021 from 192.168.168.1

  node status: OK
  See `nodectl check` for more information

Admin Console: https://10.10.1.2:9090/ or https://192.168.168.11:9090/ or https://10.10.2.2:9090/

  node status: OK
  See `nodectl check` for more information

Admin Console: https://10.10.1.2:9090/ or https://192.168.168.11:9090/ or https://10.10.2.2:9090/


Welcome to ovirt2.dgi 
[root ~]# virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # guestinfo gervais-ovirt-i8manage-test
user.count          : 0
os.id               : ubuntu
os.name             : Ubuntu
os.pretty-name      : Ubuntu 20.04.2 LTS
os.version          : 20.04.2 LTS (Focal Fossa)
os.version-id       : 20.04
os.machine          : x86_64
os.kernel-release   : 5.4.0-66-generic
os.kernel-version   : #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
timezone.name       : UTC
timezone.offset     : 0
hostname            : vagrantbase-ubuntu2004
fs.count            : 4
fs.0.name           : sda2
fs.0.mountpoint     : /
fs.0.fstype         : ext4
fs.0.total-bytes    : 500507766784
fs.0.used-bytes     : 4842360832
fs.0.disk.count     : 1
fs.0.disk.0.alias   : sda
fs.0.disk.0.serial  : 0QEMU_QEMU_HARDDISK_eafaac44-c1ca-4ee4-81f4-2b0db5a88f46
fs.0.disk.0.device  : /dev/sda2
fs.1.name           : loop2
fs.1.mountpoint     : /snap/lxd/16099
fs.1.fstype         : squashfs
fs.1.total-bytes    : 74842112
fs.1.used-bytes     : 74842112
fs.1.disk.count     : 0
fs.2.name           : loop0
fs.2.mountpoint     : /snap/core18/1880
fs.2.fstype         : squashfs
fs.2.total-bytes    : 57671680
fs.2.used-bytes     : 57671680
fs.2.disk.count     : 0
fs.3.name           : loop1
fs.3.mountpoint     : /snap/snapd/8542
fs.3.fstype         : squashfs
fs.3.total-bytes    : 31457280
fs.3.used-bytes     : 31457280
fs.3.disk.count     : 0

virsh # domifaddr gervais-ovirt-i8manage-test --source agent
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 lo         00:00:00:00:00:00    ipv4         127.0.0.1/8
 -          -                    ipv6         ::1/128
 enp1s0     56:6f:65:8b:00:67    ipv4         192.168.173.119/16
 -          -                    ipv6         fe80::546f:65ff:fe8b:67/64

virsh # qemu-agent-command gervais-ovirt-i8manage-test '{"execute":"guest-network-get-interfaces"}'
{"return":[{"name":"lo","ip-addresses":[{"ip-address-type":"ipv4","ip-address":"127.0.0.1","prefix":8},{"ip-address-type":"ipv6","ip-address":"::1","prefix":128}],"statistics":{"tx-packets":132,"tx-errs":0,"rx-bytes":10980,"rx-dropped":0,"rx-packets":132,"rx-errs":0,"tx-bytes":10980,"tx-dropped":0},"hardware-address":"00:00:00:00:00:00"},{"name":"enp1s0","ip-addresses":[{"ip-address-type":"ipv4","ip-address":"192.168.173.119","prefix":16},{"ip-address-type":"ipv6","ip-address":"fe80::546f:65ff:fe8b:67","prefix":64}],"statistics":{"tx-packets":11345,"tx-errs":0,"rx-bytes":122338065,"rx-dropped":66,"rx-packets":80174,"rx-errs":0,"tx-bytes":926907,"tx-dropped":0},"hardware-address":"56:6f:65:8b:00:67"}]}

Comment 4 GervaisdeM 2021-07-23 12:51:41 UTC
@

Comment 5 GervaisdeM 2021-07-23 12:58:10 UTC
Apologies for the above comment.

This issue is still huge for our organization. Please let me know if there is anything else I can do to help move this issue along.

Comment 6 Tomáš Golembiovský 2021-07-27 16:46:28 UTC
I went through the provided data several times and I still don't see where the stats could get lost. The only thing I noticed is that the VMs were migrated to that host (and not started fresh). Do you experience the behavior only after a migration?

Comment 7 GervaisdeM 2021-07-27 19:38:07 UTC
Hi Tomáš,

It also happens on VMs that are started fresh on a host, so it is not only after a migration. If I put a host into maintenance mode and then activate it, VMs started on that host show an IP properly. If I migrate a VM that is not showing its IP to the newly reactivated host, it displays its IP.

Hosts that have had active VMs running for any length of time eventually no longer consistently display the VMs' IPs in the interface. The odd one or two still show an IP, but it is not always the same VM that keeps displaying one, and I cannot find a pattern.

If I migrate a VM off the recently reactivated host to a host that has been running untouched for some time, the VM immediately stops displaying its IP.

This includes the HostedEngine.

Comment 8 GervaisdeM 2021-07-28 18:16:30 UTC
My HostedEngine is still running CentOS 8.

Does this have any bearing on this issue?
Should I move it to CentOS Stream?

Comment 9 Arik 2021-08-08 16:23:48 UTC
We have failed to reproduce this issue, and while we have the relevant logs, it is still quite difficult to understand what is going on.
Assuming there are still VMs with no guest agent information (IP addresses and FQDN) in your environment, could you please pick one of them and:
1. Provide the output of getStats (see the command sketch below)
2. Specify the exact time getStats was triggered
3. Provide the VDSM log from that host that covers this point in time and a few hours back
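
For example, a rough sketch on the host (the VM UUID can be taken from the engine UI or from getVMList; file names are just suggestions):

```
# list the running VMs known to this host and note the UUID of an affected one
vdsm-client Host getVMList
# record the time and dump the stats for that VM
date -u
vdsm-client VM getStats vmID=<vm-uuid> > /tmp/getStats.out
# the log to attach is /var/log/vdsm/vdsm.log (plus rotated files covering the earlier hours)
```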

Comment 10 GervaisdeM 2021-08-09 14:48:32 UTC
The issue definitely continues. We have 74 VMs running in our cluster and about one third have no IP or FQDN displayed in the oVirt engine.

We have 3 hosts in our cluster (ovirt1, ovirt2, ovirt3 - I know, crazy naming scheme). I pushed most of the VMs off to ovirt2 and ovirt3. ovirt1 has only 8 VMs running on it and they all show the correct information in the engine. If a new box is brought up by one of our developers, it almost always comes up on ovirt1 (because it has the fewest VMs), and it does come up OK and display the correct information in the ovirt-engine. If for some reason the box starts on ovirt2 or ovirt3, I can force it to migrate to ovirt1 and it will display the IP and FQDN properly. As soon as I migrate that new VM to ovirt2 or ovirt3 (the hosts with the majority of the VMs running on them), the IP and FQDN are no longer displayed in the engine interface.

I'm attaching vdsm logs and the output of getStats as requested. The file is called ovirt-logs-2021-08-09.zip.
getStats was run at Mon Aug  9 13:21:37 UTC 2021.

Comment 11 GervaisdeM 2021-08-09 14:50:19 UTC
Created attachment 1812462 [details]
getStats.out vdsm.log vdsm.log.1.xz

Comment 12 Tomáš Golembiovský 2021-08-10 06:41:53 UTC
I may have already asked this, but I cannot find it in the bug. Are all the hosts up to date? What are the versions of vdsm, libvirt and qemu?

In the previous attachment you posted getStats output for a VM on a broken host (as requested). For comparison, could you please also share the result of getStats for the same VM running on ovirt1, when it has all the guest information visible in the engine?

It pretty much looks like a host issue, but let's double-check that it is not related to bug 1901572. Please try clearing the hash for the NICs info in the engine, as in:

    https://bugzilla.redhat.com/show_bug.cgi?id=1901572#c16

Will this fix the issue or change anything?

Comment 13 GervaisdeM 2021-08-10 20:19:10 UTC
Clearing the hash did not seem to have any effect.

Output from getStats on the same VM after migrating it to ovirt1:

```
[root ~]# vdsm-client VM getStats vmID=72ae23f6-7487-42e1-9ceb-dcf189a6f264
[
    {
        "acpiEnable": "true",
        "appsList": [
            "kernel-3.10.0-1160.21.1.el7.x86_64",
            "qemu-guest-agent-2.12.0"
        ],
        "balloonInfo": {
            "balloon_cur": "8388608",
            "balloon_max": "8388608",
            "balloon_min": "8388608",
            "balloon_target": "8388608",
            "ballooning_enabled": true
        },
        "clientIp": "",
        "cpuSys": "0.27",
        "cpuUsage": "6530000000",
        "cpuUser": "0.44",
        "disks": {
            "sdc": {
                "apparentsize": "0",
                "flushLatency": "0",
                "readBytes": "0",
                "readLatency": "0",
                "readOps": "0",
                "readRate": "0.0",
                "truesize": "0",
                "writeLatency": "0",
                "writeOps": "0",
                "writeRate": "0.0",
                "writtenBytes": "0"
            },
            "vda": {
                "apparentsize": "200540160",
                "flushLatency": "1413051.5",
                "imageID": "563abb58-e91a-44da-9382-9a850fff0c4e",
                "readBytes": "0",
                "readLatency": "0",
                "readOps": "0",
                "readRate": "0.0",
                "truesize": "214777344",
                "writeLatency": "21563795.0",
                "writeOps": "1",
                "writeRate": "102.40785782986511",
                "writtenBytes": "1536"
            }
        },
        "disksUsage": [],
        "displayInfo": [
            {
                "ipAddress": "192.168.168.13",
                "port": "5909",
                "tlsPort": "",
                "type": "vnc"
            }
        ],
        "elapsedTime": "962703",
        "guestCPUCount": -1,
        "guestFQDN": "gervais-centos7-test-php56.dgi",
        "guestIPs": "",
        "guestName": "gervais-centos7-test-php56.dgi",
        "guestOs": "3.10.0-1160.21.1.el7.x86_64",
        "guestOsInfo": {
            "arch": "x86_64",
            "codename": "",
            "distribution": "CentOS Linux",
            "kernel": "3.10.0-1160.21.1.el7.x86_64",
            "type": "linux",
            "version": "7"
        },
        "guestTimezone": {
            "offset": -180,
            "zone": "ADT"
        },
        "hash": "-5327289600703475597",
        "kvmEnable": "true",
        "memUsage": "10",
        "memoryStats": {
            "majflt": 0,
            "mem_free": "7491276",
            "mem_total": "8006768",
            "mem_unused": "7491276",
            "minflt": 2,
            "pageflt": 2,
            "swap_in": 0,
            "swap_out": 0
        },
        "monitorResponse": "0",
        "netIfaces": [
            {
                "hw": "00:00:00:00:00:00",
                "inet": [
                    "127.0.0.1"
                ],
                "inet6": [
                    "::1"
                ],
                "name": "lo"
            },
            {
                "hw": "56:6f:65:8b:00:36",
                "inet": [
                    "192.168.173.70"
                ],
                "inet6": [
                    "fe80::5547:5c32:3c7c:842e"
                ],
                "name": "eth0"
            }
        ],
        "network": {
            "vnet26": {
                "macAddr": "56:6f:65:8b:00:36",
                "name": "vnet26",
                "rx": "28134",
                "rxDropped": "0",
                "rxErrors": "0",
                "sampleTime": 523369.025767464,
                "speed": "1000",
                "state": "unknown",
                "tx": "940",
                "txDropped": "0",
                "txErrors": "0"
            }
        },
        "session": "Unknown",
        "status": "Up",
        "statusTime": "2670852671",
        "timeOffset": "0",
        "username": "",
        "vcpuCount": "1",
        "vcpuPeriod": 100000,
        "vcpuQuota": "-1",
        "vmId": "72ae23f6-7487-42e1-9ceb-dcf189a6f264",
        "vmJobs": {},
        "vmName": "gervais-centos7-test-php56",
        "vmType": "kvm"
    }
]
```

Everything is up to date:

```
[root ~]# dnf clean all ; dnf upgrade
99 files removed
Ceph packages for x86_64                                                                                                                      176 kB/s | 283 kB     00:01    
dell-system-update_independent                                                                                                                144 kB/s | 150 kB     00:01    
dell-system-update_dependent                                                                                                                  155 kB/s | 195 kB     00:01    
oVirt Node Optional packages from CentOS Stream 8 - BaseOS                                                                                    6.7 MB/s | 8.8 MB     00:01    
oVirt Node Optional packages from CentOS Stream 8 - AppStream                                                                                 522 kB/s |  12 MB     00:24    
Latest oVirt 4.4 Release                                                                                                                      1.3 MB/s | 2.2 MB     00:01    
Extra Packages for Enterprise Linux 8 - x86_64                                                                                                7.6 MB/s |  10 MB     00:01    
CentOS-8 - Gluster 8                                                                                                                          157 kB/s |  64 kB     00:00    
virtio-win builds roughly matching what will be shipped in upcoming RHEL                                                                      102 kB/s |  66 kB     00:00    
Copr repo for EL8_collection owned by sbonazzo                                                                                                615 kB/s | 402 kB     00:00    
Copr repo for gluster-ansible owned by sac                                                                                                     28 kB/s | 7.3 kB     00:00    
Copr repo for ovsdbapp owned by mdbarroso                                                                                                     6.9 kB/s | 2.0 kB     00:00    
Advanced Virtualization CentOS Stream packages for x86_64                                                                                     230 kB/s | 153 kB     00:00    
CentOS-8 Stream - oVirt 4.4                                                                                                                   1.0 MB/s | 928 kB     00:00    
CentOS-8 - OpsTools - collectd                                                                                                                287 kB/s | 148 kB     00:00    
CentOS-8 - NFV OpenvSwitch                                                                                                                     92 kB/s |  45 kB     00:00    
OpenStack Train Repository                                                                                                                    661  B/s | 257  B     00:00    
Dependencies resolved.
Nothing to do.
Complete!
```

Versions I am running:

```
[root ~]# rpm -qa | grep vdsm
vdsm-client-4.40.70.6-1.el8.noarch
vdsm-common-4.40.70.6-1.el8.noarch
vdsm-http-4.40.70.6-1.el8.noarch
vdsm-4.40.70.6-1.el8.x86_64
vdsm-gluster-4.40.70.6-1.el8.x86_64
vdsm-network-4.40.70.6-1.el8.x86_64
vdsm-jsonrpc-4.40.70.6-1.el8.noarch
vdsm-python-4.40.70.6-1.el8.noarch
vdsm-api-4.40.70.6-1.el8.noarch
vdsm-yajsonrpc-4.40.70.6-1.el8.noarch

[root ~]# rpm -qa libvirt
libvirt-7.4.0-1.el8s.x86_64

[root ~]# rpm -qa | grep qemu
qemu-kvm-block-ssh-5.2.0-16.el8s.x86_64
qemu-kvm-block-gluster-5.2.0-16.el8s.x86_64
qemu-guest-agent-5.2.0-16.el8s.x86_64
qemu-img-5.2.0-16.el8s.x86_64
qemu-kvm-common-5.2.0-16.el8s.x86_64
qemu-kvm-block-iscsi-5.2.0-16.el8s.x86_64
libvirt-daemon-driver-qemu-7.4.0-1.el8s.x86_64
qemu-kvm-ui-opengl-5.2.0-16.el8s.x86_64
qemu-kvm-block-rbd-5.2.0-16.el8s.x86_64
qemu-kvm-ui-spice-5.2.0-16.el8s.x86_64
qemu-kvm-block-curl-5.2.0-16.el8s.x86_64
qemu-kvm-5.2.0-16.el8s.x86_64
ipxe-roms-qemu-20181214-8.git133f4c47.el8.noarch
qemu-kvm-core-5.2.0-16.el8s.x86_64
```

Comment 14 GervaisdeM 2021-09-01 14:07:14 UTC
I've noticed that there have been some small updates to packages and have applied them. None of them have made any difference to the issue described on this ticket. 

Is there anything I can do on my end? Should I do a reinstall of my hosts from the UI in the ovirt-engine? Could that make the situation worse?

Comment 15 Tomáš Golembiovský 2021-09-02 08:51:30 UTC
(In reply to GervaisdeM from comment #14)
> I've noticed that there have been some small updates to packages and have
> applied them. None of them have made any difference to the issue described
> on this ticket. 
> 
> Is there anything I can do on my end? Should I do a reinstall of my hosts
> from the UI in the ovirt-engine? Could that make the situation worse?

Yes, I wanted to suggest reinstalling the host, so please try that if you can.

Also, compare the configuration in /etc/vdsm to see whether there are any differences between the host that works OK and the faulty hosts.

Comment 16 GervaisdeM 2021-09-07 20:25:50 UTC
I haven't gotten to reinstalling yet; I will try one evening this week. Before I do, should I have any concerns about reinstalling? My cluster is hyperconverged, with all three servers being part of the underlying Gluster filesystems that I need for VMs and the hosted engine. Anything I need to be aware of before running the reinstall?

All three hosts in my cluster are affected by this issue. The only difference in my /etc/vdsm directories on my servers is the vdsm.id

Example:
[root ~]# scp -r ovirt1:/etc/vdsm .
svdsm.logger.conf                                                                                                                           100%  734   124.2KB/s   00:00    
mom.conf                                                                                                                                    100% 3149     3.9MB/s   00:00    
vdsm                                                                                                                                        100%  159   177.3KB/s   00:00    
00-defines.policy                                                                                                                           100%  323    18.6KB/s   00:00    
04-cputune.policy                                                                                                                           100% 2276     3.1MB/s   00:00    
02-balloon.policy                                                                                                                           100% 5369     6.2MB/s   00:00    
05-iotune.policy                                                                                                                            100% 1105     1.6MB/s   00:00    
03-ksm.policy                                                                                                                               100% 4651   621.0KB/s   00:00    
01-parameters.policy                                                                                                                        100%    1     1.8KB/s   00:00    
vdsm.conf                                                                                                                                   100%   56    94.5KB/s   00:00    
logger.conf                                                                                                                                 100% 2167     3.2MB/s   00:00    
vdsm.id                                                                                                                                     100%   36    49.5KB/s   00:00    
[root ~]# diff -r /etc/vdsm vdsm
diff -r /etc/vdsm/vdsm.id vdsm/vdsm.id
1c1
< 4c4c4544-0056-3510-8056-c2c04f383432
\ No newline at end of file
---
> 4c4c4544-0056-3410-8057-c2c04f383432
\ No newline at end of file

Comment 17 GervaisdeM 2021-09-10 14:36:18 UTC
There was another minor upgrade showing for my hosts. I applied the update to all of them and then ran the reinstall, one at a time, on all three oVirt hosts.

This is exactly what I did:

1. Put first server into maintenance mode.
2. Selected Installation -> Reinstall from the menu
3. Left everything as the default, except set Hosted Engine -> Deploy

I did that for all three ovirt hosts.

My issue remains exactly the same. Is there something in my process above that I should change?

Comment 20 GervaisdeM 2021-09-21 16:28:20 UTC
Any next steps that I can take that might help sort this out?

Comment 22 Tomáš Golembiovský 2021-09-22 19:58:08 UTC
(In reply to GervaisdeM from comment #20)
> Any next steps that I can take that might help sort this out?

That depends on whether you would be willing to patch one of your servers with https://gerrit.ovirt.org/c/vdsm/+/116839/.

If yes, then please enable DEBUG logging for the "vds" logger in /etc/vdsm/logger.conf and restart the vdsmd service. Let the VM run for a couple of minutes (5 is more than enough), run the VM.getStats API call with vdsm-client, and provide the vdsm.log again plus the result of the API call. Hopefully that will give us more clues.
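
A minimal sketch of the sequence (the [logger_vds] section name is my assumption about how the "vds" logger appears in logger.conf; adjust if your file differs):

```
# in /etc/vdsm/logger.conf, set the vds logger to DEBUG, e.g.:
#   [logger_vds]
#   level=DEBUG
systemctl restart vdsmd
# wait ~5 minutes, then on the same host:
vdsm-client VM getStats vmID=<vm-uuid> > /tmp/getStats.out
# attach /var/log/vdsm/vdsm.log together with getStats.out
```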

Comment 23 GervaisdeM 2021-09-24 19:08:46 UTC
Created attachment 1826026 [details]
vdsm-debug-log-2021-09-24

Comment 24 GervaisdeM 2021-09-24 19:10:41 UTC
I put my host in Maintenance Mode
Updated the files with https://gerrit.ovirt.org/c/vdsm/+/116839/ by downloading them to one of my hosts and moving files around so the new ones are being used.
Added the DEBUG flag to /etc/vdsm/logger.conf.
Rebooted

Shuffled enough VMs manually to that host so that some of them stop showing IPs in the ovirt-engine web interface (if there are only a couple of VMs on the host, they DO show their IPs).
The vdsm.log (vdsm-debug-log-2021-09-24) is attached and here is the output of the vdsm-client command:

[root ~]# vdsm-client VM getStats vmID=a64feb94-475a-484f-87bc-6d6213cb7fe1
[
    {
        "acpiEnable": "true",
        "appsList": [],
        "balloonInfo": {
            "balloon_cur": "5242880",
            "balloon_max": "5242880",
            "balloon_min": "5242880",
            "balloon_target": "5242880",
            "ballooning_enabled": true
        },
        "clientIp": "",
        "cpuSys": "1.13",
        "cpuUsage": "67110000000",
        "cpuUser": "2.91",
        "disks": {
            "sda": {
                "apparentsize": "4141088768",
                "flushLatency": "1901196.5",
                "imageID": "07b9dfb0-1678-45eb-ac61-50c5064f3326",
                "readBytes": "86016",
                "readLatency": "0",
                "readOps": "9",
                "readRate": "0.0",
                "truesize": "4320882688",
                "writeLatency": "3066841.4",
                "writeOps": "3621",
                "writeRate": "10635.814409007802",
                "writtenBytes": "23912448"
            },
            "sdc": {
                "apparentsize": "0",
                "flushLatency": "0",
                "readBytes": "0",
                "readLatency": "0",
                "readOps": "0",
                "readRate": "0.0",
                "truesize": "0",
                "writeLatency": "0",
                "writeOps": "0",
                "writeRate": "0.0",
                "writtenBytes": "0"
            }
        },
        "displayInfo": [
            {
                "ipAddress": "192.168.168.13",
                "port": "5911",
                "tlsPort": "",
                "type": "vnc"
            }
        ],
        "elapsedTime": "1905267",
        "guestCPUCount": -1,
        "guestFQDN": "",
        "guestIPs": "",
        "hash": "-1536474874257087895",
        "kvmEnable": "true",
        "memUsage": "57",
        "memoryStats": {
            "majflt": 0,
            "mem_free": "2207892",
            "mem_total": "5053548",
            "mem_unused": "261804",
            "minflt": 66,
            "pageflt": 66,
            "swap_in": 0,
            "swap_out": 0
        },
        "monitorResponse": "0",
        "network": {
            "vnet36": {
                "macAddr": "56:6f:65:8b:00:2f",
                "name": "vnet36",
                "rx": "1901928",
                "rxDropped": "0",
                "rxErrors": "0",
                "sampleTime": 6876.242791271,
                "speed": "1000",
                "state": "unknown",
                "tx": "695171",
                "txDropped": "0",
                "txErrors": "0"
            }
        },
        "session": "Unknown",
        "status": "Up",
        "statusTime": "2154359889",
        "timeOffset": "3",
        "username": "Unknown",
        "vcpuCount": "2",
        "vcpuPeriod": 100000,
        "vcpuQuota": "-1",
        "vmId": "a64feb94-475a-484f-87bc-6d6213cb7fe1",
        "vmJobs": {},
        "vmName": "gervais-ovirt-i8manage-nagios",
        "vmType": "kvm"
    }
]

Comment 25 Tomáš Golembiovský 2021-10-01 09:23:35 UTC
Thanks for the logs. They gave me a couple more clues, but we're still far from knowing what is really happening here. Would you also have the logs from before 18:55:43 that day? What I would like to see is the lifetime of one of the VMs without stats, from its start or migration to the host. It looks like these VMs were without stats:

86f85996-3306-4489-9449-569803f5fa85
a64feb94-475a-484f-87bc-6d6213cb7fe1
b42368a2-d6b7-40ac-9842-8a38edf4fdb1
38e5240c-6ef0-43d4-a44d-1bd9bcc40d05

Hopefully the logs are not yet rotated away. Feel free to reach me by email instead of attaching files to this BZ.

Comment 26 GervaisdeM 2021-10-01 12:14:21 UTC
Hi Tomáš

I've sent you an email with what looks like the correct log file attached (based on the timestamp).

The problem is that this issue is intermittent. For me to get the debugging turned on, I had to put the server into maintenance mode. By doing so, the problem (on that one server) is fixed when it is re-activated, until more VMs are migrated to it and then it starts "breaking". I manually moved VMs around to simulate this and then sent you the last set of logs.

It is very possible that the VMs you listed above were not on the host (ovirt1 in this case) earlier in the day. :(

Let me know if there is anything else I can do.

Comment 27 GervaisdeM 2021-10-14 01:49:24 UTC
Hi Tomáš,

Any next steps?

Comment 28 Tomáš Golembiovský 2021-10-19 21:11:36 UTC
I think I may have found the source of this issue. Could you please try applying the trivial change from: https://gerrit.ovirt.org/c/vdsm/+/117211

Let me know if this really fixes the issue. If not, I have a few more changes that increase the amount of debug logging in the VDSM handling of qemu-ga.
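
If it helps, a rough sketch of trying the change on one host (the module path is my assumption for an el8 install; back up the original and revert if anything misbehaves):

```
# back up the qemu-ga poller module, apply the small change from the gerrit
# link above by hand, then restart vdsmd so the patched code is loaded
cp /usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py{,.bak}
# ... edit the file to match https://gerrit.ovirt.org/c/vdsm/+/117211 ...
systemctl restart vdsmd
```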

Comment 29 Sander 2021-10-20 07:29:58 UTC
(In reply to Tomáš Golembiovský from comment #28)
> I think I may have found the source of this issue. Could you please try
> applying the trivial change from: https://gerrit.ovirt.org/c/vdsm/+/117211
> 
> Let me know if this really fixes the issue. If not, I have few more changes
> that increase amount of debug logs in the VDSM handling of qemu-ga.

This patch fixes the issue for me (oVirt 4.4.8 on AlmaLinux 8.4).

Comment 30 Yonathan Dossow 2021-10-20 16:39:47 UTC
I can also confirm the fix. Patch applied on RHVH 4.4.8, vdsm-4.40.80.6-1.el8ev.x86_64.

Comment 31 GervaisdeM 2021-10-20 18:38:59 UTC
I'm looking at a list of VMs that are ALL displaying IP addresses!

Looks like the patch worked for me.
oVirt 4.4.8.6-1
vdsm-4.40.80.6-1.el8.x86_64

Comment 32 Tomáš Golembiovský 2021-10-21 08:14:52 UTC
(In reply to GervaisdeM from comment #31)
> I'm looking at a list of VMs that are ALL displaying IP addresses!
> 
> Looks like the patch worked for me.
> oVirt 4.4.8.6-1
> vdsm-4.40.80.6-1.el8.x86_64

Thanks for the confirmation.

Comment 35 RHEL Program Management 2021-10-21 12:55:30 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

