class: center, middle # Troubleshooting Openstack with Ansible and oslogmerger
Jan. 29th 2016, RDO day, CentOS dojo - Miguel Ángel Ajo (mangelajo@redhat.com) ??? Introduce yourself, I work on OpenStack Neutron on behalf of Red Hat, I help fixing bugs, developing new features, and sometimes I give a hand with packaging. --- class: center, middle # distributed system = complex interactions --- # distributed system = complex interactions .center.medium[![][complex]] [complex]: assets/openstack-arch-kilo-logicl-v1.png --- ### yeah, go ask for an instance - Nova → (nova-scheduler, glance, cinder, neutron, nova-conductor) - nova-conductor → nova-compute - Neutron → (ovs-agent, l3-agent, dhcp-agent) - (l3-agent, dhcp-agent, ovs-agent) → Neutron - Neutron → Nova - Nova → nova-conductor → nova-compute - Instance → ns-metadata-proxy → metadata-agent → Nova Your Instance Is Ready (yes! ✌, or no ☹...) --- .center.small[![][skip_rope]] [skip_rope]: assets/openstack-skip-rope.gif --- ## Releases are stable, but we have too many knobs Many ways to miss configure things .center.small[![][too_many_buttons]] [too_many_buttons]: assets/too_many_buttons.jpg --- ## What logs do we have - Hundreds of scattered logs: every single daemon has one. - Do you cluster?, then even more!, daemons replicated! --- ### Cluster ex. API request & RPC .center.big[![][clustered_request_1]] [clustered_request_1]: assets/clustered_request_1.svg --- ### RPC response .center.big[![][clustered_request_2]] [clustered_request_2]: assets/clustered_request_2.svg --- ### Next API request, balanced .center.big[![][clustered_request_3]] [clustered_request_3]: assets/clustered_request_3.svg --- ## The way openstack logs All projects use oslo_log library, you can send to a centralized log server, but the most common way is writing to files, and rotating. ### oslo_log - example : [oslo_log][] - format definition : [oslo_formats][] [oslo_log]: http://docs.openstack.org/developer/oslo.log/examples.html [oslo_formats]: https://github.com/openstack/oslo.log/blob/88c317195d3f8a0131a07b6d13b1bb594965f5da/oslo_log/_options.py#L124 --- ## OpenStack log format .center.big[![][log_format]] [log_format]: assets/logformat.png --- ## What could we do better - LAAS: LogAsAService?, just kidding, or may be not... - Configure to some sort of centralized log by default and provide logstash, but that's heavy. --- ## oslogmerger to the rescue - it's not magical, but helps - born out of frustration trying to debug HA deployments - proven useful to debug stuff on production and in the gate --- ### oslogmerger: what does it do? - coalescing separate logs, from: FILE1: (date2) (PID) (level) (codepath) (req-id) (message-1) FILE2: (date1) (PID) (level) (codepath) (req-id) (message-2) FILE3: (date0) (PID) (level) (codepath) (req-id) (message-3) - to: (date0) [FILE3] (PID) (level) (codepath) (req-id) (message-3) (date1) [FILE2] (PID) (level) (codepath) (req-id) (message-2) (date2) [FILE1] (PID) (level) (codepath) (req-id) (message-1) ??? Limitations of this approach: time synchronization has to be as good as possible. Clock skew makes the output unusable. --- #### Troubleshooting as a developer - on the gate ```bash $ oslogmerger \ http://logs.openstack.org/.../logs/subnode-2/screen-n-cpu.txt.gz:CMP2 \ http://logs.openstack.org/.../logs/subnode-2/screen-q-agt.txt.gz:OVS2 \ http://logs.openstack.org/.../logs/screen-q-dhcp.txt.gz:QDHCP \ http://logs.openstack.org/.../logs/screen-n-cpu.txt.gz:CMP1 \ http://logs.openstack.org/.../logs/screen-q-agt.txt.gz:OVS1 \ http://logs.openstack.org/.../logs/screen-n-api.txt.gz:NOVAAPI \ http://logs.openstack.org/.../logs/screen-q-svc.txt.gz:NEUTRON ``` - COMPUTE2, OVS2, QDHCP ... are aliases - remote URLs are cached locally --- ### Live demo - Live demo goes here - /me crosses fingers :) ??? oslogmerger http://logs.openstack.org/40/273140/2/check/gate-tempest-dsvm-neutron-multinode-full/79a4d8f/logs/subnode-2/screen-n-cpu.txt.gz:COMPUTE2 http://logs.openstack.org/40/273140/2/check/gate-tempest-dsvm-neutron-multinode-full/79a4d8f/logs/subnode-2/screen-q-agt.txt.gz:OVS2 http://logs.openstack.org/40/273140/2/check/gate-tempest-dsvm-neutron-multinode-full/79a4d8f/logs/screen-q-dhcp.txt.gz:QDHCP http://logs.openstack.org/40/273140/2/check/gate-tempest-dsvm-neutron-multinode-full/79a4d8f/logs/screen-q-agt.txt.gz:OVS1 http://logs.openstack.org/40/273140/2/check/gate-tempest-dsvm-neutron-multinode-full/79a4d8f/logs/screen-n-api.txt.gz:NOVAAPI http://logs.openstack.org/40/273140/2/check/gate-tempest-dsvm-neutron-multinode-full/79a4d8f/logs/screen-q-svc.txt.gz:NEUTRON look for: _get_dhcp_agents_hosting_network --- ### Troubleshooting & support If we add Ansible to the equation, things become: - Nicer - It can be automated and repeated - Bugfixes can be verified --- ### Ansible and oslogmerger - Ansible - orchestrates manipulation of the hosts - introduces probes - retrieves logs - oslogmerger - creates a set of unified logs --- #### Using Ansible to set the hosts (1/2) ```yaml --- - name: Configure debug settings. hosts: all tasks: - name: Set nova and neutron on debug. command: openstack-config --set {{ item }} DEFAULT debug True with_items: - /etc/neutron/neutron.conf - /etc/nova/nova.conf - name: Restart controller (with networker) services. hosts: controllers tasks: - name: Clear logs. command: > # rotating is better than removing, of course :) rm -f /var/log/nova/nova-api.log /var/log/neutron/server.log \ /var/log/neutron/l3-agent.log - name: Restart controller services service: name={{ item }} state=restarted with_items: - neutron-server - neutron-l3-agent - openstack-nova-api ``` --- #### Using Ansible to set the hosts (2/2) ```yaml - name: Restart compute services. hosts: computes tasks: - name: Clear logs. command: > rm -f /var/log/nova/nova-compute.log \ /var/log/neutron/openvswitch-agent.log - name: Restart neutron openvswitch agent and nova compute service: name={{ item }} state=restarted with_items: - neutron-openvswitch-agent - openstack-nova-compute ``` --- #### Ansible: reproduce the issue ```yaml - name: Migrate back and forth hosts: command-controller tasks: - name: Migrate to host 1 shell: | source /root/keystonerc_admin nova live-migration {{ vm_id }} {{ migration_host_1 }} sleep {{ migration_wait }} - name: Migrate to host 2 shell: | source /root/keystonerc_admin nova live-migration {{ vm_id }} {{ migration_host_2 }} sleep {{ migration_wait }} ``` --- #### Using Ansible to grab the logs ```yaml - name: Fetch logs from hosts hosts: all tasks: - name: create local log dir local_action: > command mkdir -p logs/{{ ansible_hostname }}/var/log/{{ item }} with_items: - nova - neutron - name: Fetch logs synchronize: > mode=pull src=/var/log/{{ item }}/ dest=logs/{{ ansible_hostname }}/var/log/{{ item }}/ delete=yes with_items: - nova - neutron ``` - at the end, logs are locally stored to logs/$hostname/var/log/... --- #### Ansible: local log merge ```yaml - name: Final merge hosts: command-controller # a dirty trick to get this locally ran once # is there a better way to do this? tasks: - name: merge logs local_action: shell # < it happens on the machine running ansible oslogmerger logs/h1/var/log/neutron/server.log:NEUTRONSRV \ logs/h1/var/log/neutron/openvswitch-agent.log:OVSH1 \ logs/h1/var/log/neutron/pinger.log:PINGH1 \ logs/h1/var/log/neutron/arpdumper.log:ARPH1 \ logs/h1/var/log/neutron/tapdumper.log:TAPH1 \ logs/h1/var/log/nova/nova-api.log:NOVAAPI \ logs/h1/var/log/nova/nova-compute.log:COMPUTEH1 \ logs/h1/var/log/neutron/l3-agent.log:L3AGENT \ logs/h2/var/log/neutron/openvswitch-agent.log:OVSH2 \ logs/h2/var/log/neutron/pinger.log:PINGH2 \ logs/h2/var/log/neutron/arpdumper.log:ARPH2 \ logs/h2/var/log/neutron/tapdumper.log:TAPH2 \ logs/h2/var/log/nova/nova-compute.log:COMPUTEH2 \ >logs/mergedlogs.log ``` --- ## Introducing probes in the system - pingers - tcpdumps - flow dumping - but, how? --- #### Ansible: setting up probes (1/2) ```yaml - name: Start pingers and arp loggers hosts: all tasks: - name: Kill old probes. shell: | killall pinger.sh arpdumper.sh tapdumper.sh tcpdump ping exit 0 - name: Copy scripts. copy: src=scripts/{{ item }} dest=/tmp/{{ item }} mode=755 with_items: - pinger.sh - arpdumper.sh - tapdumper.sh ( continues on next slide ) ``` --- #### Ansible: setting up probes (2/2) ```yaml (continues on previous slide) - name: Start new pingers in openstack log format. shell: | rm -f /var/log/neutron/pinger.log nohup /tmp/pinger.sh {{ floating_ip }} > \ /var/log/neutron/pinger.log & - name: Start arpdumper. shell: | rm -f /var/log/neutron/arpdumper.log nohup /tmp/arpdumper.sh > /var/log/neutron/arpdumper.log & - name: Start tapdumpers. shell: | rm -f /var/log/neutron/tapdumper.log nohup /tmp/tapdumper.sh {{ vm_tap }} > \ /var/log/neutron/tapdumper.log & ``` --- # Wrap them up! #### ARP dumper ```bash #!/bin/sh ip link set dev br-int up tcpdump -e -n -l -i br-int arp 2>&1 | \ while read line; do echo "$(date +%Y-%m-%d\ %H:%M:%S.%03N) tcpdump DEBUG $line"; done ``` #### Pinger ```bash #!/bin/sh ping -i 0.5 $1 | \ while read pong; do echo "$(date +%Y-%m-%d\ %H:%M:%S.%03N) pinger DEBUG $pong"; done ``` --- #### TAP dumper ```bash #!/bin/sh interface=$1 skip2() { # tail --lines=+3 blocks the output until the input finishes.... read line ; read line while read line; do echo $line done } tcpdump_retry() { interface=$1 echo "Starting tcpdump retrier on interface '$interface'" while true; do tcpdump -i $interface -e -n -l 2>&1 | skip2 sleep 1 # yield before retrying done } tcpdump_retry $interface| \ while read line; do echo "$(date +%Y-%m-%d\ %H:%M:%S.%03N) tcpdump DEBUG $line"; done ``` ??? It's limited, you need to know the name of the tap in advance, a possible improvement here is to automatically look for new tap devices, and start logging them as they pop up. auto-tap-logger :) --- # Another live demo with ansible - OMG, are you sure? nope.. - filtering out useful stuff: ```bash cat logs/mergedlogs.log | \ egrep -e '(Lifecycle|ARP|META|NOVA)' ``` --- # What's missing (1/2) - move it to stackforge/openstack - package - simplify gate log merger - filtering options (by error levels) --- # What's missing (2/2) - ansible template generator - standard set of probes: - pingers - auto-tap-loggers - arp-watcher - others: storage? kvm related? --- class: center, middle # THE END Questions? Ideas? Feedback? Miguel Ángel Ajo (mangelajo@redhat.com) - http://github.com/mangelajo/oslogmerger - http://www.ansible.com - example : [debug_live_migration][] http://mangelajo.github.io/openstack-debugging-presentation/ [debug_live_migration]: hhttps://github.com/mangelajo/oslogmerger/tree/master/contrib/debug-live-migration