Setting KVM hostname via DHCP

The Problem

One problem with virtual machines is that when you clone one, you also copy the complete configuration, including the hostname, static IPs, etc. To fix this you need to boot the cloned VM, edit the configuration, and reboot it. The downside is that you will have at least temporary hostname and/or IP conflicts.

Use DHCP

The better approach is to obtain hostnames and IPs via DHCP.
When you clone a machine (e.g. using virt-manager) to test software updates or other changes, you simply remove the external NIC from the cloned VM, and the internal NIC automatically gets a new MAC address assigned.
Then you update the DHCP server configuration (e.g. /etc/dnsmasq.conf) and add the new MAC address, the assigned IP, and the hostname there. When the new VM boots, it automatically gets the correct hostname and a new IP, without any need to change the VM’s configuration files.

Example:

dhcp-host=52:54:00:B9:C5:06,redmine,192.168.0.14,30m
dhcp-host=52:54:00:BF:70:1C,redmine-test,192.168.0.19,30m

After the changes have been tested successfully, you can apply them to the real system (you should still have a backup).
Don’t remove the cloned test VM. Just shut it down and keep it for the next time.

The next time you need to test some changes on that machine, you already have the complete clone configuration and the MAC address set up in DHCP. So you simply replace the clone’s disk file (e.g. redmine-test.qcow) with the latest version of your VM’s disk file (e.g. redmine.qcow). Then you can start the test machine and everything should work just fine without any conflicts.

Example SSH Session:

# log into the VM's host system
ssh -l root blade7
cd /var/lib/libvirt/images
# shutdown the VM before copying the file
virsh shutdown Redmine
cp redmine.qcow redmine-test.qcow
# Restart VM
virsh start Redmine
# Start Clone-VM
virsh start Redmine-test
exit

Configuring the DHCP Client on Debian (VM Guest)

On Debian you simply need to set the hostname in /etc/hostname to “localhost” to enable receiving the hostname via DHCP.
The DHCP client itself is already configured to request the “host-name” option via DHCP; see /etc/dhcp/dhclient.conf, where “host-name” should appear in the list of requested DHCP options.
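A quick way to check both settings from a shell (just a sketch; the grep only confirms that host-name is in the request list):

# set the hostname to localhost so it gets replaced by the DHCP-provided name
echo localhost > /etc/hostname
# verify that the DHCP client requests the host-name option
grep -A4 '^request' /etc/dhcp/dhclient.conf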

About dnsmasq

As a side note, I should mention that dnsmasq is a great solution here: it is a DHCP server and a DNS server in one application.
This means that whether you add hostnames manually to the DHCP configuration or receive them via DHCP from any DHCP client, these names can also be resolved via DNS automatically, without any further configuration.
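A minimal dnsmasq.conf sketch illustrating this combination (the domain name and address range are made-up example values):

# hand out dynamic leases on the internal network
dhcp-range=192.168.0.100,192.168.0.200,30m
# resolve plain hostnames via DNS with this domain appended
domain=lab.local
expand-hosts
# fixed assignment as shown above
dhcp-host=52:54:00:B9:C5:06,redmine,192.168.0.14,30m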

VS2017 now has CMake support

We have been using CMake for years in my company to develop cross-platform software. CMake is really a great piece of software, but the user experience in Visual Studio was not great. It worked, but we often needed to explain how, and the typical Visual Studio user didn’t know CMake at all.

Now CMake support has been built into Visual Studio directly, which means

  • you can open CMake projects directly.
  • there is no need to generate Visual Studio solutions anymore.
  • you can configure CMake options from the GUI.
  • you can directly execute CMake targets like “install” from the Solution Explorer.

Watch this video to see it in action:
https://blogs.msdn.microsoft.com/vcblog/2016/10/05/cmake-support-in-visual-studio/

This is really great news for the Visual Studio world.
Hopefully even more developers will use CMake in the future.

Profiling Memory using GNU glibc tools

One of the tools I have been using for quite a while to profile memory usage is the built-in profiling support in glibc. For some reason it isn’t widely known, so I thought documenting it here makes sense; then I can simply point people to this post instead of explaining everything 😉

What does glibc support?

  • Detecting memory leaks
  • Printing a memory histogram
  • Plotting a graph of memory usage over time
  • Measuring not only heap, but also stack usage
  • Works also on embedded systems that use glibc, not only on the PC

How does it work?

The functionality is implemented in a library called libmemusage.so, which
gets preloaded by the dynamic linker simply by defining the variable LD_PRELOAD=/lib64/libmemusage.so.
The path may vary depending on the system you use, of course.

Example:

LD_PRELOAD=/lib64/libmemusage.so ./helloworld

You can configure where the profile output is stored by exporting the variable MEMUSAGE_OUTPUT=profile.dat.
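Example:

MEMUSAGE_OUTPUT=profile.dat LD_PRELOAD=/lib64/libmemusage.so ./helloworld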

There is also a convenience wrapper script named memusage which does all this for you. A second program called memusagestat can generate nice graphics from the profiling data. Normally these scripts don’t get installed with glibc and must be installed separately.

Gentoo: compile glibc with the ‘gd’ USE flag.
Debian: libc6-dbg contains /usr/lib/debug/lib/x86_64-linux-gnu/libmemusage.so, but the scripts are missing.
On other systems you may find a package called glibc-utils which contains the scripts. As a last resort you can download them from https://www.gnu.org/software/libc/
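If you are unsure where your distribution puts the library, a generic search helps (just a sketch, nothing distribution-specific):

# locate libmemusage.so (the path differs between distributions)
find / -name 'libmemusage.so*' 2>/dev/null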

Now let’s see this in action. For this I created a simple example application, which allocates memory and creates one memory leak.

#include <stdio.h>
#include <malloc.h>

int main(int argc, char *argv[])
{
    int i;
    void *data[10];

    printf("Hello World\n");

    for (i = 0; i < 10; ++i) {
        data[i] = malloc(i+10);
    }

    for (i = 0; i < 9; ++i) {
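        /* free only 9 of the 10 blocks; data[9] is the intentional leak */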
        free(data[i]);
    }

    return 0;
}

Compile it like this: gcc -o hello hello.c
And run it using memusage:

$> memusage ./hello
Hello World

Memory usage summary: heap total: 1169, heap peak: 1169, stack peak: 656
         total calls   total memory   failed calls
 malloc|         11           1169              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
 calloc|          0              0              0
   free|          9            126
Histogram for block sizes:
    0-15              6  54% ==================================================
   16-31              4  36% =================================
 1024-1039            1   9% ========

As you can see, there are 11 malloc calls (10 from our code, 1 from printf)
and only 9 frees, so we have found a memory leak.
We can also see the block size distribution in the histogram, and the summary
shows the heap and stack peak values.

Now let’s do the same thing again and plot a memory usage chart.

$> memusage -d profile.dat ./hello
$> memusagestat -o profile.png profile.dat

This creates the following graphic:

Profile

Of course this is a little boring for such a simple application, but it is very useful in bigger ones. For detecting memory leaks there are better solutions available, like Valgrind, but this chart is very useful to see how much memory an application needs at what time, even when no memory leak exists. Consistently growing memory usage, for example, would be a problem.

How to use this on embedded devices?

In most cross-compiler toolchains that use glibc, the library libmemusage.so exists as well. You can use it on the embedded system by SSHing into the device and using the LD_PRELOAD approach described above, even though the convenience scripts are usually not available there. Then you copy the results back to your PC using SSH and generate the plot on the PC using memusagestat.
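A rough sketch of that workflow; the device name “target”, the application path, and the library path are just placeholders for whatever your system uses:

# on the target: run the application with memory profiling enabled
ssh root@target 'MEMUSAGE_OUTPUT=/tmp/profile.dat LD_PRELOAD=/lib/libmemusage.so /usr/bin/myapp'
# copy the profiling data back to the PC
scp root@target:/tmp/profile.dat .
# generate the chart on the PC
memusagestat -o profile.png profile.dat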

Fix performance of VIM syntax highlighting in XML files

Normally Vim is damn fast, and this is one reason why I love it. But today I got an XML file with over 3 million lines, and after opening it I wanted to jump to the end of the file. This took over a minute, which is unacceptable. Turning off syntax highlighting fixes the performance issue, but of course I don’t want to disable it.

I found the solution on Stack Overflow:
http://stackoverflow.com/questions/19030290/syntax-highlighting-causes-terrible-lag-in-vim

The problem was caused by a single regex entry in syntax/xml.vim: the line containing the regex for xmlSyncDT.

" synchronizing
" TODO !!! to be improved !!!

syn sync match xmlSyncDT grouphere  xmlDocType +\_.\(<!DOCTYPE\)\@=+

By commenting this line out, the performance issue is gone without noticeable side effects. The comment above the line already shows that the developer of this file is not happy with it, and I can live perfectly well without it.
Now I can jump to the end of the 3-million-line file without any noticeable delay.
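If you prefer not to touch the system-wide file, the same fix can be applied to a per-user copy (a sketch; the source path depends on your distribution and Vim version, and the user copy in ~/.vim/syntax is picked up instead of the system file):

mkdir -p ~/.vim/syntax
cp /usr/share/vim/vim*/syntax/xml.vim ~/.vim/syntax/xml.vim
# comment out the xmlSyncDT line in the user copy
sed -i 's/^syn sync match xmlSyncDT/"&/' ~/.vim/syntax/xml.vim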

The post on Stack Overflow also shows how to debug such issues:

:syntime on
" do something, e.g. scroll through the file
:syntime report

This statistic shows you where in the syntax highlighter the CPU time gets “burned”.

Easily debug unit tests with GDB

A while ago I developed an ANSI C based unit test framework that gives me the same comfort as Qt’s unit test framework in C++, but is small and portable enough for embedded ANSI C applications.

You can find the framework here on GitHub: http://gergap.github.io/unittest

This framework also supports data-driven tests, which means you write one test case and can execute the same test n times with different data sets. When something goes wrong you need to debug it. For that you can pass the name of the test case as a command-line argument, so that only this test case gets executed. If it’s a data-driven test it still gets called n times, so there is a second command-line argument which contains the data set name. Then the test case gets executed only once, with the data set causing the problem.
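For example, running just one data set of one test case could look like this (the binary name utiltest is only a guess based on the source path shown further below):

./utiltest test_strtouint16 padding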

This is already quite cool and works well. But I’m lazy: I didn’t want to open the correct file in the debugger, set the breakpoint, start the application with the correct arguments, and let it run to the problematic line. So now I have automated this as well.

Now you can simply invoke the test with the option “-g” (like debugging in GCC), nothing more. The test will then abort on the first error and create a file called ‘test.gdb’ with all the instructions necessary to debug the problem.
Then you start GDB with this file and you are right where you want to be.

Here is an example of the unit test output:

Unit test output

When you run cgdb as shown in the output, you jump right into the debugger with one breakpoint set at the start of the test case and one at the error location. You can decide whether you want to step to the error or simply hit ‘c’ (continue) to run to the error location. Sometimes it also helps to start history recording before pressing continue, so you can use reverse stepping if you missed an important line.

Unit test debugging

The generated test.gdb file, which does the magic for you, contains only three lines:

break test_strtouint16
break ../src/test/utiltest/strtoint.c:277
run test_strtouint16 padding
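Starting the debugger with this file is then a one-liner, for example with plain GDB (the binary name utiltest is again just a guess based on the path above; cgdb is a curses front-end around GDB):

gdb -x test.gdb ./utiltest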

This new feature is not yet on GitHub, but I’ll upload it soon.
I think it’s a really useful and simple feature.

It would be cool to see this in other unit test frameworks too in the future.

Latency Heatmaps

Latency heatmaps are a great way to visualize latencies that are hard to grasp from raw test data. Brendan Gregg (http://brendangregg.com) has written a great Perl script for generating such heatmaps as interactive SVG graphics. His flame graphs are just awesome too, but that is another story.

(Unfortunately SVGs are not allowed on WordPress, so I converted it to a PNG for this blog.)

Latency Heatmap

Just recently I used these heatmaps to visualize the timing accuracy of our OPC UA server SDKs, so I’m taking the opportunity to blog about it.

I used a Python test tool to measure the sampling rate using the OPC UA timestamps.
This outputs a simple list of integer values (µs since the UNIX epoch).

1477661743711872
1477661743761997
1477661743811750
1477661743861417
1477661743912030
...

But for generating a heatmap you need input like this:

# time              latency
1477661743761997    50125
1477661743811750    49753
1477661743861417    49667
1477661743912030    50613

Normally, when measuring services like UA Read and UA Write, I have both values: the time of the measurement (sending the request) and the latency (the time until I get the response from the server). This time, when measuring the sampling rate of UA monitored items, it is a little different. I only get the timestamps at which the data was sampled; I don’t care when I received the data. So I compute the latency information as the difference between two consecutive sample points.

This can simply be computed using a few lines of awk script:

BEGIN { last = 0; }
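# print one "timestamp latency" pair per input line; latency = delta to the previous timestamp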
{
    if (/[0-9]+/) {
        if (last == 0) {
            last = $1;
        } else {
            latency = $1 - last;
            last = $1;
            printf "%u\t%u\n", $1, latency
        }
    }
}

The result can then be fed into Brendan’s trace2heatmap.pl Perl script.

The whole process of measuring and generating the SVG is put into a simple Bash script which does the following:
1.) Call the Python UA test client
2.) Call the awk script to prepare the data for trace2heatmap.pl
3.) Call trace2heatmap.pl to generate the SVG

This also shows the power of Linux command-line tools like Bash, awk, and Perl. I love how seamlessly these tools work together.

Excerpt from this Bash script:

...
# do measurement
echo "Starting measurement for 10s..."
if [ $PRINT_ONLY -eq 0 ]; then
    ./test.py $URL subscription >log.txt || { cat log.txt; exit 1; }
fi
echo "Done."
# compute latency based on source timestamps
echo "Computing latency data using awk..."
awk -f latency.awk log.txt >latency.txt || exit 1
# generate heatmap
echo "Generating heatmap..."
./trace2heatmap.pl --stepsec=0.1 --unitstime=us --unitslatency=us --grid --minlat=$MINLAT --maxlat=$MAXLAT --reflat=$REFLAT --title="$TITLE" latency.txt > $SVGFILE || exit 1
echo "Done. Open $SVGFILE in your browser."

I used this to measure at a 50 ms sampling rate, once on Windows 10 and once on Linux.
The results are quite different.

Windows 10 measurement:
Latency Heatmap

It is interesting to see that we are quite far away from the configured 50 ms sampling interval. The reason for this is that our software uses software timers for sampling that are derived from the Windows GetTickCount() API function. Its resolution is quite bad, about 15-16 ms. Maybe this could be improved using QueryPerformanceCounter.
See also https://randomascii.wordpress.com/2013/05/09/timegettime-versus-gettickcount/

Linux measurement: (Linux ws-gergap 4.4.6-gentoo)
Latency Heatmap

On Linux we use clock_gettime() to replicate the Windows GetTickCount() functionality, and this works much better. We also don’t see such runaway measurements due to scheduling delays, even though it’s a standard Linux kernel without real-time extensions. Linux does a pretty good job.

Note that both graphics above use the same scale. When zooming further into the Linux measurement, we notice another phenomenon:
Latency Heatmap
You can see two lines in the measurement, exactly 1 ms apart. The reason for this is that our platform abstraction layer has a tickcount() function which is modelled after the Win32 API, which means it uses ms units. This in turn means our software cannot create more accurate timer events, even though Linux itself would be able to handle them.

We should think about changing this to µs to get better accuracy, and maybe QueryPerformanceCounter can also fix the problem on Windows. But for the moment we are happy with the results, as they are already much better than on Windows.

Second note: I modified trace2heatmap.pl a little bit to also show the configured sampling rate (red line). This way it is easier to see how far the measured timestamps are from the configured sampling rate. The Perl script is really easy to understand, so custom modifications are a no-brainer.

If somebody is interested in these scripts, just leave me a comment and I will put them on GitHub.

Thanks to Brendan for this script and his great book and website.

Windows 10 sucks

Yesterday I needed Windows to build a product release for the Windows platform.
Normally I don’t do this, but the colleague who usually does was on vacation. I touch Windows only a few times a year, and I don’t waste my time and hard disk on installing it. Even though I grew up with DOS and Windows, I stopped using it because it causes way too much trouble, while Linux works rock solid.

So I booted the colleague’s PC and guess what? It automatically installed some updates without asking, and after that the network didn’t work anymore. So I could not run the build process, which lives in a VirtualBox VM anyway. Unfortunately the copy of the VM image on our network share was out of date, so I wanted to copy the image from the PC to the network share – oops, no network. So I tried to copy it to an external hard disk connected via USB. Guess what, Windows 10 was not able to recognize the disk. On another Windows XP machine the disk worked, but not on Windows 10 – happy Anniversary Update!!!

Why is this Windows shit always hitting me? I touch Windows as rarely as I can, but there is always a problem: blue screens, “ultra-fast” reboots, something hangs or does not work at all.

With every new Windows version people tell me: “Now it’s much better, no blue screens anymore, blah blah.” And every time I try Windows it is the same crap as before, or even worse…

Sorry, normally my posts are more objective, but this Windows crap pisses me off. I again lost two work days fixing it. Why the hell didn’t I simply boot a Linux SystemRescue CD and copy the VM over to my Linux box? Good question – I think because I was expecting a simple solution. Next time I will not waste my time with Windows anymore – simply use Linux and stay relaxed.