Tuesday, July 16, 2013

Hidden gem in vCOps Foundation or "How much can you really downsize?" - Part 1

Setting the stage


I've been working intensely with an enterprise-licensed vCenter Operations Manager lately and find it a powerful tool to analyse, monitor and optimize vSphere landscapes (and possibly others; I have not had the opportunity to work with the AWS adapter, for instance). On the side I play around with a few foundation-level instances with various customers.

Recently I stumbled across a very annoying issue with the custom interface that I will just briefly outline here, but not go into too deeply just yet. vCOps allows you to create custom tags and assign them to resources. Google it, it's well worth using this feature as it effectively groups resources. So instead of creating a dashboard and filtering for a whole bunch of resources, you end up filtering for your custom resource tag only. vCOps will then, whenever you refresh the dashboard, pull the metrics from the tagged resources. Tag another resource (say you add a datastore to your environment and tag it with your custom tag) and your dashboard will automatically display the newly added resource and its metrics.

The problem is that it only works so well with the heatmap widget, but that's not the topic of the post. Should you want info on this, feel free to reach out via the comments below or Twitter @Str0hhut.

In a recent discussion on the issue with my friend Iwan (@e1_ang) he pointed out that the vApp version of vCOps allows custom grouping of resources while the Windows version does not. That got me thinking about what else I might have missed in the vApp version. So I started comparing the user interfaces. In the vApp version you can indeed create a group by clicking the configuration link in the top right corner and then "Manage Group Types". However at the moment it is beyond me to figure out how to assign resources to my newly created group. Info on this is also appreciated via comments or Twitter.

The hidden gem

Back to topic. While scouting out the menu of the vApp I discovered an interesting link that I highly recommend all aspiring downsizers, and everyone else interested, to click on:



What follows is a dashboard that is remarkably similar to the custom dashboards of a licensed vCOps edition. And aside from what vCOps says about itself in the regular dashboard, this one gives you far more detailed information about what is happening. I have downsized my deployment by quite a bit (the UI VM has been capped to 3GB RAM and 1 vCPU, the Analytics VM to 5GB and 1 vCPU as well), but I have been thinking it has been running quite well so far. Sure, the health of the Analytics VM is at 51 and there is a memory constraint, but other than that it feels quite snappy, graphs are all up and running, data is there plentifully and almost everything works as expected. Well, almost. I've been suspecting that it might be due to the restricted resources that the "Normal" calculation is not working for every resource. The newly found dashboard confirms that at a rather detailed level.
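For reference, the downsizing itself can also be scripted; a minimal PowerCLI sketch, assuming the appliance VMs are simply named "UI VM" and "Analytics VM" (placeholders for whatever your vApp's VMs are actually called), and accepting that the VMs have to be powered off for the change:

# The vApp VMs have to be powered off before memory/vCPU can be reduced
$ui        = Get-VM -Name "UI VM"          # placeholder name
$analytics = Get-VM -Name "Analytics VM"   # placeholder name

Shutdown-VMGuest -VM $ui, $analytics -Confirm:$false
# ... wait for both guests to report PoweredOff ...

Set-VM -VM $ui        -MemoryMB 3072 -NumCpu 1 -Confirm:$false   # 3GB / 1 vCPU
Set-VM -VM $analytics -MemoryMB 5120 -NumCpu 1 -Confirm:$false   # 5GB / 1 vCPU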


As you can see from the screenshot, both my collection and analytics tiers appear to be in bad shape, whereas my presentation tier is happily buzzing along. Furthermore, if you drill into the tiers you'll get an in-depth view of which services are affected.


This is the view you get when you drill into the tree as follows:

  • double click Collection
  • double click vCenter Operations Collector (the bottom most icon of the resulting tree)
  • double click vCenter Operations Adapter (top right corner of the resulting tree)
I then selected the OS resource from the health tree and expanded the Memory Usage folder in the metric selector to find some interesting metrics.


Interestingly enough, it's not the Analytics VM that is swapping like crazy, but the UI VM.

So I started asking myself how I could find out which of the two VMs is really undersized. It's obvious that the Analytics VM needs more power (or is it? Read on!), as the analytics tier is on red alert. Going back to the default dashboard, both VMs report that they are memory constrained. Both VMs are demanding 100% of their configured memory resources, yet both are using only about 50% of what they have. My best guess at this point is a memory constraint on the host vCOps is running on. Both the vCenter client and vCOps provide plenty of evidence that this is the case. This does not come as a surprise: yes, the host is very much constrained, which is the reason why I sized the vCOps VMs down to begin with.
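If you want to pull the same memory counters outside of the UI, a quick PowerCLI sketch (a rough approximation only; the VM names are placeholders and the counters are the standard vCenter memory stats, not vCOps' own metrics):

# Compare active, consumed and swapped memory for both appliance VMs (most recent realtime samples)
Get-VM -Name "UI VM", "Analytics VM" |
    Get-Stat -Stat mem.active.average, mem.consumed.average, mem.swapped.average -Realtime -MaxSamples 10 |
    Sort-Object Entity, MetricId, Timestamp |
    Format-Table Entity, MetricId, Timestamp, Value, Unit -AutoSize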

I'm still curious about why the UI VM is swapping so much when the Analytics VM does not seem to be swapping at all. Better yet, logging into each VM to see what the Linux kernel has to say about it, I found two very different pictures:

Analytics VM

localhost:~ # free -m
             total       used       free     shared    buffers     cached
Mem:          4974       4937         37          0         12        789
-/+ buffers/cache:       4134        839
Swap:         4102          0       4102

UI VM:
vcops:~ # free -m
             total       used       free     shared    buffers     cached
Mem:          3018       2999         19          0          0        391
-/+ buffers/cache:       2607        411
Swap:         4102       1264       2838

Wait, what? So the Analytics VM is not swapping, we knew that already. In fact, the kernel of the Analytics VM somehow even managed to allocate a small amount of memory as an I/O buffer! It seems to me that, despite what vCOps thinks of itself, from an OS point of view the Analytics VM is sized just about right.

Résumé

This is not yet a résumé of what is happening in this particular environment or of how well the vCOps vApp handles downsizing. I have not gathered nearly enough information to fully analyze the situation and draw conclusions, but my brain is buzzing with ideas and paths to follow. For now I have set memory reservations for both VMs to force the host to provide each with its entitlement. I will have to think this over and investigate some more in the days to come, as well as observe how the memory reservations change the picture. Stay tuned, this may get interesting.
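For completeness, the reservations themselves can also be set with PowerCLI; a minimal sketch, again with placeholder VM names and reservations matching the configured memory above:

# Reserve the full configured memory for both appliance VMs
Get-VM -Name "UI VM" | Get-VMResourceConfiguration |
    Set-VMResourceConfiguration -MemReservationMB 3072
Get-VM -Name "Analytics VM" | Get-VMResourceConfiguration |
    Set-VMResourceConfiguration -MemReservationMB 5120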

To be continued...

Friday, June 14, 2013

Multi-NIC-vMotion (not so) deep dive

This morning when I opened my mailbox I found a message regarding our Multi-NIC-vMotion setup failing. Despite what KB2007467 says, there is another way of doing Multi-NIC-vMotion, which I will go into in the scope of this article. But what I find most interesting is the methodology vSphere applies when choosing the vMotion interface to migrate a VM.

Multi-NIC-vMotion on a single vSwitch

The above-referenced KB article describes a Multi-NIC-vMotion setup on a single vSwitch with multiple uplinks. When you create a VMkernel port for vMotion, you need to override the default NIC teaming order of the vSwitch, as vMotion VMkernel ports can only utilize one physical switch uplink at a time (the same applies to VMkernel ports used for software iSCSI). Thus for a vSwitch with two uplinks you need to create two VMkernel ports with vMotion activated, where each VMkernel port uses one of the uplinks as active and the other as unused (not standby).
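A hedged PowerCLI sketch of that single-vSwitch layout (host, port group and uplink names as well as the IPs are assumptions, not taken from the KB article):

$vmhost = Get-VMHost -Name "esx01.example.local"              # hypothetical host
$vsw    = Get-VirtualSwitch -VMHost $vmhost -Name "vSwitch1"
$vmnic2 = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name vmnic2
$vmnic3 = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name vmnic3

# Two vMotion VMkernel ports on the same vSwitch
New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch $vsw -PortGroup "vMotion-01" `
    -IP 172.16.1.51 -SubnetMask 255.255.255.0 -VMotionEnabled $true
New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch $vsw -PortGroup "vMotion-02" `
    -IP 172.16.1.61 -SubnetMask 255.255.255.0 -VMotionEnabled $true

# Override the teaming order: each port group gets exactly one active uplink, the other is unused
Get-VirtualPortGroup -VMHost $vmhost -Name "vMotion-01" | Get-NicTeamingPolicy |
    Set-NicTeamingPolicy -MakeNicActive $vmnic2 -MakeNicUnused $vmnic3
Get-VirtualPortGroup -VMHost $vmhost -Name "vMotion-02" | Get-NicTeamingPolicy |
    Set-NicTeamingPolicy -MakeNicActive $vmnic3 -MakeNicUnused $vmnic2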

Alternative Multi-NIC-vMotion setup using multiple switches

In our environment we use multiple virtual switches to separate traffic. There is a vSwitch with two uplinks for customer traffic to the VMs, there are three vSwitches for admin, NAS and backup access to the VMs and there is a dedicated vSwitch for vMotion traffic. The dedicated switch sports two uplinks and has been configured with two VMKernel interfaces as described in KB2007467. It has recently been migrated to dvSwitch without any problems.

The other virtual switches, with the exception of the customer vSwitch, are heavily underutilized. Thus it seemed only logical to create VMkernel ports on each of those switches for vMotion usage. The ESX hosts have 1TB of RAM each, but unfortunately are equipped with 1GBit/s NICs only. Putting one of those monster hosts into maintenance mode is a lengthy process.

On the physical switch side we are using a dedicated VLAN for vMotion (and PXE boot for that matter).

Choosing the right interface for the job

After a lot of tinkering and testing, we came up with the following networking scheme to facilitate vMotion properly. Our vMotion IPs are spread over multiple network ranges as follows:

ESX01 - Last octet of management IP: 51

vmk1 - 172.16.1.51/24
vmk2 - 172.16.2.51/24
vmk3 - 172.16.3.51/24

ESX02 - Last octet of management IP: 52

vmk1 - 172.16.1.52/24
vmk2 - 172.16.2.52/24
vmk3 - 172.16.3.52/24

and so on. The network segments are not routed, as per VMware's suggestion. Thus the individual VMkernel ports of a single host cannot communicate with each other; there will be no "crossing over" into other network segments.
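A hedged PowerCLI sketch of how the extra ports could be rolled out per host following that scheme (switch and port group names are made up; the last octet is simply taken from the management interface):

$vmhost    = Get-VMHost -Name "esx01.example.local"   # hypothetical host
$lastOctet = ($vmhost | Get-VMHostNetworkAdapter -Name vmk0).IP.Split(".")[-1]

# One additional vMotion VMkernel port per underutilized vSwitch
$layout = @{ "vSwitch-Admin" = "172.16.1"; "vSwitch-NAS" = "172.16.2"; "vSwitch-Backup" = "172.16.3" }

foreach ($switchName in $layout.Keys) {
    $vsw = Get-VirtualSwitch -VMHost $vmhost -Name $switchName
    New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch $vsw -PortGroup "vMotion-$switchName" `
        -IP "$($layout[$switchName]).$lastOctet" -SubnetMask 255.255.255.0 -VMotionEnabled $true
}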

When we vMotion a VM from host ESX01 to ESX02, the source host, based on its routing table, will initiate network connections from ESX01.vmk1 to ESX02.vmk1, ESX01.vmk2 to ESX02.vmk2 and so on. Only if vMotion is not enabled on one of the VMkernel ports will a host try to connect to a different port, and only then will the vMotion fail.

The reason for splitting the network segments into class C ranges is simple: the physical layer is split into separate islands which do not interconnect. For this specific network segment a /22 netmask would do fine and all vMotion VMkernel ports could happily talk to each other. However, since the "frontend" and "backend" edge switches are not connected, this cannot be facilitated.

When we check the logs (/var/log/vmkernel.log), we can see however that all vMotion ports are being used:

2013-06-14T05:30:23.991Z cpu7:8854653)Migrate: vm 8854654: 3234: Setting VMOTION info: Source ts = 1371189392257031, src ip = <172.16.2.114> dest ip = <172.16.2.115> Dest wid = 8543444 using SHARED swap
2013-06-14T05:30:23.994Z cpu17:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.2.114'
2013-06-14T05:30:23.995Z cpu7:8854653)Tcpip_Vmk: 1059: Affinitizing 172.16.2.114 to world 8854886, Success
2013-06-14T05:30:23.995Z cpu7:8854653)VMotion: 2425: 1371189392257031 S: Set ip address '172.16.2.114' worldlet affinity to send World ID 8854886
2013-06-14T05:30:23.996Z cpu13:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.2.114'
2013-06-14T05:30:23.996Z cpu14:8910)MigrateNet: vm 8910: 1998: Accepted connection from <172.16.2.115>
2013-06-14T05:30:23.996Z cpu14:8910)MigrateNet: vm 8910: 2068: dataSocket 0x410045b36c50 receive buffer size is 563272
2013-06-14T05:30:23.996Z cpu13:8854886)VMotionUtil: 3087: 1371189392257031 S: Stream connection 1 added.
2013-06-14T05:30:23.996Z cpu13:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.3.114'
2013-06-14T05:30:23.996Z cpu14:8854886)VMotionUtil: 3087: 1371189392257031 S: Stream connection 2 added.
2013-06-14T05:30:23.996Z cpu14:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.4.114'
2013-06-14T05:30:23.996Z cpu14:8854886)VMotionUtil: 3087: 1371189392257031 S: Stream connection 3 added.
2013-06-14T05:30:23.996Z cpu14:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.1.114'
2013-06-14T05:30:23.997Z cpu12:8854886)VMotionUtil: 3087: 1371189392257031 S: Stream connection 4 added.
2013-06-14T05:30:23.997Z cpu12:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.0.114'
2013-06-14T05:30:23.997Z cpu12:8854886)VMotionUtil: 3087: 1371189392257031 S: Stream connection 5 added.
2013-06-14T05:30:38.505Z cpu19:8854654)VMotion: 3878: 1371189392257031 S: Stopping pre-copy: only 43720 pages left to send, which can be sent within the switchover time goal of 0.500 seconds (network bandwidth ~571.132 MB/s, 13975% t2d)


This makes for some impressive bandwidth, ~571.132 MB/s, on measly GBit/s NICs. The reason our platform was having issues was a few ports where vMotion was disabled. Thankfully a few lines of PowerCLI code solved that problem:


$IPmask = "192.168.100."
$vmks = @();

for ($i=107;$i -le 118; $i++) {
    for ($j=1; $j -le 6; $j++) {
        if ($j -eq 5) { continue; }
        $vmk = Get-VMHost -Name $IPmask$i | Get-VMHostNetworkAdapter -Name vmk$j
        if ($vmk.VMotionEnabled -eq $false) { $vmks += $vmk }
    }
}

foreach ($vmk in $vmks) {
   $vmk | Set-VMHostNetworkAdapter -VMotionEnabled $true
}
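A quick way to verify the result afterwards:

# List every VMkernel port and its vMotion flag across all hosts
Get-VMHost | Get-VMHostNetworkAdapter -VMKernel |
    Select-Object VMHost, Name, IP, VMotionEnabled | Format-Table -AutoSize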

Monday, June 10, 2013

One more thing I don't like about Debian Wheezy

I'm running a Wheezy iSCSI target that has already caused some headaches. Today I wanted to add two VMDKs to the Wheezy VM to be able to provide more storage to my test cluster. In the past that was easy: just add your disks, log in to the Linux box and issue

echo "scsi add-single-device a b c d" > /proc/scsi/scsi

(Usage: http://www.tldp.org/HOWTO/archived/SCSI-Programming-HOWTO/SCSI-Programming-HOWTO-4.html)

However, with Wheezy there is no /proc/scsi/scsi. The reason is that it has been disabled in the kernel config.

root@debian:/proc# grep SCSI_PROC /boot/config-3.2.0-4-amd64 
# CONFIG_SCSI_PROC_FS is not set

Wtf?! (pardon my French!)

The solution, however, is quite simple, and annoying in itself as well. All you need to do is install the scsitools package. Thankfully, the list of dependencies on a (relatively; I installed VMware tools, thus it has the kernel headers, gcc, make, perl and iscsitarget incl. modules) fresh Debian installation is quite short...

fontconfig-config{a}
libdrm-intel1{a}
libdrm-nouveau1a{a}
libdrm-radeon1{a}
libdrm2{a}
libffi5{a}
libfontconfig1{a}
libfontenc1{a}
libgl1-mesa-dri{a}
libgl1-mesa-glx{a}
libglapi-mesa{a}
libice6{a}
libpciaccess0{a}
libsgutils2-2{a}
libsm6{a}
libutempter0{a}
libx11-xcb1{a}
libxaw7{a}
libxcb-glx0{a}
libxcb-shape0{a}
libxcomposite1{a}
libxdamage1{a}
libxfixes3{a}
libxft2{a}
libxi6{a}
libxinerama1{a}
libxmu6{a}
libxpm4{a}
libxrandr2{a}
libxrender1{a}
libxt6{a}
libxtst6{a}
libxv1{a}
libxxf86dga1{a}
libxxf86vm1{a}
scsitools
sg3-utils{a}
tcl8.4{a}
tk8.4{a}
ttf-dejavu-core{a}
x11-common{a}
x11-utils{a}
xbitmaps{a}
xterm{a}
 
That's all it takes for you to run "rescan-scsi-bus", which will discover your disks. That was easy, wasn't it?

Friday, June 7, 2013

Access denied. Your IP address [A.B.C.D] is blacklisted. - OpenVPN to the rescue!

Ok, so some of your ISP's fellow customers got their boxes infected and are now part of a botnet (in this specific case apparently the name of the trojan is "Pushdo", "Pushdo is usually associated with the Cutwail spam trojan, as part of a Zeus or Spyeye botnet." src.: http://cbl.abuseat.org). "Doesn't bother me" you may think. "I got all my gear secured" you may think.

Well, that's where you're wrong.

It does bother you!

Upon my morning round of blogs I realized I couldn't access http://longwhiteclouds.com/ any more. Instead I was being greeted with this friendly message:
 
Access denied. Your IP address [A.B.C.D] is blacklisted. If you feel this is in error please contact your hosting providers abuse department.

This is just one effect. I have been having a seriously choppy internet experience for the past two or three days that I'd like to throw into the pot of symptoms I am seeing.

A bit of research quickly revealed what was going on. As a part-time mail server admin for my company I know that we use spamhaus.org (among other services and mechanisms) for spam checking. A check in the Blocklist Removal Center provided information about the source and reason for the blockage. Just enter the IP in question and click on Lookup. I found myself in both the Policy Based Blocklist and the Composite Blocking List, and possibly elsewhere, too.

Suggestions

Well, firstly, let's be sociable and inform our ISP. They may know already and be working on the case, or not.

But that doesn't help me right now! I wanna read blogs now!

OpenVPN to the rescue

Luckily I have access to a corporate OpenVPN-based network. Unlike other solutions, this network does not per se route all traffic but just provides access to the corporate network. However, in this case I wish to do just that.

If all I am worried about is longwhiteclouds.com, I can just set a static route to the tun interface IP like so:

user@box> ip r | grep tun0
192.168.1.0/24 via 172.16.5.17 dev tun0
192.168.5.0/24 via 172.16.5.17 dev tun0
172.16.5.17 dev tun0  proto kernel  scope link  src 172.16.5.18
192.168.7.0/24 via 172.16.5.17 dev tun0 

user@box> ifconfig tun0 | grep inet
          inet addr:172.16.5.18  P-t-P:172.16.5.17  Mask:255.255.255.255

user@box> sudo route add -host longwhiteclouds.com gw 172.16.5.18

But how do you route everything through the tunnel? First you need to set a static route to your VPN provider's endpoint via your existing gateway. Once that is out of the way you can reset your default gateway to your own tunnel.

user@box> ip r | grep default
default via 192.168.1.1 dev eth0
user@box> grep remote /etc/openvpn/corporate_vpn.conf
#remote vpn.example.com 1194
remote 1.2.3.4 1194
tls-remote vpn

user@box> sudo route add -host 1.2.3.4 gw 192.168.1.1 
user@box> sudo route del default
user@box> sudo route add default gw 172.16.5.18
user@box> ip r
default via 172.16.5.18 dev tun0  scope link
[...]

1.2.3.4 via 192.168.1.1 dev eth0

Now everything is swell again in network land, your requests are happily traversing the VPN tunnel.

user@box> tracepath longwhiteclouds.com
1:  172.16.5.18                                          0.349ms pmtu 1350
1:  172.16.5.1                                         312.647ms
1:  172.16.5.1                                         314.739ms
[...] until they finally reach their destination


Hope that helps someone at some point...

Btw.: Excuse the formatting, I'm not too happy with blogger these days.

Monday, June 3, 2013

iscsitarget-dkms broken in Debian Wheezy

Now that was disappointing. An aging iSCSI bug has resurfaced in Debian's latest and greatest stable release, Wheezy, or in numbers, 7. It renders Debian's iSCSI target package useless. Upon scanning a Debian target using an initiator, e.g. ESXi's software iSCSI adapter, the following messages pop up:

Jun  3 04:30:44 debian kernel: [  242.785518] Pid: 3006, comm: istiod1 Tainted: G           O 3.2.0-4-amd64 #1 Debian 3.2.41-2+deb7u2
Jun  3 04:30:44 debian kernel: [  242.785521] Call Trace:
Jun  3 04:30:44 debian kernel: [  242.785537]  [<ffffffffa03103f1>] ? send_data_rsp+0x45/0x1f4 [iscsi_trgt]
Jun  3 04:30:44 debian kernel: [  242.785542]  [<ffffffffa03190d3>] ? ua_pending+0x19/0xa5 [iscsi_trgt]
Jun  3 04:30:44 debian kernel: [  242.785550]  [<ffffffffa0317da8>] ? disk_execute_cmnd+0x1cf/0x22d [iscsi_trgt]
[...]


With ESXi in particular, the LUN will eventually show up after a bunch of timeouts, I suppose, but it is not usable in any way and may disconnect at any time.

Solution:

Thankfully there is a solution to the dilemma. Googling around I found this, again rather old, thread in an Ubuntu forum describing the very same issue. Combined with the knowledge of the aforementioned bug I followed the instructions, grabbed the latest iscsitarget-dkms sources, compiled them and, what do you know, it works like a charm.

Tuesday, May 28, 2013

pvscsi vs. LSI Logic SAS

I've talked at great length before about being a poor man without much of a lab. It shall suffice to say that this has not changed since I started this blog. However, I do have access to quite a bit of infrastructure to test and play around with. And so I did.

In a recent innovations meeting one of my colleagues suggested the use of pvscsi over the default LSI Logic drivers, the idea being the same as with vmxnet3 over e1000 adapters: to save CPU resources. However, that does not automatically yield a performance improvement. Other people have talked about their findings in the past and VMware themselves have said something about it too. To my surprise, my findings were a bit different from what VMware proposes.

The CPU utilization difference between LSI and PVSCSI at hundreds of IOPS is insignificant. But at larger numbers of IOPS, PVSCSI can save a lot of CPU cycles.

My setup was very simple: a 64-bit W2K8R2 VM with 4GB RAM and 2 vCPUs on an empty ESX cluster with empty storage. I ran my tests during off hours, so impact from other VMs on the possibly shared storage is unlikely (unfortunately I do not know in detail how the storage is set up, so I don't know whether the arrays are shared or dedicated; the controllers will be shared however), and the assigned FC storage LUNs were for test purposes only. Apart from the OS drive the VM had two extra VMDKs, each using its own dedicated virtual SCSI controller, pvscsi and LSI Logic SAS.

I might have done something seriously wrong but here's what I found:

Using iometer's Default Access Specification (100% random access at 2kb block sizes, 67% read) I did indeed find very significant differences, but not what I had expected:

pvscsi: Avg Latency: 6.28ms, Avg IOPS: 158, Avg CPU Load 53%
LSI: Avg Latency: 3.16ms, Avg IOPS: 316, AVG CPU Load 34%

Multiple runs confirmed these findings.

Later, changing the access specs to a more real-world scenario, VMware's proposition became more and more true and the values approached each other. At 60% random IO both adapters managed roughly 300 IOPS at 10% CPU load.

Conclusion

 

I cannot conclude much, as I know too little about the storage configuration. However, I wanted to see what happened if I scaled up a little. Using the very same storage I deployed a NexentaStor CE appliance, gave it 16GB RAM for caching and 2 VMDKs on the same datastores as the initial VM (each eager zeroed thick) and configured a RAID0 zpool. I configured 4 zvol LUNs inside the storage appliance and handed them out via iSCSI, migrated the W2K8 VM onto the provided storage (and a nested ESXi for that matter, just to make it a little more irrelevant) and ran the same tests again. Now, utilizing multiple layers of caching, I got quite different values:

pvscsi: Avg Latency 1.59ms, Avg IOPS 626, Avg CPU Load 11%
LSI: Avg Latency 1.72ms, Avg IOPS 582, Avg CPU Load 21%

The performance impact is indeed insignificant; none of this is interesting for enterprise workloads. The CPU utilization difference is significant however, as it nearly doubles! As I said before, all of this is irrelevant and pretty much a waste of time; it just shows that the platform doesn't have the bang to properly make use of a paravirtualized SCSI controller to begin with. To me that is a little disappointing and an eye opener.

Follow up

 

Overriding capacity management I migrated the VM onto a production cluster to see whether the storage systems there are a bit more capable. However again the results are not what I expected:

pvscsi: Avg Latency 0.58ms, Avg IOPS 1708, Avg CPU Load 17%
LSI: Avg Latency 0.47ms, Avg IOPS 2126, Avg CPU Load 21%

Again I conclude that none of this is relevant, unfortunately, and I'm going to have to question the engineering team who set up this storage platform to find some answers as to how they decided what to set up.




Invalid configuration for device '0'

This dreaded message came upon me just now when I tried to reconnect a VM. I had previously shut this VM down, exported it as an OVF and imported it into a test environment to run some iometer tests against a more powerful storage system to compare pvscsi performance to LSI Logic SAS. After the test environment's trial license expired I threw the entire thing away and wanted to reconnect my original VM, only to find the above-mentioned dreaded message.

Following VMware's KB 2014469 on this issue, I first verified that the VM was indeed connected to a free port. I then migrated it using vMotion to a different host, still no good. The third option did however do the trick and thus helped me learn something new about ESXi: it can in fact reload a VM's configuration at runtime and thus resync it with vCenter. And thanks to awk being available, option 3 can easily be shortened to a one-liner:

 vim-cmd vmsvc/reload $(vim-cmd vmsvc/getallvms | grep -i VMNAME | awk '{print $1}')
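If you happen to have a PowerCLI session open anyway, the same reload can presumably also be triggered through the API instead of SSH; a hedged sketch, with VMNAME again being the placeholder:

# Reload the VM's configuration from its .vmx file via the vSphere API
(Get-VM -Name "VMNAME").ExtensionData.Reload()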

Friday, May 17, 2013

Migrate vSphere 4.1 to new host and fresher infra - follow up

Recently I moved an oldish 4.1 environment to a new base of operations. The process was fairly straightforward, but a few minor things are worth mentioning as a follow-up, I think.

Update Manager

So I did not disable the old Update Manager installation before moving the entire thing to a new host. I had already decided to go without UM for the new setup and had not spared a thought on the consequences. Furthermore when I had originally set up the old environment, I had failed to follow the best practices and had used IP addresses instead of FQDNs throughout the setup.

The result was a connection error, because vCenter was continuously trying to connect to the old Update Manager installation. Searching around I could not find a good way to remove the UM binding, so I decided to walk "The Windows Walk": I installed UM again to overwrite the previous registration in the process and uninstalled it properly afterwards.

Connection error resolved.

In retrospect what I could have done is to enable the old UM service again, let vCenter connect to it and then uninstall it properly.

Performance Statistic Rollup

The other thing that sticks out when I check vCenter Service Status is this message:

Performance Statistics rollup from Past xxx to Past xxx is not occurring in the database

In my specific setup the notification claims roll ups are not available for the following durations:

- Previous day to previous week
- Previous week to previous month
- Previous month to previous year

VMware KB2015763 describes this issue for 5.x installations and furthermore points out that statistics rollups need to be enabled in Administration > vCenter Server Settings > Statistics.

Regardless of whether I use vSphere Operations Manager to accumulate the data now or not I am not happy about these warnings, even though they may not affect operations as such. As you may guess historical performance data is not available in vSphere for now.

Both the 5-minute and 30-minute interval rollups were already enabled; I added the 2-hour and 1-day intervals as well, to no effect.
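To double-check which rollup intervals vCenter actually has configured, a small PowerCLI sketch (purely for inspection):

# Show the configured statistics intervals, their sampling periods and retention times
Get-StatInterval | Select-Object Name, SamplingPeriodSecs, StorageTimeSecs | Format-Table -AutoSize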

Digging around some more I find a few possible reasons and explanations for the behaviour:

1. With SQL Express Edition, the VPXD's internal scheduler handles statistics rollups. My installation uses an Express Edition that was not bundled with the vSphere installer. Also, if you are using a full-blown SQL Server and the SQL Agent is not running, the installer supposedly reminds you to start the service, which in this case it did not.
2. Another issue might be KB1030819. I followed the instructions described there, as the datatype was indeed reported to be "numeric" instead of "bigint".

After some rather tedious mucking around, trying to work out a way to automate the statistics rollups and running them by hand a few times over the past few days, I have decided to migrate the environment to a full-blown SQL 2008 server. I found out that our dev team has a fully licensed SQL Server running that I may utilize. Our vSphere environment is small enough that no mutual performance impact is to be expected. Gladly this migration will again be very easy and straightforward and will pave the way to move vCenter itself back to its original home base.

Cannot delete Portgroup - works as designed

This morning when I started work I noticed all those emails about a new ticket in our trouble ticket system. The owner added me, among other colleagues, as a monitor. The issue is the following:

There are numerous virtual port groups that do not have any VMs connected (after a lengthy and mostly automated migration to a new naming scheme) but cannot be deleted. They show up greyed out in the vSphere client, and if you drill down, each of those port groups will show its associated VMs and templates. If you check the config of any affected VM, it will show the new port groups only; the summary pane, however, will list the greyed-out old port groups. The proposed solution is a restart of the vCenter service, as there is an OS update pending anyway.

Cause

As I have seen this behaviour before on several, albeit very rare, occasions, I wanted to take my chances and investigate. I went to the first affected port group and found a template and a running VM associated with it. The template was an easy and logical case: my colleagues had migrated all VMs but had not yet migrated the templates. Convert it to a VM, change the port group association and away it went; convert it back to a template and everything is fine.

The running VM however is a different case. Its settings showed no binding to the old port group. Because it's a customer's system I cannot just go and change things around wildly. I had a closer look and noticed an active snapshot. We have a policy that snapshots may not be kept longer than one week; however, that policy is not being enforced by any automatism or audit trail. The Tasks & Events pane in my vSphere client was not able to tell me when the snapshot had been created... it was already beginning to mold and smell unpleasantly. Same goes for the other affected VMs.

And it makes perfect sense. If I want to go back to my original point in time - when I took the snapshot - I expect the VM to be in the same network (and thus virtual port group). Thinking I might be able to get a list of VMs and port groups via a simple PowerCLI script, I went to work and came up with this slightly ugly code:

# Collect the names of all port groups matching the old naming scheme
$Portgroups = Get-Datacenter $myDC | Get-VirtualPortGroup | Where {$_.Name -like $myPgFilter}
$PGNames = @()
foreach($PG in $Portgroups) {
    $PGNames += $PG.Name
}
$VM_with_Snapshots = @()

# Get all snapshots; note that this looks at the VMs' *current* network adapters,
# not at the adapters captured inside the snapshots
$VMs = Get-VM
$Snapshots = $VMs | Get-Snapshot

foreach($SS in $Snapshots) {
    $NAs = $SS.VM.NetworkAdapters
    foreach($NetAdd in $NAs) {
        if($PGNames -contains $NetAdd.NetworkName) {
            $VM_with_Snapshots += $SS.VM
        }
    }
}


I'm sure there is massive potential for optimization; I'm not a coder, nor do I have much practice at the moment (things are about to change in the near future though). However, this approach did not yield the expected result, as the script returns not the previous port group assignment (still caught in the snapshot) but the current one. Pointers towards the proper result would be greatly appreciated.

Among the affected VMs there are, however, some test systems and infrastructure systems over which I have sufficient control, knowledge and privileges to test my snapshot theory. And of course, I was right.

I now want to refer to this very recent (re-)tweet and extend it to VCPs as well! :)


Monday, May 13, 2013

HP vPV - my own take

HP's Virtualization Performance Viewer, vPV for short, has been stirring up some dust in the past few weeks. I first learned about it from my good friend Amitabh, a passionate HP infrastructure and virtualization engineer with what seems to be a very broad view on current and upcoming technologies. When I first read his post I had just started playing around (and quite enjoying myself doing so, too) with VMware's Operations Manager after hearing about it on several occasions, such as the 1st Singapore VMUG meeting of 2013. I dug around some, set up a demo installation to let vPV and Operations Manager go head to head, talked to one of the managers at my employer and just got to know what it is and what it does.

I held a small presentation to some of my colleagues including the vPV aware manager at my company's headquarters recently. We had a look at both products side by side and talked about first impressions and ideas.

Just minutes ago I read Ben's take on vPV which prompted me to say a few words about the product and my experiences with it as well.

vPV - The ugly truth 

 

To understand what vPV is, I think it's quite helpful to know how it came to be. The aforementioned manager had a chance to talk to one of the HP guys at the last GPC in Las Vegas, who revealed an interesting detail: HP's vPV was initially developed as a helper for their internal operations, more or less "by accident". They realized the potential and showed it off, got a few interested visitors and decided to release it as a product.

vPV - What it does!

 

As both Amitabh and Ben pointed out the installation is dead simple. Download and deploy the virtual appliance, point it at your vCenter and you're ready to roll. You get an instant and, depending on your environment, rather colorful picture.

vPV supports Hyper-V (though not tested by any of the three of us, as far as I can tell) and vSphere. It offers a general overview and drill-down capabilities and makes good use of HP's uCMDB to visualize the structure of your environment and the qualities and properties of your managed objects. It allows you to access the very same real-time metrics that can be found in the vSphere client, with the added bonus of 24 hours of retention (for the free edition, 30 days for the licensed product) as opposed to just one hour. You get every last value (including the infamous CPU Co-Stop) in a cute little graph, can arrange them to correlate issues in the workbench or just rely on the dashboard to get the grand total.

To me it's easier and, as Ben pointed out, a lot snappier than using the vSphere client to access performance graphs of your vSphere environment. A major benefit is the ability to easily place half a dozen graphs on one screen.

vPV - What it does not!

 

Same as the vSphere client itself, strictly speaking it's not a real-time performance analysis tool (or rather visualizer). vPV retrieves its data from vCenter. vCenter retrieves its data from the connected ESXi hosts, which in turn retrieve data from the VMware tools in each VM and measure the individual VM's performance metrics from the hypervisor side. This chain alone adds multiple delays, and the overhead of drawing a fancy graph (and possibly using a lot of RMS to connect the dots without making the graphs bounce up and down like crazy) is not even included in that yet. In order to visualize the data and make it humanly processable, true real time is out of the question, as with most other tools and helpers in this field. Rather, you get the past 24 hours of everything that was going on, in nicely smoothed lines.

Unlike vCOps, vPV does not process the data to generate what I like to call a "management compatible dashboard". vCOps, even the free edition, stands out for its health level visualization. It will correlate CPU ready times with the number of assigned vCPUs and the host's workload and generate dynamic thresholds, where vPV will only show the individual values. Thus vPV will also not free you from analysing dozens of metrics on multiple objects to find the true source of an ongoing performance issue.

On the side, there is one major issue with the free edition: it does not support any means of access control. By default it is open to anyone and everyone within your network. As it allows you to view the entire vSphere environment (up to 200 managed objects in the free edition, of course), it bypasses your vSphere permissions and provides an interesting source of information for the sneaky attacker in your own network. So be sure to at least work out some sort of .htaccess protection.

Unlike vCOps, it does support Hyper-V in the free edition. However, to my understanding the paid version does not integrate AWS (and potentially other cloud vendors and OSes) at the moment.

At the end of the day ...

 

vPV is great if you're serious about vSphere performance analysis and troubleshooting, and a good helper when the vSphere client itself does not provide the data you're looking for. I agree with Ben's résumé that it's a great addition to your preferred set of tools, and being free for 200 managed objects it should be on any aspiring VMware admin's list of favourites.

HP has done a great job in putting their modules together to aggregate already available data and visualize it in an attractive manner. However I have my doubts that this tool alone helps "to rapidly analyze bottlenecks", as they put it. In my very humble opinion you have to already know what you're looking at (and for) to make good use of vPV.

In that sense it is no real competition to vCOps, it simply plays in a different field. Integrate it with BSM and I'm sure a trained and able HP consultant will be able to generate reports and dashboards that will blow your mind. If you're looking for an all-in-one, multi-purpose solution, vCOps might be the better choice. vPV itself may evolve somewhat, but as a standalone tool it can and will only be one of many to assist you on your virtual performance troubleshooting journey. It sure should come in handy when you train to become a VCAP.

Friday, April 19, 2013

Migrate vSphere 4.1 to new host and fresher infra

For the past few weeks one of my vSphere installations has been seeing some major problems that I have not been able to drill down into properly. As I set it up a few years ago and then handed it over to one of my coworkers for management, it became over-utilized, poorly managed and started acting weird. Recently it has been crashing every three days or so due to the transaction log file filling up. The infrastructure (Essentials license) runs on SQL Express 2005 (the one bundled with vSphere 4.1U2), the DB runs at its default settings (simple recovery model, one transaction log file, auto grow, max size 2GB) and there is a poorly configured backup (my bad, I was a bit of a greenhorn back then). Now what happens is that a transaction runs on the DB, fills up the log file and then crashes the VPXD. Not ideal.

Because there are a few other things that bother me about this environment, I figured it's about time I moved it to a separate VM and reinstalled it properly, leveraging all the experience of the past 3 years or so and taking into account the best practices from my vSphere Install, Configure and Manage course and certification last year.

The new setup

Firstly I rolled out a Windows 2008 R2 server, 64-bit, configured hostname and DNS domain, added it to our Samba domain (remember to configure this), installed all available updates and gave it a few reboots for good measure. Once the box was up and running smoothly I installed a fresh SQL Express 2008 with Management Tools from here (I actually installed a few other versions before that, only to constantly find compatibility issues with W2K8R2 and so forth, but this one is fully supported). Once that was out of the way I imported an older backup of the production database; all went smoothly.

The next step was to create a data source. Following the documentation (which I did not need for this project, I might say so proudly) I created a System DSN that points to the newly created and populated database and tested it. The appropriate SQL client libraries had already been installed along with the SQL Server Express Edition. Keep in mind that for a client-server setup you need to install a supported SQL client; the OS's preinstalled client has a beard longer than Santa Claus'... well, you get the point. The Express Edition of SQL 2008 R2 seems to have TCP/IP enabled by default, a thing I seem to remember is different on the full-blown servers. Also, 2008 does install a SQL Agent, but don't bother trying to activate it, it will not work; very confusing.

With SQL Express and the data source working smoothly it was time to install the vCenter. At this point I was doing merely a test installation, so I would have to keep in mind to be cautious not to have it connect to my ESX hosts right away. The installation is pretty straightforward: just choose your pre-created DSN, let the installer use the existing database instead of wiping it, and choose to manually connect your hosts and update the agents. I found, however, that the installation should be done as local admin, not domain admin. This might have something to do with our Samba domain, might be related to something else or might even be in the documentation. It did cost me a few hairs, but eventually it wasn't anything that could not be conquered.

Once the installation was finished, I connected to the newly created vCenter and found everything in place: hosts, VMs, folder structures, resource pools, permissions and so on and so forth. Tests finished, ready to roll.

The day of the migration


During off hours this morning (a 6-hour time shift can be very helpful sometimes) I went away, stopped the production and newly created VPXDs, web services and so on, created a new backup of the database and moved it to the newly created environment. After restoring the database I confidently wanted to start the VPXD, only to find it crashing right away. What went wrong?

With the newly created installation I had also used the latest (and greatest?) release of 4.1, Update 3. The database of U3 is, naturally, not compatible with U2. There might be a quicker way to solve this; I opted for a quick reinstallation of vSphere, as this would update the database accordingly. After all, I had no reason to expect failure and, in the unlikely event something did actually go wrong, I could still fail back to the old and tested (and buggy) environment. Everything went as expected though.

Once the setup was finished I connected to the vCenter, reconnected my three hosts and found everything to be working just like it should. Playing around with the vSphere client I noticed only a few things to be off.

  1. vSphere Service Status display was not working
  2. Two of the plugins available for installation are not working as expected.
I had an idea what these issues might be related to. The old setup was not based on FQDNs; everything was hardwired to IP addresses. The Service Status module would thus try to connect to my old environment, which has been shut down. This post however quickly resolved issue number one: just replace the mentioned variables' values with the correct ones, restart vCenter for good measure and reconnect with the vSphere client. Now the status is working.

As for the plugins, one is the Converter plugin. I tried installing it through the vSphere client and it pointed to my old IP address. Just install Converter on your vSphere host and it will prompt for connection data for the new environment, thus registering the installation properly. The second plugin is vcIntegrity which, as far as I can tell, is Update Manager related. I could just go and install the Update Manager on my new box, but I don't want that. For such a small environment I opt to manage updates manually in the future, so I will have to have a look into how to fix that minor issue (is it really minor?). vcIntegrity also shows up as an error in vSphere Service Status.

Cleanup

I have an Operations Manager appliance running, which wouldn't reconnect to the new environment. Running very short on time, and really treating it as fire and forget, I didn't know how to access the admin UI. I guess I would have been able to rerun the setup wizard, but I just redeployed the appliance. The HP Virtualization Performance Viewer, however, had already been using an FQDN and moving it to the new environment worked seamlessly.

I'm happy to see that the migration was rather simple, with very few caveats. Log file usage so far is good; I will have to wait and see how it develops in the days to come. The DB itself is rather resilient; I found out recently, while trying to work around the issues I had, that it has self-repairing capabilities. All in all I'm quite happy and think I can go on vacation tonight and have sound nights of sleep without having to worry about a crashing vSphere environment any more.

Wednesday, April 17, 2013

ZFS

A very short article on some brief ZFS testing. Initially I was running a Debian VM presenting a single LUN as an iSCSI target to my test host. Running on an ageing laptop, the performance was naturally not very good. I then had a vision: what if I could stripe the traffic over multiple devices?

I have two fairly new USB drives lying around here that would do the trick. I created two additional virtual disks, each on one of the USB drives, attached them to the Debian VM, set up ZFS and created a RAID0 pool striping across all three drives. A first dd gave me some promising results; I was writing zeroes at roughly 130MB/s. I'm not too familiar with ZFS and didn't want to waste any time reading a lot, so I just created a 400GB file using the above-mentioned dd method and exported it to my host as an iSCSI LUN. Performance in the VMs however was not very good; IOMeter was seeing 80 to at most 200 IOPS (80% sequential at 4kb). For comparison I created a 1GB ramdisk and attached it to my test VM as a raw device. There I would see >3000 IOPS consistently. Worse, with the above-described ZFS setup I would have serious issues when trying to power on a few VMs. The vCenter client would time out a lot, as the VCVA wouldn't be able to handle its basic functions. At one point I saw a load of 40 inside the VCVA while I was installing a VM from an ISO, nothing overly complicated.

Even with a hugely underpowered lab like mine I figured there was still quite some tuning I could do. However, time being an issue, I opted for a preconfigured NAS appliance. A quick look at Google made it pretty clear I would use Nexenta Community Edition. I figured, ZFS coming from Solaris, I would be better off with a Solaris-based appliance rather than FreeNAS (for which my heart beats though, as it's based on FreeBSD). So far what I'm seeing looks promising. I configured the ZFS pool and iSCSI target slightly differently from my earlier deployment, using three virtual disks, spread across my two USB drives and one internal HD, at 100GB each, created a pool and thus far only exported two 100GB LUNs from the iSCSI target.

On the VMware side I created an SDRS cluster. No other reason than "Because I can".


Given the very low specs of my lab I'm quite happy with the results. IOMeter specs are 80% random IO at 4kb, 2/3 reading, 1/3 writing. At 100% sequential read it is pretty consistent at around 800 IOPS and peaks at a little over 900 IOPS. Heavy cache usage can be seen at 100% sequential writes, where IOMeter just bounces around from 13 to close to 2000 IOPS.

Unfortunately this setup is still not quite capable of handling more than one VM, especially when it comes to swapping.

Lab specs:

T61 laptop with C2D T8300 @ 2.4GHZ CPU, 3GB RAM; the Nexenta CE VM has been assigned 1.5GB (nowhere near enough for proper ZFS testing); internal drive is a WD1600BEVS-08RST2, externals are WD Elements 1023 and 10A2.
It runs Win7 Pro 64-bit and VMware Workstation 9; there are no VMware tools installed in the NAS appliance (I assume that might kick things up a little more).

T400 running ESXi straight from a thumb drive, C2D P8700 @ 2.53GHZ, 8GB RAM.

Both laptops are connected directly via cross ethernet cable, the link's bandwidth is 1GBit/s.

Monday, April 15, 2013

Circular dependencies from hell - a poor man's lab part 2

So, after the embarrassing discovery that I was trying to allocate an already taken IP address to the VCVA, I dropped all W2K8 installation attempts (which went quite well, don't get me wrong, despite the fact that I was also having issues with the IP address; I really need to start writing things down) and deployed another VCVA. Everything went very smoothly; the VCVA is up and running, and after setting the hostname and static IP, toggling SSL certificate recreation and a reboot, everything is working smoothly now.

As I type this I am installing another Debian Linux as a test VM to play around with Iometer. I may try to optimize the iSCSI link to utilize jumbo frames, but for now I will just leave everything as it is. I am very happy with the general performance of the VMs so far and gotta say this setup looks very promising; it may actually be a very capable lab setup.

Updates will follow tomorrow.

Circular dependencies from hell - a poor man's lab

Important update at the end of the article! It voids part of the article; I will go back to deploying the VCVA and then continue.

Recently I installed ESXi on my Thinkpads (1 and 2) using a USB thumb drive. The installation was very straightforward and great for a lab environment. However, a host without a datastore is no good and the internal drive is not to be messed with. The NAS in the living room is slow and only accessible via wifi (unless I want to carry the laptop around). Initially I just plugged in an old USB HDD to see if I could use it. Out of the box that doesn't work, most likely because ESX does not support USB storage for datastores (USB can be passed through to VMs though, so it might be to prevent conflicts).

A virtual NAS might be a good way to make it all very portable and flexible, I thought. So I just grabbed the T61, installed Workstation 9 (and completed a Cloud Cred task while I did so) and set up a Debian VM that would be my iSCSI target. The host has 3GB RAM, so I figured 2GB for the VM would be fair. That way it's got a little bit of caching leverage. Pop in a crossover cable to connect both laptops and everything should be fine... or so I thought.

I configured a local network between the two hosts, bridged the VM into it, fired up the vCenter client to connect to the ESX and attached the iSCSI target... Here's me showing off my mad MSPaint drawing skills. Anyway, you get the point.

Next I deployed the VCVA onto the datastore, which went surprisingly well... It still boots up nicely, but fails to configure the network via DHCP, as there is no DHCP server around. Confident I would be able to configure an IP manually, I logged into the console, followed the instructions (/opt/vmware/shared/vami/vami_config_net) and configured a static IP address, only to find that I could not connect to the web config. I rebooted the VM countless times and checked resources inside the VM; everything looks good, but the web GUI does not seem to come up.

This is silly, I thought, fired off an "aptitude install isc-dhcp-server" in the iSCSI VM and redeployed the VCVA. Yet again it did not get an IP address during boot. The DHCP logs say one has been assigned, but the VM shows 0.0.0.0. I'm sensing resource contention, but that's kind of the theme of the blog, isn't it? I logged into the VM again, reran dhclient, got an IP, logged in to the web GUI and configured the appliance straightforwardly (EULA, default SSO config). All is well in VCVA land; it takes a while, but eventually all services start up fine.

Next I configured a hostname and static IP address, which became active instantly (at least ping works), but that's pretty much it. I get the certificate warning in the browser (and vCenter client) but cannot connect any further. As if the lighttp proxy works, but the Tomcat handling the actual request doesn't play along. Checking the various VMware related services in /etc/init.d shows them all as running, however I notice once again that the hostname change is not reflected in the VM. I've seen that before: you change the hostname in the web config, reboot the appliance and are stuck with localhost.localdom again. That's a bit disappointing. I then configured the hostname and corresponding /etc/hosts file manually on the console, as I could still not access the web config, and rebooted the VM. Unfortunately that doesn't seem to do the trick, as now I was greeted with "Checking vami-sfcbd status........failed, restarting vami-sfcbd". That's where I'm getting really annoyed. As much as I like the idea of the appliance, a lot of times it's more pain than gain. So as I type this I am installing a W2K8 server that will host my vCenter...

Summary

Another more or less road-trip approach (as in "It's not the outcome that counts, but how you get there") that shows that with a little creativity you can actually set up a simple and very cheap lab at home, one that will be more powerful (and especially more of a training experience) than installing ESXi in a VM. If you're (currently) limited by your hardware, e.g. you do not own a home lab and none of your accessible workstations/laptops runs on at least an Intel Core i3, you're more or less stuck with this solution if you intend to run 64-bit VMs inside your lab ESXi. And it also shows that operating the VCVA outside its recommended specs and/or in an unsupported environment is not always a good idea.

Update

Forget all the ranting about VCVA, it was my own fault. I had the T61 configured to the same IP as the VCVA ;).

Friday, April 12, 2013

Raw Device Mappings, SCSI Bus Sharing and VMotion

I keep bumping into this issue time and time again and find myself not using exactly the right terminology to explain it, it seems. Just today I was talking to Ben and again we disagreed on the topic, at least to some extent. We did not end up arguing, as I have before during a job interview, but settled for a draw.

So once and for all (and mostly just for my brain to remember the terminology by writing it down): VMs using Raw Device Mappings (applies to physical and virtual) and SCSI Bus Sharing (Option "Physical" for the SCSI controller, reads: "Virtual disks can be shared between virtual machines on any server.") cannot be vmotioned! See also KB1003797.

The reason (correct me if I'm wrong, storage is not my strongest side) is that the VM's virtual SCSI controller is mapped through to the physical SCSI controller, or rather the HBA of the host, giving the VM exclusive and direct access to the SCSI device.

When to use this configuration?

In order to run certain configurations of a few clustering products, such as Oracle RAC, on VMware ESXi you may need a shared storage device. If you want to run a two-node cluster of any sort by putting VM1 on host A and VM2 on host B to maximize your failover capacity, you have quite a few options to set up your shared storage devices. A shared VMDK comes to mind: just add a VMDK to VM1 and reuse the same one for VM2. However, this setup does not support concurrent write access (for O10R2 RACs on RHEL4 and 5 this means node crashes). Software iSCSI inside the VM can also be utilized and will give you full vMotion capability, as it only relies on a network connection, but you may not get the performance you want/need. Lastly, adding a raw device mapping on a separate virtual SCSI controller to maximize performance is an option; the SCSI controller has to be configured as "physical". While the above two configurations still allow you to migrate the VMs, with this setup you will be greeted with an error message saying that the VM is configured with a device that prevents migration.
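For illustration, a hedged PowerCLI sketch of that last variant; the device name, datastore path and VM names are placeholders, not a tested recipe:

# Node 1: add a physical-mode RDM and put it on its own SCSI controller with physical bus sharing
$rdm = New-HardDisk -VM (Get-VM "clusternode1") -DiskType RawPhysical `
    -DeviceName "/vmfs/devices/disks/naa.600000000000000000000001"
New-ScsiController -HardDisk $rdm -Type VirtualLsiLogicSAS -BusSharingMode Physical

# Node 2: attach the existing RDM pointer file and give it the same controller configuration
$shared = New-HardDisk -VM (Get-VM "clusternode2") `
    -DiskPath "[datastore1] clusternode1/clusternode1_1.vmdk"
New-ScsiController -HardDisk $shared -Type VirtualLsiLogicSAS -BusSharingMode Physical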

Light at the end of the tunnel!

There is, however, a fully supported way of doing things now. With the introduction of Fault Tolerance it became a necessity to be able to simultaneously write to a VMDK file.

Enter the multi-writer flag (KB1034165).

Disabling the concurrent write access protection of a VMDK solves the problem of cluster nodes blocking vMotion and thus DRS, and creating a nightmare scenario for host maintenance, where you have to go through the full length of your change management process, including the shutdown of the VMs on the host in question. I have heard numerous positive reports about this mechanism but have yet to give it a whirl myself. In any case, VMFS is a capable cluster file system, and given the underlying storage system did not fall off a dump truck, you should be good to go with this scenario.
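KB1034165 essentially boils down to adding a scsiX:Y.sharing = "multi-writer" entry for the shared disk to each participating VM. A minimal sketch of how that could be scripted, assuming the shared disk sits at SCSI 1:0 and the VMs are powered off (names are placeholders):

# Set the multi-writer flag for the shared disk on both cluster nodes
foreach ($vmName in "clusternode1", "clusternode2") {
    New-AdvancedSetting -Entity (Get-VM $vmName) -Name "scsi1:0.sharing" `
        -Value "multi-writer" -Confirm:$false
}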

Happy clustering!

Sunday, April 7, 2013

ESXi on T61

In the spirit of over-provisioning, or rather ignoring limited hardware specs, I realized I could just give it a whirl and boot up the old T61 with the USB thumb drive I created recently. And guess what, it works just the same.



Tuesday, April 2, 2013

ESXi on Lenovo T400

Just as expected, not a challenge at all. Hardly worth writing about, but I will anyways. I just installed ESXi to a thumb drive on a Lenovo T400. The installation was very straight forward, just like installing to any supported piece of hardware.

As for the playing-around-with-it part, I have to disappoint for now. As I only have very limited access to a second laptop at the moment, I could only test connectivity via the vSphere Client once, nothing more so far. Next week this will change: I will have a second laptop. That will also give me a few days to figure out what to use as storage for the box. The local disk is pretty much out of the question. Another USB drive might come in handy; I still have a few lying around here somewhere. They will not exactly make this a speedy showcase, but hey, better than nothing.

No fiddling with additional drivers: just plug into the local network, restart the management network and connect to the IP acquired via DHCP, or reconfigure your management network with a static IP, whatever floats your boat.
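If you would rather script the switch to a static address once the host answers on its DHCP lease, something along these lines should do from PowerCLI. Treat it as a sketch: vmk0 as the management vmkernel port and all addresses here are assumptions, and you will of course lose the session the moment the IP changes.

# Hedged sketch: move the management vmkernel port to a static IP once the
# host is reachable on its DHCP address. All addresses here are made up.
Connect-VIServer -Server 192.168.1.42 -User root
$vmk = Get-VMHostNetworkAdapter -VMKernel -Name vmk0
Set-VMHostNetworkAdapter -VirtualNic $vmk -IP 192.168.1.50 -SubnetMask 255.255.255.0 -Confirm:$false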

Tuesday, March 26, 2013

OpenQRM for ESXi management? A road trip approach!

Recently I was pointed towards OpenQRM for hypervisor management (I realize it's good for a lot more than just that). Wondering what it could do, I went ahead and set up a PoC... with a surprising outcome. Please keep in mind I have not gotten around to reading the OpenQRM documentation (yet), which I'm sure is extensive and thorough.

In keeping with the blog title I shall elaborate a little on the test lab "infrastructure". Well, infrastructure is a bit over the top, it's all being run from an aging laptop. The laptop in question is a Lenovo Thinkpad T400, anno ca. 2009, powered by an Intel Core 2 Duo P8700 CPU (not the best choice when it comes to virtualizing hypervisors, I tell ya!), 8GB RAM and a WD3200BEVS-0 (320GB Western Digital notebook drive, 5400 rpm, 8MB cache). You can see where this is heading. Running Ubuntu 12.04 LTS, it has been serving me quite well as my day-to-day laptop for internet, mail, remote work and small virtualization projects on top of VMware Workstation, VirtualBox and so forth. However, recently I went a little further.

The PoC


At first I went ahead and installed ESXi 5.1 in VMware Workstation 9, just to see what would happen. In a nutshell: it works and will allow you to run nested 32-bit VMs. A few years ago I had set up a small PoC of vCenter 4.1 connecting to a two-node Oracle 10g Real Application Cluster, including two ESXi 4.1 server VMs, all running inside VirtualBox, spread across two hosts with 4GB RAM each. Why would I do such a thing? Well, one reason was the so-called BIC factor. BIC == Because I Can. Labs are fun. The other was that in a client project we were seeing some DB constraints and I wanted to see how vCenter would behave alongside a load-balanced DB cluster. I used the same cluster setup later to connect the - back then - brand new HP Quality Center 11, only to find a minor flaw in their documentation, but that's a different story. It worked just nicely, but was hellishly slow. Both hosts were swapping like mad, and the small Debian VM providing one iSCSI LUN to the ESXes for persistent storage and one to the RAC for their DBs was screaming at me to terminate it, it was in major pain. But the main outcome was, in fact, that it did work!

So, in order to have a bunch of hosts for OpenQRM to connect to, I set up a complete lab, starting with a Win2K8R2 64-bit VM to act as Active Directory controller, DNS server, vCenter Client and PowerCLI terminal. 1GB of RAM is more than enough; I've come to realize that Windows boxes are not so bad on their memory consumption either. Secondly I deployed the vCenter Server Appliance, only to run into a few gotchas. Using Tom Fojta's blog and Duncan Epping's Yellow-Bricks for pointers, I went ahead, deployed and downsized to 2GB right away.

If you do that, watch out for the following:
- setting hostname/fqdn using the admin web gui might fail
- recreating SSL certificates to reflect the hostname/ip address change might fail
- active directory setup might fail

I suggest running the appliance with lots of RAM at first; swapping won't be too bad at this stage and you will be able to configure it fully in due time. Once you have made sure that everything works the way you want it to, go ahead and downsize, but to no less than 2GB of RAM. That is about the break-even point where the swapping inside the VM gets so bad that the appliance is virtually unusable (load dropping from 10 to about 6 only after 20(!!!) minutes of uptime).
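In this lab the appliance lives in VMware Workstation, so the resize happens in the Workstation UI. Where the vCSA runs on an ESXi host, though, the downsizing could be scripted from PowerCLI against that host; a sketch with an assumed host and VM name (the VM has to be powered off to shrink its memory):

# Hedged sketch: shrink the appliance to 2GB RAM via the ESXi host running it.
# Host name and VM name are assumptions, adjust to your inventory.
Connect-VIServer -Server esx01.lab.local -User root
$vcsa = Get-VM -Name "VMware vCenter Server Appliance"
Shutdown-VMGuest -VM $vcsa -Confirm:$false
# ...wait for the guest OS to power off, then:
Set-VM -VM $vcsa -MemoryMB 2048 -Confirm:$false
Start-VM -VM $vcsa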

Back on topic: the next thing was to get Auto Deploy up and running. I had a DHCP/BOOTP server VM from a different PoC lying around, which was even preconfigured to serve ESXi 5.1 images, but I decided to go with the TFTP server on the vCenter appliance and just used that VM for DHCP (I know the appliance comes with its own DHCP server). I fired up the Auto Deploy service, created a VM for ESXi, generated a MAC address and added it to the DHCP config so it would be assigned to the right group and network range, and gave it a go... the VM's PXE client came up and complained that there was no server profile. So next I went to the AD server, followed the Auto Deploy Proof of Concept setup guide, added my ESXi image profile and my Auto Deploy rules, and then deployed my first host.

Once it was up and running I went through the painfully slow process of creating and adjusting a host profile; by then my laptop was "swapping like a pig", so to speak. Suffice it to say the host profile editor in the vCenter Client will not work with this setup. It fails with "vCenter Server took too long to respond". The vSphere Web Client can still be used, although I would like to keep it out of the setup, as it introduces a minimum of 800MB more memory consumption on the vCenter appliance and some unknown amount of resource consumption on the Flash plugin side (yes, the client browser is on the same laptop). After some fiddling around with the host profiles the hosts came up just fine: keyboard layout German, reflecting the physical keyboard of the host, root password set to 'start#123' (this one is actually important and I will get to it in a few moments), NTP configured, cluster in vCenter configured and hosts added to it by FQDN, the whole shebang.
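For completeness, the Auto Deploy side boils down to a handful of PowerCLI commands along these lines. The depot path, image profile name, rule name and MAC address below are placeholders for illustration, not what I actually typed.

# Rough sketch of the Auto Deploy plumbing in PowerCLI (all names are placeholders).
Connect-VIServer -Server vcsa.lab.local
Add-EsxSoftwareDepot "C:\depot\VMware-ESXi-5.1.0-offline-bundle.zip"
$imgProfile = Get-EsxImageProfile -Name "ESXi-5.1.0-799733-standard"
# A cluster and a host profile can be added as further -Item entries.
$rule = New-DeployRule -Name "lab-esxi" -Item $imgProfile -Pattern "mac=00:50:56:01:23:45"
Add-DeployRule -DeployRule $rule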

Introducing OpenQRM

Luckily, at this point the OpenQRM VM was already set up. I was seeing some unpleasant load situations on this poor laptop already...

ronald@thinkpad:~$ uptime 
 16:54:28 up 3 days, 18:39,  5 users,  load average: 22.81, 19.91, 13.92

The VM in question is a Debian 64-bit minimal installation with OpenSSH server, 512MB RAM and 1 CPU. It was actually provisioned first, but with the VMs to come in mind I wanted to find a compromise between what OpenQRM may or may not require and what I could (or definitely could not) spare. The installer is fully automatic: you just download a tgz of less than 3 MB, unpack it and run it. It is targeted at Debian and CentOS distributions and will install the required dependencies automatically, download boot images for server provisioning and pretty much do a bang-up job of configuring itself. Props to the OpenQRM guys for this fantastic installer! Get it fired up, then get yourself a coffee, lean back and watch the show (or do something more sustainable in the meantime, a half-hour run on the treadmill for instance).

Once the install was finished I logged into the GUI, enabled a few plugins at random, including VMware and its dependencies, and started playing around a little. The VMware plugin in OpenQRM will scan the network for available ESX servers and then ask you to provide login credentials so the hosts can be added to its environment and managed by it. While the discovery was running, my browser repeatedly asked me whether I wanted to terminate the script, as it seemed to be running for too long.

First I'd like to note that it did indeed discover both ESX hosts and the vCenter. Well done.

The next thing was to supply login credentials, and that's where things went a little sideways. Entering my trusty and all-time favorite password 'start#123', I was greeted with an error message saying that only the following characters were supported:

[A-Za-z1-0]

That was a major disappointment and an instant mood kill!!! Having gone through the aforementioned lengths to set up the environment only to find this was reason enough to finish up for the day and have dinner. I didn't give up hope just yet, though. I went away, re-enabled the vSphere Web Client, went to the host profile editor and changed the ESX hosts' passwords to a more suitable 'Start123', which, given the limited resources, took me about 45 minutes. By then I was being pushed to finally join the rest of the family at the dinner table... and thus the story ends here.
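In hindsight, a possibly quicker route than the host profile editor would have been changing the root password per host straight from PowerCLI, though with Auto Deploy the host profile remains the source of truth and would win again on the next reboot. A hedged sketch, with host names assumed:

# Sketch only: set a new root password directly on each host (a stop-gap with
# Auto Deploy, since the host profile reapplies on reboot). Names are made up.
foreach ($esx in "esx01.lab.local", "esx02.lab.local") {
    Connect-VIServer -Server $esx -User root -Password 'start#123'
    Set-VMHostAccount -UserAccount (Get-VMHostAccount -Id root) -Password 'Start123'
    Disconnect-VIServer -Server $esx -Confirm:$false
}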

Hope you enjoyed reading my first tech blog post as much as I enjoyed writing it. I shall add a few more in the days to come and talk a little about the things I'm doing. Maybe I'll help someone along the way. Or maybe it's just a brain dump for myself. In any case, I hereby welcome myself to the tech blogosphere! :)

Cheers, Ronald!