Friday, April 19, 2013

Migrate vSphere 4.1 to new host and fresher infra

For the past few weeks one of my vSphere installations has been seeing some major problems that I have not been able to drill down into properly. I set it up a few years ago and then handed it over to one of my coworkers for management; since then it became over-utilized, poorly managed and started acting weird. Recently it has been crashing every three days or so because the transaction log file fills up. The infrastructure (Essentials license) runs on SQL Express 2005 (the one bundled with vSphere 4.1U2), the DB runs at its default settings (simple recovery model, one transaction log file, auto-grow, max size 2GB) and there is a poorly configured backup (my bad, I was a bit of a greenhorn back then). What happens now is that a transaction running on the DB fills up the log file and then crashes the VPXD. Not ideal.
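
For anyone hitting the same symptom: a quick, hedged way to confirm what is going on is to check the log fill level and the recovery model with sqlcmd from a PowerShell prompt on the vCenter box. The instance name SQLEXP_VIM and database name VIM_VCDB below are just the usual defaults of the bundled installation, not something I verified on this particular box.

    # check transaction log fill level and the recovery model -- instance and
    # database names (SQLEXP_VIM, VIM_VCDB) are assumptions based on the
    # bundled vCenter SQL Express defaults, adjust to your environment
    sqlcmd -S ".\SQLEXP_VIM" -E -Q "DBCC SQLPERF(LOGSPACE)"
    sqlcmd -S ".\SQLEXP_VIM" -E -Q "SELECT name, recovery_model_desc FROM sys.databases WHERE name = 'VIM_VCDB'"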

Because there are a few other things that bother me about this environment, I figured it's about time I moved it to a separate VM and reinstalled it properly, leveraging the experience of the past three years or so and taking into account the best practices from my vSphere Install, Configure and Manage course and certification last year.

The new setup

Firstly I rolled out a Windows 2008 R2 server, 64-bit, configured hostname and DNS domain, added it to our Samba domain (remember to configure this), installed all available updates and gave it a few reboots for good measure. Once the box was up and running smoothly I installed a fresh SQL Express 2008 with Management Tools from here (I had actually installed a few other versions before that, only to constantly run into compatibility issues with W2K8R2 and so forth, but this one is fully supported). Once that was out of the way I imported an older backup of the production database; all went smoothly.
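
For reference, the restore can also be done without the GUI from a PowerShell prompt. This is only a sketch: the instance, database and file names below (SQLEXPRESS, VCDB, C:\temp\VCDB.bak) are placeholders, and the MOVE targets have to match whatever RESTORE FILELISTONLY reports for your backup.

    # list the logical file names contained in the backup first
    sqlcmd -S ".\SQLEXPRESS" -E -Q "RESTORE FILELISTONLY FROM DISK = 'C:\temp\VCDB.bak'"
    # then restore, relocating data and log files to the new server's paths
    sqlcmd -S ".\SQLEXPRESS" -E -Q "RESTORE DATABASE VCDB FROM DISK = 'C:\temp\VCDB.bak' WITH MOVE 'VCDB' TO 'C:\SQLData\VCDB.mdf', MOVE 'VCDB_log' TO 'C:\SQLData\VCDB_log.ldf', REPLACE"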

Next step was to create a data source. Following the documentation (which I did not need for this project, I might say so proudly) I created a System DSN that points to the newly created and populated database and tested it. The appropriate SQL client libraries had already been installed along with the SQL Server Express edition. Keep in mind that for a client-server setup you need to install a supported SQL client; the OS's preinstalled client has a beard longer than Santa Claus... well, you get the point. The Express edition of SQL 2008 R2 seems to have TCP/IP enabled by default, something I seem to remember being different on the full-blown servers. Also, 2008 does install a SQL Server Agent service, but don't bother trying to activate it, it will not work in Express, which is quite confusing.
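
If you are unsure whether TCP/IP really is enabled, a quick check that does not involve clicking through Configuration Manager is to force a TCP connection with sqlcmd from a PowerShell prompt; the instance name below is an assumption.

    # forcing the tcp: prefix fails immediately if the TCP/IP protocol is
    # disabled for the instance (instance name SQLEXPRESS is an assumption);
    # named instances also need the SQL Browser service running to resolve the port
    sqlcmd -S "tcp:localhost\SQLEXPRESS" -E -Q "SELECT @@SERVERNAME, @@VERSION"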

With SQL Express and the data source working smoothly it was time to install vCenter. At this point I was doing merely a test installation, so I had to be cautious not to have it connect to my ESX hosts right away. The installation is pretty straightforward: just choose your pre-created DSN, let the installer use the existing database instead of wiping it and choose to manually connect your hosts and update the agents. I found, however, that the installation should be done as local admin, not domain admin. This might have something to do with our Samba domain, might be related to something else or might even be in the documentation. It cost me a few hairs, but eventually it wasn't anything that could not be conquered.

Once the installation was finished, I connected to the newly created vCenter and found everything in place: hosts, VMs, folder structures, resource pools, permissions and so on. Tests finished, ready to roll.

The day of the migration

During off hours this morning (a 6-hour time shift can be very helpful sometimes) I went ahead, stopped the production and the newly created VPXDs, web services and so on, created a new backup of the database and moved it to the newly created environment. After restoring the database I confidently went to start the VPXD, only to find it crashing right away. What went wrong?
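
Roughly what the shutdown-and-backup step looks like from a PowerShell prompt on the old server; the service names (vctomcat, vpxd) and the instance/database names are what I'd expect on a default 4.1 install, so treat them as assumptions and double-check in services.msc.

    # stop the web services and the vCenter server service before the final backup
    # (service names are assumptions for a default 4.1 install)
    net stop vctomcat
    net stop vpxd
    # full backup of the vCenter database (names are assumptions, see above)
    sqlcmd -S ".\SQLEXP_VIM" -E -Q "BACKUP DATABASE VIM_VCDB TO DISK = 'C:\temp\VIM_VCDB.bak' WITH INIT"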

With the newly created installation I had also used the latest (and greatest?) release of 4.1, Update 3. The U2 database is naturally not compatible with a U3 vCenter. There might be a quicker way to solve this, but I opted for a quick reinstallation of vCenter Server, as this would update the database schema accordingly. After all, I had no reason to expect failure, and in the unlikely event something did actually go wrong, I could still fail back to the old and tested (and buggy) environment. Everything went as expected, though.

Once the setup was finished I connected to the vCenter, reconnected my three hosts and found everything to be working just like it should. Playing around with the vSphere Client I noticed only a few things that were off:

  1. The vSphere Service Status display was not working.
  2. Two of the plugins available for installation were not working as expected.
I had an idea what these issues might be related to. The old setup was not based on FQDNs; everything was hardwired to IP addresses. The Service Status module would thus try to connect to my old environment, which has been shut down. This post, however, quickly resolved issue number one: just replace the mentioned variables' values with the correct ones, restart vCenter for good measure and reconnect with the vSphere Client. Now the status is working.

As for the plugins, one is the Converter plugin. I tried installing it through the vSphere Client and it pointed to my old IP address. Just install Converter on your vCenter server and it will prompt for connection data for the new environment, thus registering itself with the new installation properly. The second plugin is vcIntegrity, which as far as I can tell is Update Manager related. I could just go and install Update Manager on my new box, but I don't want that. For such a small environment I opt to manage updates manually in the future, so I will have to look into how to fix that minor issue (is it really minor?). vcIntegrity also shows up as an error in vSphere Service Status.

Cleanup

I have an Operations Manager appliance running which wouldn't reconnect to the new environment. Since I was running very short on time and had really treated it as fire-and-forget, I didn't know how to access its admin UI. I guess I would have been able to rerun the setup wizard, but I just redeployed the appliance instead. The HP Virtualization Performance Viewer, however, had already been using an FQDN, and moving it to the new environment worked seamlessly.

I'm happy to see that the migration was rather simple with very few caveats. Log file usage looks good so far; I will have to wait and see how it develops in the days to come. The DB itself is rather resilient: I recently found out it has self-repairing capabilities while I was trying to work around the issues I had. All in all I'm quite happy and think I can go on vacation tonight and have sound nights of sleep without having to worry about a crashing vSphere environment anymore.

Wednesday, April 17, 2013

ZFS

A very short article on some brief ZFS testing. Initially I was running a Debian VM presenting a single LUN as an iSCSI target to my test host. Running on an ageing laptop, the performance was naturally not very good. Then I had a vision: what if I could stripe the traffic over multiple devices?

I have two fairly new USB drives lying around here that would do the trick. I created two additional virtual disks, each on one of the USB drives, attached them to the Debian VM, set up ZFS and created a RAID0 pool striping across all three drives. A first dd gave me some promising results: I was writing zeroes at roughly 130MB/s. I'm not too familiar with ZFS and didn't want to waste any time reading a lot, so I just created a 400GB file using the above mentioned dd method and exported it to my host as an iSCSI LUN. Performance in the VMs, however, was not very good; IOMeter was seeing 80 to at most 200 IOPS (80% sequential at 4kB). For comparison I created a 1GB ramdisk and attached it to my test VM as a raw device. There I would see >3000 IOPS consistently. What's more, with the ZFS setup described above I would have serious issues when trying to power on a few VMs. The vSphere Client would time out a lot as the VCVA wouldn't be able to handle its basic functions. At one point I saw a load of 40 inside the VCVA while I was installing a VM from an ISO, nothing overly complicated.
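
For what it's worth, the striped pool itself took only a couple of commands. The device names and sizes below are examples, not the exact ones from my VM, and a plain dd of zeroes is a very crude benchmark since the writes get buffered heavily.

    # RAID0-style stripe across three devices -- no redundancy whatsoever
    zpool create -f tank /dev/sdb /dev/sdc /dev/sdd
    zpool status tank
    # quick-and-dirty write test, then the 400GB backing file for the iSCSI LUN
    dd if=/dev/zero of=/tank/ddtest bs=1M count=4096 conv=fdatasync
    dd if=/dev/zero of=/tank/lun0 bs=1M count=409600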

Even with a hugely underpowered lab like mine I figured there was still quite some tuning I could do. However, with time being an issue, I opted for a preconfigured NAS appliance. A quick look at Google made it pretty clear I would use Nexenta Community Edition; since ZFS comes from Solaris, I figured I would be better off with a Solaris-based appliance rather than FreeNAS (for which my heart beats, though, as it's based on FreeBSD). So far what I'm seeing looks promising. I configured the ZFS pool and iSCSI target slightly differently from my earlier deployment, using three 100GB virtual disks spread across my two USB drives and one internal HD, created a pool and so far only exported two 100GB LUNs out of the iSCSI target.

On the VMware side I created an SDRS cluster, for no other reason than "Because I can".

Given the very low specs of my lab I'm quite happy with the results. The IOMeter profile is 80% random IO at 4kB, 2/3 reads, 1/3 writes. At 100% sequential reads performance is pretty consistent at around 800 IOPS, peaking at a little over 900 IOPS. Heavy cache usage can be seen at 100% sequential writes, where IOMeter just bounces around from 13 to close to 2000 IOPS.

Unfortunately this setup is still not quite capable of handling more than one VM, especially when it comes to swapping.

Lab specs:

T61 laptop with a C2D T8300 @ 2.4GHz CPU and 3GB RAM; the Nexenta CE VM has been assigned 1.5GB (nowhere near enough for proper ZFS testing). The internal drive is a WD1600BEVS-08RST2, the externals are WD Elements 1023 and 10A2. It runs Win7 Pro 64-bit with VMware Workstation 9; there are no VMware Tools installed in the NAS appliance (I assume that might kick things up a little more).

T400 running ESXi straight from a thumb drive, C2D P8700 @ 2.53GHz, 8GB RAM.

Both laptops are connected directly via a crossover Ethernet cable; the link's bandwidth is 1Gbit/s.

Monday, April 15, 2013

Circular dependencies from hell - a poor man's lab part 2

So, after the embarrassing discovery that I was trying to assign an already taken IP address to the VCVA, I dropped all W2K8 installation attempts (which went quite well, don't get me wrong, despite the fact that I was also having issues with the IP address there; I really need to start writing things down) and deployed another VCVA. Everything went very smoothly; the VCVA is up and running, and after setting the hostname and a static IP, toggling SSL certificate regeneration and a reboot, everything is working now.

As I type this I am installing another Debian Linux as a test VM to play around with IOMeter. I may try to optimize the iSCSI link to utilize jumbo frames, but for now I will just leave everything as it is. I am very happy with the general performance of the VMs so far and gotta say this setup looks very promising; it may actually be a very capable lab.
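
Should I get around to the jumbo frame experiment, the ESXi side would look roughly like the sketch below; a standard vSwitch called vSwitch1 and a dedicated iSCSI vmkernel port vmk1 are assumptions for my lab, and the Workstation bridge, the Debian VM and the physical NICs would have to support MTU 9000 as well.

    # raise the MTU on the vSwitch carrying the iSCSI traffic and on the
    # iSCSI vmkernel interface (names are placeholders, assumes ESXi 5.x esxcli)
    esxcli network vswitch standard set --vswitch-name vSwitch1 --mtu 9000
    esxcli network ip interface set --interface-name vmk1 --mtu 9000
    # verify
    esxcli network ip interface list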

Updates will follow tomorrow.

Circular dependencies from hell - a poor man's lab

Important update at the end of the article! It voids part of the article; I will go back to deploying the VCVA and then continue.

Recently I installed ESXi on my Thinkpads (1 and 2) using a USB thumb drive. The installation was very straightforward and great for a lab environment. However, a host without a datastore is no good, and the internal drive is not to be messed with. The NAS in the living room is slow and only accessible via wifi (unless I want to carry the laptop around). Initially I just plugged in an old USB HDD to see if I could use it. Out of the box that doesn't work, most likely because ESXi does not support USB storage for datastores (USB can be passed through to VMs though, so it might be to prevent conflicts).

A virtual NAS might be a good way to make it all very portable and flexible, I thought. So I just grabbed the T61, installed Workstation 9 (and completed a Cloud Cred task while I did so) and set up a Debian VM that would be my iSCSI target. The host has 3GB RAM, so I figured 2GB for the VM would be fair; that way it's got a little bit of caching leverage. Pop in a crossover cable to connect both laptops and everything should be fine... or so I thought.
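
I won't go into every detail of the target setup; for the record, exporting a file-backed LUN from a Debian box looks roughly like this. I'm not saying this is exactly what I ran: the IQN, path and size are made up, and tgt is just one of the target implementations available on Debian.

    # install the tgt iSCSI target and export a sparse file-backed LUN
    apt-get install tgt
    dd if=/dev/zero of=/srv/iscsi-lun0 bs=1M count=0 seek=102400   # sparse 100GB backing file
    tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2013-04.local.lab:lun0
    tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /srv/iscsi-lun0
    tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL      # no auth, lab only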

I configured a local network between the two hosts, bridged the VM into it, fired up the vSphere Client to connect to the ESX and attached the iSCSI target... Here's me showing off my mad MSPaint drawing skills. Anyway, you get the point.

Next I deployed the VCVA onto the datastore, which went surprisingly well... It still boots up nicely, but fails to configure the network via DHCP, as there is no DHCP server around. Confident I would be able to configure an IP manually, I logged into the console, followed the instructions (/opt/vmware/share/vami/vami_config_net) and configured a static IP address, only to find that I could not connect to the web config. I rebooted the VM countless times and checked resources inside the VM; everything looks good, but the web GUI does not seem to come up.

This is silly, I thought, fired off an "aptitude install isc-dhcp-server" in the iSCSI VM and redeployed the VCVA. Yet again it did not get an IP address during boot. The DHCP logs say one has been assigned, but the VM shows 0.0.0.0. I'm sensing resource contention, but that's kind of the theme of this blog, isn't it? Logged into the VM again, reran dhclient, got an IP, logged into the web GUI and configured the appliance straightforwardly (EULA, default SSO config). All is well in VCVA land; it takes a while, but eventually all services start up fine.

Next I configured a hostname and static IP address, which became active instantly (at least ping works), but that's pretty much it. I get the certificate warning in the browser (and the vSphere Client) but cannot connect any further, as if the lighttpd proxy works but the Tomcat instance handling the actual request doesn't play along. Checking the various VMware related services in /etc/init.d shows them all as running, however I notice once again that the hostname change is not reflected in the VM. I've seen that before: you change the hostname in the web config, reboot the appliance and are stuck with localhost.localdom again. That's a bit disappointing. I then configured the hostname and the corresponding /etc/hosts entry manually on the console, as I still could not access the web config, and rebooted the VM. Unfortunately that doesn't seem to do the trick, as now I was greeted with "Checking vami-sfcbd status........failed, restarting vami-sfcbd". That's where I'm getting really annoyed. As much as I like the idea of the appliance, a lot of the time it's more pain than gain. So as I type this I am installing a W2K8 server that will host my vCenter...
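
For completeness, the manual steps inside the VCVA boil down to something like the following; the hostname and addresses are examples, and the supported way is of course the web config on port 5480 or the vami_config_net menu.

    # menu-driven static IP / DNS / gateway configuration
    /opt/vmware/share/vami/vami_config_net
    # the hostname change that the web config kept losing, done by hand
    hostname vcva01
    echo "vcva01" > /etc/HOSTNAME                          # assuming the SLES-based appliance
    echo "192.168.10.20  vcva01.lab.local vcva01" >> /etc/hosts
    reboot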

Summary

Another more or less road-trip approach (as in "It's not the outcome that counts, but how you get there") that shows that with a little creativity you can actually set up a simple and very cheap lab at home that is more powerful (and especially more of a training experience) than installing ESXi in a VM. If you are (currently) limited by your hardware, e.g. you do not own a home lab and none of your accessible workstations or laptops runs at least an Intel Core i3, you are more or less stuck with this solution if you intend to run 64-bit VMs inside your lab ESXi. And it also shows that operating the VCVA outside its recommended specs and/or in an unsupported environment is not always a good idea.

Update

Forget all the ranting about the VCVA, it was my own fault. I had the T61 configured with the same IP as the VCVA ;).

Friday, April 12, 2013

Raw Device Mappings, SCSI Bus Sharing and VMotion

I keep bumping into this issue time and time again and, it seems, find myself not using exactly the right terminology to explain it. Just today I was talking to Ben and again we disagreed on the topic, at least to some extent. We did not end up arguing as I have before during a job interview, but settled for a draw.

So once and for all (and mostly just for my brain to remember the terminology by writing it down): VMs using Raw Device Mappings (applies to physical and virtual) and SCSI Bus Sharing (Option "Physical" for the SCSI controller, reads: "Virtual disks can be shared between virtual machines on any server.") cannot be vmotioned! See also KB1003797.

The reason (correct me if I'm wrong, storage is not my strongest side) is that the VM's virtual SCSI controller is mapped through to the physical SCSI controller, or rather the HBA of the host, giving the VM exclusive and direct access to the SCSI device.

When to use this configuration?

In order to run certain configurations of a few clustering products, such as Oracle RAC, on VMware ESXi you may need a shared storage device. If you want to run a two-node cluster of any sort by putting VM1 on host A and VM2 on host B to maximize your failover capacity, you have quite a few options for setting up your shared storage devices. A shared VMDK comes to mind: just add a VMDK to VM1 and reuse the same one for VM2. However, this setup does not support concurrent write access (for O10R2 RACs on RHEL4 and 5 this means node crashes). Software iSCSI inside the VM can also be utilized and will give you full VMotion capability, as it only relies on a network connection, but you may not get the performance you want/need. Lastly, adding a raw device mapping on a separate virtual SCSI controller to maximize performance is an option; the SCSI controller has to be configured as "physical". While the above two configurations still allow you to migrate the VMs, this setup will greet you with an error message saying that the VM is configured with a device that prevents migration.

Light at the end of the tunnel!

There is, however, a fully supported way of doing things now. With the introduction of Fault Tolerance it became a necessity to be able to simultaneously write to a VMDK file.

Enter the multi-writer flag (KB1034165).

Disabling the concurrent write access protection of a VMDK solves the problem of cluster nodes blocking VMotion and thus DRS, which creates a nightmare scenario for host maintenance where you have to go through the full length of your change management process, including shutting down the VMs on the host in question. I have heard numerous positive reports about this mechanism but have yet to give it a whirl myself. In any case VMFS is a capable cluster file system, and given that the underlying storage system did not fall off a dump truck, you should be good to go with this scenario.
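
For reference, the flag from KB1034165 ends up as entries like the ones below in the VM's .vmx; it is normally added via Edit Settings > Options > Configuration Parameters while the VM is powered off. The controller/disk positions and the path here are made up, and the shared disks should be eagerzeroedthick.

    # append the multi-writer flag for the shared disks on the second controller
    # (scsi1:0 / scsi1:1 and the vmx path are examples); VM must be powered off
    echo 'scsi1:0.sharing = "multi-writer"' >> /vmfs/volumes/datastore1/dbnode1/dbnode1.vmx
    echo 'scsi1:1.sharing = "multi-writer"' >> /vmfs/volumes/datastore1/dbnode1/dbnode1.vmx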

Happy clustering!

Sunday, April 7, 2013

ESXi on T61

In the spirit of over-provisioning, or rather ignoring limited hardware specs, I realized I could just give it a whirl and boot up the old T61 with the USB thumb drive I created recently. And guess what, it works just the same.

Tuesday, April 2, 2013

ESXi on Lenovo T400

Just as expected, not a challenge at all. Hardly worth writing about, but I will anyway. I just installed ESXi to a thumb drive on a Lenovo T400. The installation was very straightforward, just like installing to any supported piece of hardware.

As for the playing-around-with-it part, I have to disappoint for now. As I have only very limited access to a second laptop at the moment, I could only test connectivity via the vSphere Client once, nothing more so far. Next week this will change: I will have a second laptop. That will also give me a few days to figure out what to use as storage for the box. The local disk is pretty much out of the question. Another USB drive might come in handy; I still have a few lying around here somewhere. They will not exactly make this a speedy showcase, but hey, better than nothing.

No fiddling with additional drivers: just plug into the local network, restart the management network and connect to the IP acquired via DHCP, or reconfigure your management network with a static IP, whatever floats your boat.
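
And if you prefer the shell over the DCUI, the static variant is roughly a two-liner; this assumes a recent ESXi with esxcli, the default vmk0 management interface and example addresses.

    # set a static management IP instead of DHCP, then point the default gateway
    # (interface name and addresses are placeholders)
    esxcli network ip interface ipv4 set --interface-name vmk0 --type static --ipv4 192.168.1.50 --netmask 255.255.255.0
    esxcfg-route 192.168.1.1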