Hopelessly Over-provisioned: May 2013

Tuesday, May 28, 2013

pvscsi vs. LSI Logic SAS

I've talked at great length before about being a poor man with not much of a lab. Suffice it to say that this has not changed since I started this blog. However, I do have access to quite a bit of infrastructure to test and play around with. And so I did.

In a recent innovations meeting one of my colleagues suggested using pvscsi instead of the default LSI Logic drivers. The idea is the same as with vmxnet3 over e1000 adapters: save CPU resources. However, that does not automatically yield a performance improvement. Other people have talked about their findings in the past, and VMware themselves have said something about it too. To my surprise my findings were a bit different from what VMware proposes:

The CPU utilization difference between LSI and PVSCSI at hundreds of IOPS is insignificant. But at larger numbers of IOPS, PVSCSI can save a lot of CPU cycles.

My setup was very simple: a 64-bit W2K8R2 VM with 4 GB RAM and 2 vCPUs on an empty ESX cluster with empty storage. I ran my tests during off hours, and the assigned FC storage LUNs were for test purposes only, so impact from other VMs on the possibly shared storage is unlikely. (Unfortunately I do not know in detail how the storage is set up, so I don't know whether the arrays are shared or dedicated. The controllers will be shared, however.) Apart from the OS drive the VM had two extra VMDKs, each on its own dedicated virtual SCSI controller: one pvscsi, one LSI Logic SAS.

I might have done something seriously wrong but here's what I found:

Using iometer's default access specification (100% random access at 2 KB block size, 67% read) I did indeed find very significant differences, but not the ones I had expected:

pvscsi: Avg Latency: 6.28ms, Avg IOPS: 158, Avg CPU Load 53%
LSI: Avg Latency: 3.16ms, Avg IOPS: 316, Avg CPU Load 34%

Multiple runs confirmed these findings.
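A quick sanity check on these numbers, sketched in shell with awk (the function name in_flight is made up for this example): by Little's law, average latency multiplied by average IOPS gives the average number of I/Os in flight. The sketch uses only the two result lines above.

```shell
# in_flight LATENCY_MS IOPS -> average number of outstanding I/Os (Little's law)
in_flight() {
    awk -v lat="$1" -v iops="$2" 'BEGIN { printf "%.2f\n", lat / 1000 * iops }'
}

in_flight 6.28 158   # pvscsi        -> 0.99
in_flight 3.16 316   # LSI Logic SAS -> 1.00
```

Both adapters come out at roughly one I/O in flight, which would fit iometer running with a single outstanding I/O per target. That is an inference from the figures, not a setting recorded here. If it holds, IOPS in this test is simply 1 / latency, so the run measures per-I/O latency rather than how the adapters behave with a deeper queue.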

Later, as I changed the access specs towards a more real-world scenario, VMware's proposition became more and more true and the values approached each other. At 60% random IO both adapters managed roughly 300 IOPS at 10% CPU load.

Conclusion

I cannot conclude much as I know too little about the storage configuration. However, I wanted to see what happened if I scaled up a little. Using the very same storage I deployed a NexentaStor CE, gave it 16 GB RAM for caching and 2 VMDKs on the same datastores as the initial VM (each Eager Zeroed Thick) and configured a RAID 0 zpool. I configured 4 zvol LUNs inside the storage appliance and handed them out via iSCSI, migrated the W2K8 VM onto the provided storage (and onto a nested ESXi for that matter, just to make it a little more irrelevant) and ran the same tests again. Now utilizing multiple layers of caching I got quite different values:

pvscsi: Avg Latency 1.59ms, Avg IOPS 626, Avg CPU Load 11%
LSI: Avg Latency 1.72ms, Avg IOPS 582, Avg CPU Load 21%

The performance difference is indeed insignificant, and none of this is interesting for enterprise workloads. The CPU utilization difference is significant, however: LSI needs nearly twice the CPU load of pvscsi (21% vs. 11%). As I said before, all of this is irrelevant and pretty much a waste of time; it just shows that the platform doesn't have the bang to properly make use of a paravirtualized SCSI controller to begin with. To me that is a little disappointing and an eye opener.

Follow up

Overriding capacity management, I migrated the VM onto a production cluster to see whether the storage systems there are a bit more capable. However, again the results were not what I expected:

pvscsi: Avg Latency 0.58ms, Avg IOPS 1708, Avg CPU Load 17%
LSI: Avg Latency 0.47ms, Avg IOPS 2126, Avg CPU Load 21%

Again I conclude that none of this is relevant, unfortunately, and I'm going to have to go into questioning the engineering team who set up this storage platform to find some answers as to how they decided what to set up.
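To compare the three runs on a more equal footing, the CPU load can be normalized by throughput. A rough sketch in shell with awk, using only the figures from this post; the helper name cpu_per_kiops is made up, and the CPU figure presumably covers the whole guest (iometer included), so the result is a crude indicator at best.

```shell
# cpu_per_kiops CPU_PERCENT IOPS -> CPU load (percentage points) per 1000 IOPS
cpu_per_kiops() {
    awk -v cpu="$1" -v iops="$2" 'BEGIN { printf "%.1f\n", cpu / iops * 1000 }'
}

echo "test LUNs:   pvscsi $(cpu_per_kiops 53 158)  LSI $(cpu_per_kiops 34 316)"
echo "NexentaStor: pvscsi $(cpu_per_kiops 11 626)  LSI $(cpu_per_kiops 21 582)"
echo "production:  pvscsi $(cpu_per_kiops 17 1708)  LSI $(cpu_per_kiops 21 2126)"
```

By this measure pvscsi is clearly cheaper per I/O only in the NexentaStor run (17.6 vs. 36.1). On the production cluster both adapters land at roughly 10 points per 1000 IOPS, so the lower CPU load of pvscsi there simply goes along with its lower IOPS.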

Invalid configuration for device '0'

This dreaded message came upon me just now when I tried to reconnect a VM. I had previously shut this VM down, exported it as an OVF and imported it into a test environment to run some iometer tests against more powerful storage, comparing pvscsi performance to LSI Logic SAS. After the test environment's trial license expired I threw the entire thing away and wanted to reconnect my original VM, only to find the above-mentioned dreaded message.

Following VMware's KB 2014469 on this issue I first verified that the VM was indeed connected to a free port. I then migrated it to a different host using vMotion, still no good. The third option did do the trick, however, and thus helped me learn something new about ESXi: it can in fact reload a VM's configuration at runtime and thereby resync it with vCenter. And thanks to awk being available, Option 3 can easily be shortened to a one-liner:

vim-cmd vmsvc/reload $(vim-cmd vmsvc/getallvms | grep -i VMNAME | awk '{print $1}')
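One caveat with the one-liner: grep -i matches substrings anywhere in the line, so a second VM called VMNAME-old (or a matching datastore path) yields several IDs. A slightly more careful sketch follows, assuming VM names without spaces and that the name is the second column of the getallvms output. The function name vmid_for and the sample rows are made up for illustration.

```shell
# vmid_for NAME: read `vim-cmd vmsvc/getallvms` output on stdin and print the
# Vmid of the VM whose name is exactly NAME (assumes no spaces in VM names).
vmid_for() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Made-up sample output; on a host, pipe the real command in instead.
sample='Vmid   Name       File                            Guest OS              Version
12     web01      [ds1] web01/web01.vmx           windows7Server64Guest vmx-08
13     web01-old  [ds1] web01-old/web01-old.vmx   windows7Server64Guest vmx-08'

printf '%s\n' "$sample" | vmid_for web01   # prints 12, not 12 and 13
```

On the host itself this would be used as vim-cmd vmsvc/reload "$(vim-cmd vmsvc/getallvms | vmid_for VMNAME)".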

Friday, May 17, 2013

Migrate vSphere 4.1 to new host and fresher infra - follow up

Recently I moved an oldish 4.1 environment to a new base of operations. The process was fairly straightforward, but I think a few minor things are worth mentioning as a follow up.

Update Manager

So I did not disable the old Update Manager installation before moving the entire thing to a new host. I had already decided to go without UM for the new setup and had not spared a thought on the consequences. Furthermore when I had originally set up the old environment, I had failed to follow the best practices and had used IP addresses instead of FQDNs throughout the setup.

The result was a connection error because vCenter was continuously trying to connect to the old Update Manager installation. Searching around I could not find a good way to remove the UM binding, so I decided to walk "The Windows Walk": I installed UM again, overwriting the previous registration in the process, and then uninstalled it properly.

Connection error resolved.

In retrospect what I could have done is to enable the old UM service again, let vCenter connect to it and then uninstall it properly.

Performance Statistic Rollup

The other thing that sticks out when I check vCenter Service Status is this message:

Performance Statistics rollup from Past xxx to Past xxx is not occurring in the database

In my specific setup the notification claims roll ups are not available for the following durations:

- Previous day to previous week
- Previous week to previous month
- Previous month to previous year

VMware KB2015763 describes this issue for 5.x installations and advises enabling statistic rollups in Administration > vCenter Server Settings > Statistics.

Regardless of whether I use vSphere Operations Manager to accumulate the data now or not I am not happy about these warnings, even though they may not affect operations as such. As you may guess historical performance data is not available in vSphere for now.

Both the 5-minute and 30-minute interval rollups were already enabled. I added the 2-hour and 1-day intervals as well, to no effect.

Digging around some more I find a few possible reasons and explanations for the behaviour:

1. With SQL Express Edition, vpxd's internal scheduler handles the statistic rollups. My installation uses an Express Edition that was not bundled with the vSphere installer. Also, if you are using a full-blown SQL Server and the SQL Agent is not running, the installer supposedly reminds you to start the service, which in this case it did not.
2. Another issue might be KB1030819. I followed the instructions described there, as the datatype was reported to be "numeric" instead of "bigint".

After some rather tedious mucking around, trying to work out a way to automate the statistic rollups and running them by hand a few times over the past few days, I have decided to migrate the environment to a full-blown SQL 2008 server. I found out that our dev team has a fully licensed SQL server running that I may utilize. Our vSphere environment is small enough that no mutual performance impact is to be expected. Gladly, this migration will again be very easy and straightforward, and it will pave the way to move vCenter itself back to its original home base.

Cannot delete Portgroup - works as designed

This morning when I start work I notice all those emails about a new ticket in our trouble ticket system. The owner added me among other colleagues as a monitor. The issue is the following:

There are numerous virtual port groups that do not have any VMs connected (after a lengthy and mostly automated migration to a new naming scheme) but cannot be deleted. They show up greyed out in the vSphere client, and if you drill down, each of those port groups still shows associated VMs and templates. If you check the config of any affected VM it shows the new port groups only, yet the summary pane also lists the greyed-out old port groups. The proposed solution is a restart of the vCenter service, as there is an OS update pending anyway.

Cause

As I have seen this behaviour before on several, albeit very rare occasions, I wanted to take my chances and investigate. I went ahead to the first affected port group and found a template and a running VM associated with it. The template was an easy and logical case. My colleagues had migrated all VMs but had failed to migrate the templates yet. Convert it to a VM, change the port group association and away it went. Convert back to template and everything is fine.

The running VM however is a different case. Its settings showed no binding to the old port group. Because it's a customer's system I cannot just go and change things around wildly. I had a closer look and noticed an active snapshot. We have a policy that snapshots may not be kept longer than one week. However, that policy is not being enforced by any automation or audit trail. The Tasks & Events pane in my vSphere client was not able to tell me when the snapshot had been created... it was already beginning to mold and smell unpleasantly. Same goes for the other affected VMs.

And it makes perfect sense. If I want to go back to my original point in time - when I took the snapshot - I expect the VM to be in the same network (thus virtual port group). Thinking I might be able to get a list of VMs and port groups via a simple PowerCLI script I went to work and came up with this slightly ugly code:

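# $myDC and $myPgFilter must be set beforehand (datacenter name, port group name pattern)
# Collect the names of the old port groups we are interested in
$Portgroups = Get-Datacenter $myDC | Get-VirtualPortGroup | Where-Object { $_.Name -like $myPgFilter }
$PGNames = @($Portgroups | Select-Object -ExpandProperty Name)

# All snapshots of all VMs (the first version piped the undefined $VM here)
$VMs = Get-VM
$Snapshots = $VMs | Get-Snapshot

# Keep every VM that has a snapshot and a NIC on one of those port groups.
# Note: this looks at the VM's current adapters, not at the snapshot's.
$VM_with_Snapshots = @()
foreach ($SS in $Snapshots) {
    foreach ($NetAdd in $SS.VM.NetworkAdapters) {
        if ($PGNames -contains $NetAdd.NetworkName) {
            $VM_with_Snapshots += $SS.VM
        }
    }
}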

I'm sure there is massive potential for optimization; I'm not a coder, nor do I have much practice at the moment (things are about to change in the near future though). However, this approach did not yield the expected result, as the script does not return the previous port group assignment (the one still caught in the snapshot) but the current one. Pointers towards the proper results would be greatly appreciated.

Among the affected VMs there are, however, some test and infrastructure systems over which I have sufficient control, knowledge and privileges to test my snapshot theory. And of course, I was right.

I now want to refer to this very recent (re-)tweet and extend it to VCPs as well! :)

RT @josh_odgers: @grantorchard Business as usual, the VCDX comes in and cleans up the mess and ensures customer satisfaction :) > +1
— Michael Webster (@vcdxnz001) May 15, 2013

Monday, May 13, 2013

HP vPV - my own take

HP's Virtualization Performance Viewer, vPV for short, has been stirring up some dust in the past few weeks. I first learned about it from my good friend Amitabh, a passionate HP Infrastructure and Virtualization engineer with what seems to be a very broad view on current and up-and-coming technologies. When I first read his post I had just started playing around (and quite enjoying myself doing so too) with VMware's Operations Manager after hearing about it on several occasions, such as the 1st Singapore VMUG meeting of 2013. I dug around some and set up a demo installation to let vPV and Operations Manager go head to head, talked to one of the managers at my employer and just got to know what it is and what it does.

I held a small presentation to some of my colleagues including the vPV aware manager at my company's headquarters recently. We had a look at both products side by side and talked about first impressions and ideas.

Just minutes ago I read Ben's take on vPV which prompted me to say a few words about the product and my experiences with it as well.

vPV - The ugly truth

To understand what vPV is, I think it's quite helpful to know how it came to be. The aforementioned manager had a chance to talk to one of the HP guys at the last GPC in Las Vegas and revealed an interesting detail: HP's vPV was initially developed as a helper for their internal operations, more or less "by accident". They realized the potential and showed it off, got a few interested visitors and decided to release it as a product.

vPV - What it does!

As both Amitabh and Ben pointed out the installation is dead simple. Download and deploy the virtual appliance, point it at your vCenter and you're ready to roll. You get an instant and, depending on your environment, rather colorful picture.

vPV supports Hyper-V (though not tested by any of the three of us, as far as I can tell) and vSphere. It offers a general overview and drill down capabilities and makes good use of HP's uCMDB to visualize the structures of your environment and qualities and properties of your managed objects. It allows you to access the very same real time metrics that can be found in the vSphere client, with the added bonus of 24 hours retention (for the free edition, 30 days for the licensed product) as opposed to just 1h. You get every last value (including the infamous CPU Co-Stop) in a cute little graph, can arrange them to correlate issues in the workbench or just rely on the dashboard to get the grand total.

To me it's easier and, as Ben pointed out, a lot snappier than using the vSphere client to access performance graphs of your vSphere environment. A major benefit is the ability to easily place half a dozen graphs on one screen.

vPV - What it does not!

Same as the vSphere client itself, strictly speaking, it's not a real-time performance analysis tool (or rather visualizer). vPV retrieves its data from vCenter. vCenter retrieves its data from the connected ESXi hosts, which in turn retrieve data from the VMware tools in each VM and measure the individual VM's performance metrics from the hypervisor side. This chain alone adds multiple delays, and the overhead of drawing a fancy graph (and possibly using a lot of RMS to connect the dots without making the graphs bounce up and down like crazy) is not even included yet. To visualize the data and make it humanly processable, true real time is out of the question, as with most other tools and helpers in this field. You rather get the past 24 hours of everything that was going on, in nicely smoothed lines.

Unlike vCOPS, vPV does not process the data to generate what I like to call a "management compatible dashboard". vCOPS, even the free edition, stands out for its health level visualization. It will correlate CPU ready times and the number of assigned vCPUs with a host's workload and generate dynamic thresholds, where vPV will only show the individual values. Thus vPV will not free you from analysing dozens of metrics on multiple objects to find the true source of an ongoing performance issue.

On the side there is one major issue with the free edition: it does not support any means of access control. By default it is open to anyone and everyone within your network. As it allows you to view the entire vSphere environment (up to 200 managed objects in the free edition, of course), it bypasses your vSphere permissions and provides an interesting source of information for the sneaky attacker in your own network. So be sure to at least work out some sort of .htaccess protection.

Unlike vCOPS, it does support Hyper-V in the free edition. However, to my understanding the paid version does not integrate AWS (and potentially other cloud vendors and OSes) at the moment.

At the end of the day ...

vPV is great if you're serious about vSphere performance analysis and troubleshooting, and a good helper when the vSphere client itself does not provide the data you're looking for. I agree with Ben's summary that it's a great addition to your preferred set of tools and, being free for 200 managed objects, it should be on any aspiring VMware admin's list of favourites.

HP has done a great job in putting their modules together to aggregate already available data and visualize it in an attractive manner. However I have my doubts that this tool alone helps "to rapidly analyze bottlenecks", as they put it. In my very humble opinion you have to already know what you're looking at (and for) to make good use of vPV.

In that sense it is no real competition to vCOPS, it simply plays in a different field. Integrate it with BSM and I'm sure a trained and able HP consultant will be able to generate reports and dashboards that will blow your mind. If you're looking for an all-in-one multi-purpose solution, vCOPS might be the better choice. vPV itself may evolve somewhat, but as a stand-alone tool it can and will only be one out of many to assist you on your virtual performance troubleshooting journey. It sure should come in handy when you train yourself to be a VCAP.