Oracle (and when the SUN doesn't shine anymore)

I'm not hiding the fact that I've been a SUN Microsystems fan all my life. They had great products, a great engineering philosophy and, best of all, great people who knew how to pick a potato. The problem was that they went down the same path as DEC (huhhh, who…???): Digital Equipment Corporation, one of those other fabulous engineering companies that fell prey to the PC world due to a lack of marketing knowledge and sales strategy. Google around for that.

Oracle was by far the worst company to acquire SUN. They have a massively different company mindset, 100% focused on getting another boat for Uncle Larry ("I want you… o, no… I want your money"), and this went on to be a head-on collision with the SUN philosophy. The fact that Oracle had a massive war chest while SUN was struggling to stay afloat allowed them to pick up the entire SUN IP for a nickel and a dime.

The worst thing for Oracle was that all of a sudden they inherited a hardware division with, let's be honest, great products, but also a huge drag on sales numbers (which was likely the reason for SUN's struggles). There was no easy way out here since product support and near-term roadmap line-ups had to be fulfilled. Oracle is, has always been, and always will be a software company, so over the last couple of years you could already see the majority of the hardware products being starved to death. Don't expect any new developments here.

SUN was bought for two reasons: Java and Solaris. Well, only certain parts of Solaris: COMSTAR was one of them and ZFS the other. Java of course was the biggest fish since that piece of the pie runs in almost every device on the planet, from cell phones to toasters. ZFS allowed Oracle to create Exadata and tailor it to very specific workloads (Duhhhh… lemme guess: Oracle databases). The funny thing is that they almost give away this Exadata box since they know it only performs well with their database, and that is where you start paying the big bucks.

So let's get back to what is left of SUN. SUN was also a very big supporter of the open-source world. Projects like OpenOffice, NetBeans, GlassFish etc. are all neglected by Oracle and left to die a certain death. OpenOffice (originally acquired by SUN as StarOffice) had a really nice spin since some developers had absolutely no trust in Oracle anymore and forked the entire code branch into LibreOffice, which is now the most actively maintained office suite outside M$ Office. Oracle is half-heartedly hanging on to MySQL and allows some people to put some effort into that project. The reason is obvious: it's a stepping stone to one of Oracle's own big-bucks databases and suites, so the biggest sponsoring goes into migration software from MySQL to Oracle DB itself. If Oracle decides to pull the plug on MySQL it will simply be forked as well and continue under another name where Oracle has absolutely no insight and loses any business advantage. Don't ever expect any Oracle IP going into MySQL. Larry needs a bigger boat.

Another product SUN "donated" to the open-source world was OpenSolaris, a free (as in free beer) spin of SUN's mainstream operating system. SUN's intention with OpenSolaris was to provide a free platform giving developers easy access to Solaris. This would allow more applications to become available and, as such, a larger ecosystem to grow among the companies using them. The step up to a revenue-generating operating system for those applications would then be really easy. A similar approach to the one Microsoft has followed for quite a while: provide a really cheap consumer product for developers to hook onto and sell at a premium to companies. Unfortunately it wasn't meant to be: as soon as Oracle took over, the OpenSolaris project was starved to death.

So when taking into account everything that happened with the SUN acquisition, it is very sad to see such great products and such a great philosophy butchered by pure greed. Many distinguished engineers like James Gosling, Tim Bray and Bryan Cantrill left immediately and many more followed. The entire Drizzle team resigned, as well as all of the JRuby engineers. In fact the only SUN-blooded executive to stay was John Fowler, who held onto his hardware group.

In retrospect the only thing Oracle bought for that 5.6 billion dollars is Java which is a very heavy price for a piece of software (soon to become obsolete) and an empty shell.

This once more shows that great products will always lose against marketing, an effective sales force and a money-hungry CEO.

Don't get me wrong, I'm not against someone making a fair chunk of money, but effectively killing off an entire company and leaving so many people out in the cold doesn't really show any form of ethics. A good friend of Larry, the late Steve Jobs, had similar characteristics; however, he also had a heart for great products.

Regards,
Erwin

PS. The comment about Java becoming obsolete is because many major new web technologies are now being put in place to bridge the gap to Java. This includes semantics, document and extensive data control, device control etc. Within 5 years Java will likely have a serious competitor which allows developers to gain more freedom and interoperability than Java can now provide.

tcsd service failed to start

Fedora comes with the option to have tcsd installed. Well, it's not really an option: it apparently installs by default. This got me a bit baffled when I saw failed services on an almost brand-new PC.

So what exactly is this tcsd service?

It turns out that tcsd is a user-space daemon that interacts with the TPM kernel module, which is needed for hardware-provided encryption services. For this you need an actual TPM chip, and since I don't have one (nor am I likely to need one in the near future) I'm fine turning this off with "systemctl disable tcsd.service".
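A minimal check-and-disable sketch (assuming the TPM, if present, shows up under /sys/class/tpm and the unit really is called tcsd.service):

# check whether a TPM device is present at all
ls /sys/class/tpm/
# nothing there (or no need for it), so stop and disable the daemon
systemctl stop tcsd.service
systemctl disable tcsd.service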

The tcsd service is a small piece in the overall Trusted Computing Platform stack of solutions. The overall goal is to have a piece of hardware providing encryption services to all levels of the computing stack. The idea is to have a separate, bulletproof section in the system providing a trust chain that does not rely on memory and storage. This prevents rootkits and other types of malicious stuff from infecting your system. By system I don't specifically mean a PC or server, since the stack is meant to be open to all sorts of equipment; if you need to secure your toaster you could potentially do so. You'll also find the TPM architecture used by companies like Hitachi, Boeing, Cisco and Microsoft. From a storage perspective TPM also plays a role in the SNIA Storage Security Industry Forum.

The overall specification is outlined by the Trusted Computing Group. A fairly large group of companies who define and contribute to the specification and develop products for this specific purpose.

Many open-source resources exist on the web, but for the best start go to the above-mentioned link. The TrouSerS libraries are the Linux open-source interfaces, mainly developed by IBM with help from many around the world.

See http://trousers.sourceforge.net

This page provides a short overview of what sits where in the TCG stack.

What I don't know (yet) is where this all might fit in the UEFI discussion Microsoft started a while ago. Either the two complement each other or you'll have conflicts. Don't know yet. Might be worthwhile investigating.

Cheers,
Erwin

NVIDIA card and Nouveau

So with the new box I ordered an NVidia GeForce GT 640 graphics card. I need some desktop real estate and thus a very high-resolution card. This one sits nicely in the middle from a price and performance perspective.

Since a couple of kernel versions ago Linux comes with the open-source Nouveau drivers, which are the alternative to the official NVidia drivers that are still closed source. I'm not the kind of guy who buys a very good piece of machinery and then lets it be crippled by incomplete drivers. (No offence to the Nouveau developers; it's not their fault NVidia doesn't play nice with the open-source world.) So I do want to use the official drivers, but that runs you into a problem since the Nouveau drivers are loaded by default.

This calls for some blacklisting, so you add a new file called blacklist-nouveau.conf in /etc/modprobe.d with a one-liner:

blacklist nouveau

This prevents the nouveau driver from being loaded at boot time. At least that’s what you think 🙁

Then install the official NVidia driver with “yum localinstall “.

It turns out that the nouveau driver is also included in the kernel boot image (the initramfs), so you have to copy or rename that one and use dracut to create a new one which takes your blacklisted nouveau driver into account:

#> dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
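To double-check that the freshly built image no longer contains the nouveau module you can list its contents with lsinitrd (part of dracut); if the command below returns nothing, the module is gone:

#> lsinitrd /boot/initramfs-$(uname -r).img | grep nouveau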

Then reboot the system once more and you're done.

lsmod now shows you a line like this:
nvidia              11262717  41
and the nouveau driver is out of the picture.

Cheers
Erwin

Some disk settings I adjusted

Given the fact that I now have an SSD running the /boot and root partitions, I do want to make the most of it. So in order to improve things and keep that improvement over time I did the following:

I first reduced the amount of "swappiness" to the minimum. The box has 16 GB of RAM so I have enough headroom, plus I moved the swap partition to the spinning disk.

In sysctl -a:

vm.swappiness = 1
vm.vfs_cache_pressure = 100
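To make these values stick across reboots they can go into a sysctl drop-in file, e.g. /etc/sysctl.d/90-ssd.conf (the filename is just an example) containing the two lines above, and then be loaded straight away without a reboot:

[root@monster ~]# sysctl -p /etc/sysctl.d/90-ssd.conf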

I enabled the discard option on the ext4 filesystems to enable TRIM in order to free up blocks upon release (a quick way to verify the drive actually supports TRIM is shown a bit further down).

In fstab:
/dev/mapper/vg_monster-lv_root /                       ext4    defaults,discard        1 1
UUID=3de72813-da36-4a6e-89e1-4805b0fc03ea /boot                   ext4    defaults        1 2
/dev/sdb1             swap                    swap    defaults        0 0

So the vg_monster-lv_root sits on the SSD drive and the swap space + /home partition on the spinning rust.

There are two reasons for this.
1. I can monitor the rotating disk for increasing faults. By default any spinning disk has some spare blocks, so it can either try to rewrite a failing block to a good one or just mark the block as bad, meaning I would most likely lose just one block or sector.
2. SSDs don't really have that option of marking a single block as bad; most likely an entire cell fails, which in general will brick the disk. I can rebuild an OS fairly quickly, but my home drive with all settings and data is a much larger piece of work. In addition it's much easier to rsync a single directory than the entire box to another medium. 🙂
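Back on the discard/TRIM side: before relying on that mount option it's worth verifying the SSD actually advertises TRIM support. Assuming the SSD shows up as /dev/sda, a quick check is:

[root@monster ~]# hdparm -I /dev/sda | grep -i trim

which should report something along the lines of "Data Set Management TRIM supported" for a drive that can handle it.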

In addition I changed the default CFQ scheduler to deadline on the SSD in order to get optimal queue handling and deadlines on read/write operations. This scheduler prevents processes from having to wait too long on requests issued by other processes, which would cause them to time out.

[root@monster ~]# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
[root@monster ~]# cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]
[root@monster ~]#
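The scheduler can also be flipped at runtime by echoing the new value into that same sysfs file (sda being the SSD here), for example:

[root@monster ~]# echo deadline > /sys/block/sda/queue/scheduler

but that obviously doesn't survive a reboot, hence the udev rules below.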

I added some udev rules to sort this out on boot:

[root@monster ~]# cat /etc/udev/rules.d/60-disk-scheduler.rules
# set deadline scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"

# set cfq scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="cfq"
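If you don't want to wait for a reboot, the rules can be reloaded and re-triggered by hand; something like this should do the trick:

[root@monster ~]# udevadm control --reload-rules
[root@monster ~]# udevadm trigger --subsystem-match=block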

Some more to come when I figure some stuff out.

Cheers
Erwin

dot desktop in Gnome

OK, so this one got me going for a while. Yes, I did not read the Developer and Administrator guides. Maybe I should have.

This week I received a new PC with some serious grunt. Boot time on Fedora 17 takes roughly 4 seconds, including a shitload of daemons.

I also did not want to lose any of my settings and data, so I rsync-ed the entire ~ folder from my old PC to this one. Besides the usual packages that are installed I also have some seriously modified settings, but one of the most annoying things I could not figure out was that many icons in the Gnome grid were showing up as square boxes, and I was also missing some other icons I would have expected to be in the grid. On any normal interface you right-click and get presented with a dialogue box which lets you add/remove/muck-up these icons. Not so in Gnome Shell. It turns out you have to do this by hand by adding so-called "xxx.desktop" files in the ~/.local/share/applications folder. Most app packages provide this file and drop it in there, but if you have some which don't, just copy and modify an existing one (see the example below).
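A minimal example of such a file; the name, Exec and Icon values are of course placeholders for whatever application you're adding:

# ~/.local/share/applications/myapp.desktop (example)
[Desktop Entry]
Type=Application
Name=My Application
Exec=/usr/local/bin/myapp
Icon=myapp
Terminal=false
Categories=Utility;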

I do seriously hope the Gnome devs will sort this out asap since this looks like going back to the stone age.

Cheers
Erwin

D-Link System DWA-131 802.11n Wireless N Nano Adapter(rev.A1) [Realtek RTL8192SU]

With Fedora 17 you have to be using some older firmware. I checked and there is a difference in size. The "file" command doesn't give anything useful in return; however, the working firmware needs to be 129304 bytes in size. The one that comes with one of the Fedora 17 updates seems to overwrite it with one of approximately 129095 bytes. If the incorrect one is loaded you will not get an error message and the driver seems to load just fine, however there is no way in hell you'll be able to get any traffic going.
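A quick way to check which one is currently installed (assuming the firmware lives in the usual /lib/firmware/rtlwifi/rtl8712u.bin location used by the r8712u staging driver; adjust the path if your distro puts it elsewhere):

ls -l /lib/firmware/rtlwifi/rtl8712u.bin

and compare the reported size against the 129304 bytes mentioned above.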

The driver at this point in time (Oct 2012) is still coming from the staging area.

I haven’t been able to figure out where the “r8712u: [r8712_got_addbareq_event_callback] mac = 24:65:11:57:2e:82, seq = 0, tid = 1” message comes from.

Will dig into the source code when time permits.

Cheers.
Erwin

The great misunderstanding of MPIO

Dual HBAs, dual fabrics, redundant cache, RAID-ed disks, dual controllers or switched matrices, HA X-bars, multipath software installed, and all OS drivers, firmware, microcode etc. up to date. In other words, you're all sorted and you can sleep well tonight.

And then Murphy strikes……..

As I've described in my previous articles, it takes one single misbehaving device to really screw up a storage environment. Congestion and latency will, at some point in time, cause FC frames to go into the bit bucket, hence causing one or more IO errors. So what exactly is an IO error?

When an application wants to read or write data it does this (in the open-systems world) via a SCSI command (I'll leave the device-specific commands for later). This command is then mapped at the FC-4 layer into FC frames which travel via the FC network to the target.

So let's take for example a database application that needs to read a piece of data. This is never done in chunks of a couple of bytes like single rows; it is always done with a certain size, which depends on the configuration of the application. For argument's sake let's assume the database uses 8KB IO sizes. The read command issued against a LUN at the SCSI layer more or less outlines the LUN id, the offset and the block count from that offset. So for a single read request an 8KB read is done on the array. Since a fibre channel frame holds only about 2KB of payload, this IO is split into 4 FC frames which are linked via so-called sequence IDs (I'll spare you the entire handling of exchanges, sequences etc.).

So if one of these frames is dropped somewhere along the way, we're missing 2K out of the total 8K. This means that, for example, frames 1, 2 and 4 have arrived back at the HBA, but before the HBA can forward the data to the SCSI layer it has to wait for frame 3 to arrive in order to re-assemble the full IO. If frame 3 was dropped for whatever reason, the HBA has to wait for a pre-determined time before it flags the IO as incomplete; it will then mark the entire FC exchange as invalid and send an abort message with a certain status code to the SCSI layer. This triggers the SCSI layer to retry the IO, which consumes the same resources on the system, FC fabric and storage array as the original request. You can imagine this can, and on many occasions will, cause performance issues or, with even more subsequent occurrences, an application failure.

Now, when you look at the above traffic flow, at no point has there been a single indication that the actual physical or logical path between the HBA and the storage port has disappeared. No HBA, storage or switch port has gone offline. The above was just the result of frames being dropped due to congestion, latency or any other reason. This will not trigger any MPIO software to logically remove a path, and thus it will just keep on sending IOs to the target over a path that may be somewhat erroneous. Again, it is NOT the purpose of MPIO to monitor and act upon IO errors.

If you are able to identify which path observes these errors, you can disable that path in the MPIO software and fix the problem path at your earliest convenience. As I mentioned above, this kind of behaviour very often occurs during Murphy time, i.e. during your least convenient time. This means you will get called during your beauty sleep at 3:00 AM with a message that your entire ERP application is down and that 4 factories and 3 logistics distribution centres are picking their nose at $20,000 a minute.
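As an illustration with Linux native dm-multipath (just one of many MPIO implementations; the vendor packages mentioned at the end have their own equivalents), manually taking a suspect path offline and bringing it back would look roughly like this, with sdg as a made-up path device:

multipath -ll
multipathd -k"fail path sdg"
multipathd -k"reinstate path sdg"

The first command lists the paths per multipath device so you can spot the offending one, the second fails it so IO only uses the remaining paths, and the third reinstates it once the link has been fixed.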

So what happens when a real path problem is observed? Basically it means that a physical or logical issue occurred somewhere down the line. This can be a physical issue like a broken cable or SFP, but also a bit- or word-synchronisation issue between two ports in that path. This will trigger the switch to send a so-called RSCN (Registered State Change Notification) to all ports in the same fabric and zone as the one that observed the problem (this also depends on the RSCN state registration of those devices, but that is 99% of the time OK). This RSCN contains all affected 24-bit fabric addresses (there can be more than one, of course, when ISLs are involved).

As soon as this RSCN arrives at the initiator, the HBA will disassemble it and notify the upper layer of the change. This is done with different status codes than the IO errors I described above. Based upon the 24-bit fabric IDs, MPIO can then determine which path to that particular target and LUN was affected and take it offline. There can still be one or more IO errors, depending on how many were in flight at the time of the error.

So what is the solution? As always, the best way is to prevent these troublesome scenarios in the first place. Make sure you keep an eye on error counters and immediately fix misbehaving devices. If for some reason a device starts to behave this way during your beauty sleep, you need to make sure beforehand that it will not further impact the rest of the environment. You can do this by disabling a port on either the switch, the HBA or the storage array, depending on where the problem is observed. Use the tools built into software like NX-OS or FOS to identify these troublesome links and disable them with features like port fencing. Although it might still have some impact, this is nothing compared to an ongoing issue which might take hours or even days to identify.
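On a Brocade FOS switch, for instance, the usual sequence would be something like the commands below (shown purely as an example; port 12 is a made-up port number, and NX-OS has equivalent interface counter commands):

porterrshow
portshow 12
portdisable 12

porterrshow gives you the per-port error counters (CRC errors, enc-out errors, link failures and the like), portshow shows the detail of the suspect port, and portdisable takes it out of the fabric until it is fixed.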

As always use the manuals to determine how to set this up. If you’re inside the HDS network you can access a tool I wrote to very easily generate the portfencing configuration. Send me an email about this if you’re interested.

Hope this explains a bit of the difference between IO errors and path problems w.r.t. MPIO and removes some of the confusion about what MPIO is intended to do.

Kind regards,
Erwin

P.S. For those unaware, MPIO (Multi-Path IO) is software that maps multiple paths to a target and its LUNs onto a single logical entity on the host so it can use all those paths to address that target/LUN. Software like Hitachi Dynamic Link Manager, EMC PowerPath, HP SecurePath and Veritas DMP falls into this category.