
IOUG Podcast 21-JUL-2012: A Disastrous Day in the Life of a Cloud DBA / Cloud GPS

For the week of July 21st, 2012:

  • How Many Ways Can a Cloud Die (besides evaporating)?
  • Keep Yourself from Getting Lost in the Clouds



Last week, we briefly examined some of the elements of the changing world of the DBA. The week before that, Salesforce.com suffered its second outage in two weeks, leading some to discover the fault lines within the touted seamless and endless world of cloud computing architecture. This week we take a closer look at the points of failure that occurred during those outages.

Back in 2008, Force.com released a white paper detailing the design and architecture of its back-end infrastructure, which blogger Paul O’Rorke noted at the time as having “8 DBAs supporting 150,000 customer databases running inside of only 10 Oracle databases. They scale up in part by adding a new Oracle database for every additional five thousand customers.” (Currently that translates to 14 instances, per the Force.com dashboard – http://trust.salesforce.com/trust/status/) We can extrapolate that those were probably either 10g or 11g databases at the time, with RAC and Data Guard enabled to provide real-time replication between Force.com’s three primary datacenters.
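As a rough illustration of what that kind of RAC-plus-Data Guard setup looks like from the DBA’s chair, queries along these lines show how one might confirm the cluster’s instances and each database’s Data Guard role. This is a minimal sketch against standard Oracle dynamic views; the actual Force.com configuration has not been published.

  -- Which RAC instances are up, and in what state?
  SELECT inst_id, instance_name, host_name, status
    FROM gv$instance;

  -- What role does this database play in the Data Guard configuration,
  -- and is it ready to switch over?
  SELECT database_role, protection_mode, switchover_status
    FROM v$database;

  -- On a physical standby: is managed recovery (redo apply) actually running?
  SELECT process, status, sequence#
    FROM v$managed_standby
   WHERE process LIKE 'MRP%';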

/* Developer side-note: Force.com employs a technique of using embedded metadata definitions instead of data definition language (DDL), allowing the re-use of existing data structures for multiple customer purposes without object re-definition. This technique is commonly found in many multi-purpose application frameworks, such as PTC’s Integrity or, conceptually speaking, even Apache Hadoop. This is how multiple customers with multiple data structural requirements can share a single instance. */
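To make that side-note concrete, here is a greatly simplified sketch of the general metadata-driven pattern – every table and column name below is a hypothetical illustration, not Force.com’s actual schema. One generic, shared data table holds every tenant’s rows, and a metadata table describes how each tenant’s logical fields map onto the generic columns, so new “objects” can be added without issuing any DDL.

  -- Hypothetical, simplified multi-tenant schema for illustration only
  CREATE TABLE app_objects (
    tenant_id    NUMBER        NOT NULL,   -- which customer owns the row
    object_type  VARCHAR2(30)  NOT NULL,   -- logical entity ("Account", "Invoice", ...)
    object_id    NUMBER        NOT NULL,
    value0       VARCHAR2(4000),           -- generic "flex" columns re-used per tenant
    value1       VARCHAR2(4000),
    value2       VARCHAR2(4000),
    CONSTRAINT app_objects_pk PRIMARY KEY (tenant_id, object_type, object_id)
  );

  CREATE TABLE app_field_metadata (
    tenant_id    NUMBER        NOT NULL,
    object_type  VARCHAR2(30)  NOT NULL,
    field_name   VARCHAR2(30)  NOT NULL,   -- tenant's logical field name
    column_name  VARCHAR2(30)  NOT NULL,   -- which flex column stores it (value0, value1, ...)
    data_type    VARCHAR2(30)  NOT NULL    -- how to interpret the stored string
  );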

The June 28, 2012 outage was attributed to a storage subsystem failure, according to a summary published by ZDNet’s Editor in Chief, Larry Dignan. Ten years ago, block corruption at the disk level tended to throw a tablespace offline or render various data objects unreadable, problems which database software solved by using Error Correcting Code (ECC) technology, allowing online rebuilds and background recovery to occur transparently. So what kind of storage subsystem failure causes a modern database application to become unavailable? Well, short of awaiting Force.com’s eventual root cause analysis, in this situation we would probably focus on what was reported on their dashboard – that a single instance (NA2, in this case) became unavailable. One cause that would take an entire instance down is SYSTEM tablespace corruption, wherein the database reverts to a state allowing only local connections by the SYSDBA to perform the recovery steps required to re-open the instance. But we know from the architectural design that a redundant standby fail-over instance was in place. So what else happened that required over 4 hours for that fail-over to occur?
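As a hypothetical illustration of what that first diagnostic hour might look like, a DBA suspecting corruption could check Oracle’s corruption view and, if the SYSTEM tablespace itself were damaged, restore it with the database mounted. This is a sketch of standard RMAN and SQL steps, not a claim about what Force.com actually did.

  -- Any blocks flagged as corrupt by RMAN VALIDATE or by normal reads?
  SELECT file#, block#, blocks, corruption_type
    FROM v$database_block_corruption;

  -- If SYSTEM itself is damaged, the database must stay mounted (not open)
  -- while the tablespace is restored and recovered from backup, e.g. in RMAN:
  --   RESTORE TABLESPACE SYSTEM;
  --   RECOVER TABLESPACE SYSTEM;
  -- ...after which the instance can be re-opened:
  ALTER DATABASE OPEN;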

  • Listeners might be configured not to be fail-over aware (connections to the original hostname stop dead without re-routing)
  • The fail-over database encountered errors during switch-over
  • The storage subsystem replicated block corruption at the data level to both instances (meaning whatever stopped the primary also stopped the standby database)
  • The redo logs had stopped applying on the standby database, requiring additional log restore and recovery (see the sketch following this list)
  • The application middleware was not configured to switch to the standby connections
  • Routing to the standby database was available at the database level, but not at the middleware level
  • And so on…
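The redo-apply scenario above is one of the easier ones to rule in or out from the standby itself. Queries along these lines (illustrative only; these views are standard in 10g/11g) show whether redo is arriving and being applied:

  -- On the standby: how far behind are redo transport and redo apply?
  SELECT name, value
    FROM v$dataguard_stats
   WHERE name IN ('transport lag', 'apply lag');

  -- Which archived logs have been received but not yet applied?
  SELECT sequence#, applied
    FROM v$archived_log
   WHERE applied = 'NO'
   ORDER BY sequence#;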

We like to think of modern-day information systems as well-engineered aircraft carriers, but more often they resemble a Jenga tower: many precisely engineered pieces that are very stable until you remove the one key piece that collapses the tower. Clusters are supposed to be our local datacenter protection against host failures, but what happens when a component that is not cluster-aware fails? In Force.com’s very robust architecture, there are 14 different instances that you might be connected to during your client/server session negotiation, probably with front-end load-balancing appliances (e.g. F5 Networks or Riverbed) in between, negotiating and managing your connection to the middleware. Are those appliances clustered as thoroughly as the back-ends they serve, and were they able to switch active connections to an alternate instance?
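One simple sanity check from the client side of that chain – a generic sketch, not specific to Force.com’s environment – is to ask the session itself which instance and host finally answered, before and after a fail-over drill:

  -- Which instance, host and service actually served this connection?
  SELECT sys_context('USERENV', 'INSTANCE_NAME') AS instance_name,
         sys_context('USERENV', 'SERVER_HOST')   AS server_host,
         sys_context('USERENV', 'SERVICE_NAME')  AS service_name
    FROM dual;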

Our world of applications has become much more complex than just SQL clients attaching to databases. A block-level denial of service might cause middleware components to go off-line, taking down some or all of an application’s availability. The more recent Force.com outage, on July 10, 2012, was reported by InfoWorld’s Chris Kanaracus as stemming from power outages affecting 3 primary North American instances. In that situation, the power outage probably exhausted the available on-site UPS systems, and the datacenter might not have had power generators, or sufficient fuel for those generators, to supply power throughout the extended outage.

So we can presume in this case that connectivity to one or two of the primary datacenters was lost entirely due to the power outage. Because the entire Salesforce.com Application Store became unavailable to users, there must be an architectural fault either in the redundancy of the systems providing the application catalog or portal, or in the systems providing access to that portal (perhaps single sign-on, load balancing, or home page websites). The dashboard log indicates that actual recovery was required for these systems, which suggests that no redundant system was available to take over the role of the one that had failed.

Also noted in Kanaracus’s article was the failure of shared infrastructure components between the non-production instances and Salesforce.com’s primary Application Store. It is unclear whether those components were non-critical shared elements, but their loss of availability did end up affecting the availability of the entire application suite. Components that might fall into this shared category include DNS (hostname lookups fail), LDAP (sign-on authentication fails), or even a shared hardware switch (routing fails).

When designing cloud architectures, the focus is not only load capacity and scalability, but also global availability and reliability based upon redundancy and/or re-purposing (the ability of a host to take over another host’s application services). When drafting your disaster recovery scenarios, do not forget to include basic elements such as point-of-entry appliances and firewalls, authentication systems, and even redundant monitoring to inform you when something has failed. And practice your disaster recovery under controlled conditions before the disasters happen. In other words, don’t forget to “pull the plug” and examine how to recover from it.
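For the database tier, a “pull the plug” rehearsal often takes the shape of a planned Data Guard switchover. The commands below are a greatly simplified sketch of the classic 10g/11g SQL*Plus sequence, assuming a healthy physical standby; every site’s runbook will differ.

  -- On the current primary: hand the primary role to the standby
  ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;
  -- (then SHUTDOWN IMMEDIATE and STARTUP MOUNT the old primary as the new standby)

  -- On the standby being promoted:
  ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
  ALTER DATABASE OPEN;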

Oracle’s latest Engineered Systems offerings come packaged with real-time monitoring systems that attach you to this same cloud of complexity. How will you begin recovery if your new Fusion Applications Exadata system becomes disconnected from My Oracle Support because your local telco provider experiences a network outage? In this constantly changing world of technology, IOUG continues to pursue excellence in education to help you maintain your ability to succeed even when your infrastructure is floating in the clouds.