Causes of Data Loss: Are You Sure You Can Recover?
-
Posted by Harry E Fowler
- Last updated 6/27/19
- Share
By Tim Boles | Edited by Michael Gangler
There have been studies done by various researchers and companies on the causes of data loss. Generally, these lists include hardware failure, human error, software corruption, and computer viruses. The intention of this article is to provide you a guide and thoughts on many situations you might encounter. Since everything depends on your unique system and circumstance, this is just a springboard for you to investigate your own system and solutions.
Causes of Data Loss
Hardware Failures in Real Life
Let us take into consideration a few hardware failures that I have encountered and some things to consider.
A power surge took out the server and hard drives with the binaries of the database software.
- Do you have copies of those binaries?
- What patches have been installed since the last backup?
The cooling system fails for the server room overnight. The room gets so hot that the SAN system crashes. When the SAN system has rebooted, the drives that house the redo logs fail to come up.
- Are all your redo logs on the same SAN?
- Are all your redo logs on the same disks of the SAN?
RMAN repository database is corrupted.
- Did you back up your repository?
- Do your scripts work without a repository?
- Do you know how to backup/restore without it?
Only Human
The second most prevalent cause of data loss is human error. This makes up approximately one-quarter of the reasons for data loss. It does not matter how careful you are, how meticulous, or change management conscious, humans are ultimately that, human, and make mistakes.
A very stringent change control process can only go so far. I worked on one system that a shell script had been written that would transverse the directory tree and delete files that fell within a category. It worked fine on the test/development system, but when it was executed on the production server it deleted several database files. Now you might say to yourself, “Any database administrator worth their salt should be able to quickly restore those database files and have the system up and running in a few minutes.” What if that data file belonged to a Read-Only tablespace? That can have a lot of implications depending on your backup procedures.
We could go on all day with different scenarios and your ability to recover from them but that is not the purpose of this article. We want to be sure you are ready to recover. However, to recover you must have a good backup.
Backup Basics
What to Backup?
You may think your only job is to be sure that the data files are backed up. But there is so much more that needs to be considered. Don’t work in a bubble. Work with other administrators, application developers, and system engineers to make sure you understand your system architecture and what might need to be restored to make the system pristine again.
Get a system overview and then start brainstorming all the connected pieces and who is responsible for making sure everything is backed up consistently. Communication is key here. I worked in one environment where the SAs were responsible for non-database file system backups and the DBAs for the database backups. The problem was that the SA’s thought that the Oracle Home would be classified as part of the database backup. When the database server needed to be restored who do you think was held accountable for the missing Oracle Home?
Backing Up Non-Datafiles
Here is a list of a few non-Datafiles that might be important to your situation. There might be many others. Do you know how often the following are being backed up, who controls the backups and most importantly how to get them restored?
- Oracle Software Home (binaries)
- BFILES
- Password Files
- pfiles (spfiles are covered with newer versions)
- tnsnames.ora
- listener.ora
- sqlnet.ora
- /etc/oratab
- scripts (shell, SQL)
- Encryption keys (wallets)
- Application files that need to be in sync with the database.
- Oracle Gateway software and configuration
- OraInventory
- Cronjobs (or other scheduling software)
Multitenant Architecture
The multitenant architecture of Oracle 12c introduces additional considerations. One example is PDBs do not have individual archive logs. The archive logs belong to the CDB. It is this type of change requires reviewing and testing of the current backup and recovery procedures to make sure they can still be used.
Be A Boy Scout
The Boy Scout motto is “Be Prepared!” That is exactly what every DBA should adhere to when it comes to data safety. When you think about your ability to restore a system are you prepared? What is your source for information to restore your database?
- Memory / Experience
- Oracle Documentation / Books
- Internet Search Engines
- Co-worker
- Monitoring Tools (i.e. Oracle Enterprise Manager)
The problem with those sources is that they may not be available, be outdated, or not relevant to your environment. There is no reason to be racking your brains trying to remember commands, system architecture, or procedures when you are in the middle of a crisis and need to restore a system quickly. You should have backup and recovery documentation detailing your system and another with procedures to address the more common restore needs.
There are so many good reasons for creating and maintaining accurate documentation.
- Not dependent on a single person being available
- Boosts ability to concentrate – not sweating the small stuff
- Gain experience and knowledge
- Refine backup/restore procedures
Backup & Recovery Documentation
There are two types of B&R documentation. The first is a system overview and the other is recovery support for emergencies.
System Overview
- Backup implementation – procedures / rational
- Physical
- Hot/Cold
- Full/Incremental
- Exports
- Full
- Schema, Table, (Transportable) Tablespace
- Non-datafile backups
- Application tires
- Scheduling – (who/how/change control)
- Retention policies (time and Off-site Location)
Database Specifics
Many times when reviewing this type of information for a system I often ask around, ”Why did they set up things like that? ” Often to be answered, “Not sure it was that way when I got here.” That is so frustrating when you see things that seem to be better solutions but must research to find if there was a legitimate reason it was not already implemented. It is important that the reasons for decisions be documented if they are more than just technology based. I worked on one system where they had to destroy backups every night because of data retention policies. You should note those policies and any supporting information so any “new” person to the system will have some type of reference.
Recovery Support
A backup is only good if you can use it to restore. The time it takes to restore can be drastically reduced if you have the procedures already in place and know they work. Get some stakeholders together and brainstorm some scenarios that might cause data loss on your systems and test those scenarios. Your recovery support documentation for each of these scenarios should include at least the following:
- Outage Procedures – who is in charge / what groups are involved / steps to take
- Script listing – location / usage / execution syntax / description
- Test Documentation
- Recovery scenarios tested
- Documented restore procedures
Test Documentation
Take the time and record the steps as you go through a recovery scenario. This helps you learn things you don’t know, know the systems better and builds a blueprint if the scenario ever truly occurs. However, remember things always change and you will never be able to document everything that can go wrong, but you will at least have a better chance of dealing with problems when they do occur.
Take the time to document it and have another DBA review the general procedures and then the restoration procedures with your instructions. Make sure those procedures are tested periodically and specifically with major version changes. I speak from experience. I created a B&R test documentation on a system and went on to other projects. When I returned to that project a few years later one of the DBAs got onto my case saying that they had problems and attempted to use the documentation to recover and it failed. After talking to her I found three things:
- No one had “practiced” with the procedures since I had left
- They had upgraded from 9i to 10g while I was gone as well
- The commands that worked just fine for 9i had slightly different syntax in 10g
Remember a system backup is only good if it can be used to restore the system, so test restores periodically with different scenarios. This is always a great project for new hires. Following is an appendix that outlines some brainstormed possible recovery scenarios. Some of them are simple or perhaps not applicable to your environment and I am sure you can think of others that are not on this list. It is only a springboard for your use.
About the Author
Tim Boles is a Senior Manager for Hitachi Consulting. He has over seventeen years of experience as a DBA working primary in Oracle production systems. He holds a master’s degree in Computer Science from West Virginia University and an MBA in Information Systems from the University of Phoenix. He can be reached at tim2boles@gmail.com and twitter @timboles_dba.
Appendix – Possible Data Loss Scenarios
- Media Loss
- Loss of Controlfile
- Loss of a data file for a tablespace
- System, rollback segment, UNDO, user data, index, read-only, partition
- Loss of Redo Log file
- Inactive online, current online, archived
- Loss of entire redo group
- Inactive online, current online, archived, all redo groups
- Data Block Corruption
- Physical, Logical, In backup
- Loss of a BFILE
- Recovery of Entire Database
- Recovery to new file system
- Point in Time Recovery of the entire database
- Recovery of RMAN catalog
- Creation of standby database
- Creation of duplicate database on the test system
- Recovery PDB / CDB / non-CDB
- What If….
- Database crashes during backup.
- Oracle Home is destroyed.
- Entire database server replaced.
- SAN loses multiple drives.
- Database crashes during table movement.
- Database crashes during the use of Flashback Technology
- Read-Only tablespace was created before the last backup.
- Read-only tablespace was created after the last backup
- Recovery catalog is not available
- User / Software Error
- Recovery of dropped schema
- Recovery of dropped table
- Large transaction corrupts many tables
- Data corruption in the entire schema
- Data corruption in schema 5 hours old but all other schemas need to remain the same.
- Trigger or procedure is recompiled with the wrong code
- Recover previous trigger/procedure code
- Restore data corrupted by trigger/procedure