Oracle Database Backup and Recovery Overview

There is no skill more important for a DBA to have than database backup and recovery. Having a production database that cannot be restored because it wasn’t being backed up can indeed cost a DBA their job. When a production database is down, most likely executive level management will be aware of the outage and the pressure for the DBA to restore the database and get it back up and running will be great. An in depth understanding of Oracle backup and recovery is essential for any DBA who is serious about their career as an Oracle professional.

This blog entry is a high level introduction to the concept of Oracle database backup and recovery. We will not be discussing advanced add on features which address data availability such as Dataguard or RAC. We will be addressing traditional database backup and recovery. Please refer to the Oracle documentation and other blog entries for more in depth commentary.

Background

There are two types of backups in the Oracle world: a cold backup and a hot backup. A cold backup is taken while the database is down and unavailable. A hot backup is taken while the database is up and available.

There are also two types of methods for backing up an Oracle database: User managed backup and recovery and Server managed backup and recovery. User managed backup and recovery is when the DBA manually performs steps to backup and recover an Oracle database. Server managed backup and recovery is when the DBA uses a utility known as RMAN, Recovery MANager, to backup and recover an Oracle database. The details of the RMAN utility are out of the scope of this blog.

There are two types of Oracle database recovery: crash and media recovery. Crash recovery, also known as instance recovery, occurs automatically by the Oracle instance. More specifically, it is performed by the Oracle server background process SMON. This is the type of recovery that would be necessary after a sudden crash of a database instance. Reasons for the crash could be a sudden halt of the operating system of the database server e.g. system kernel panic/dump, server power failure, hardware failure, memory exhaustion, etc. It could also be the result of the DBA shutting down the database with the ABORT option.

When the database is shutdown gracefully, via one of the following shutdown options (NORMAL, IMMEDIATE, TRANSACTIONAL), one of the things Oracle does is write a consistent SCN, System Change Number, to the header of each datafile in the database. This ensures that at the time of database shutdown all of the datafiles are consistent. On a subsequent Oracle instance startup, when Oracle goes to OPEN the database, it will check this SCN in the header of all datafiles. It will check to see if they all match and are also consistent with the SCN which is in the controlfile. If they are consistent then that means that there was a clean shutdown prior so the database opens right up without any kind of recovery. There are some other flags that get set internally as well which Oracle will use to determine if recovery is needed on startup. If Oracle is started back up after a server crash for example, then it will see that the SCNs in the datafiles do not match what is in the controlfile or with one another. This is indicative of a database which was not shutdown gracefully prior and it needs to be recovered in order to be made consistent. At this point Oracle will automatically perform crash recovery. During crash recovery committed transactions are applied to the datafiles from the online redo log files. The crash recovery process also rolls back uncommitted transactions using its undo data. An inconsistent database cannot be opened. Crash recovery can be thought of as Oracle performing the cleanup that a prior graceful shutdown should have performed.

Example of Online Redo Log Configuration

This database has three log file groups. Each group has three members/files each. Each member/file is a mirrored copy of the others in the same group. This is called multiplexing and is how we protect our online redo log files.

01

Media recovery is something that is more manual for the DBA. This is when there is damage to the physical files which make up the database, e.g. datafiles, control files, online redo log files. An example of this could be disk subsystem failure or someone either maliciously or accidentally deleting these files. Media recovery is restoring the datafiles from a backup.

An Oracle database can run in one of two modes: ARCHIVE log mode and NOARCHIVE log mode. It is a field best practice that all production Oracle databases run in archivelog mode. This will give the DBA the ability to roll a database forward in the event of a media failure. This mode will also give you the ability to run online/hot backups. These two features are not available when the database is not running in ARCHIVE log mode.

If the database is not running in ARCHIVE log mode then the only media recovery possible would be a point in time recovery as of the time that the database backup was taken, meaning at the time of the backup but all subsequent transactions will not be recovered. The roll forward process is the process of replaying transactions in the archive log files and in the online redo log files to bring the database to its most current state. A good example of this would be a scenario where we schedule a database backup to occur at 1am. Here are the steps of the backup shell script:

Database in NOARCHIVE Log Mode

  1. The data dictionary is interrogated to determine the physical location of the datafiles, controlfiles and online redo log files.
  2. Database is shutdown gracefully with the IMMEDIATE option.
  3. The physical datafiles, controlfiles and online redo log files are copied from their location to the backup location using an OS utility, i.e. cp.
  4. The database is started back up and OPENED.

Now let’s say the backup ran from 1:00am to 1:30am, it took 30 minutes to complete. At 12pm noon, 11 hours later, we have a media failure. Our datafiles are deleted accidentally, OOPS! You go to your trusty cold backup taken ~11 hours earlier. You restore the datafiles, controlfiles and online redo log files from the backup location and open the database. Now there is another problem. The users of the application notice that the data is ~11 hours old. They have lost all work from ~1am to now.

This happened because when the database is not running in archivelog mode it writes to the online redo log file members in one redo log group and when that group fills, it moves onto the next group, when that group is filled it then moves onto the third group. Now what happens when the third group is filled? It needs to circle back to the first group. In no archive log mode, all changes to the first group will be overwritten because Oracle needs somewhere to write transactions or it is useless.

However, if you run your database in archive log mode, then when an online redo log group fills it is then archived to an file in another location to keep the redo in that file safe from being overwritten. The file is known as an archive log file and the location is known as the archive log destination. Once Oracle writes an archive log file, that file is offline meaning that Oracle will never write to it again. The archive logs contain a chronological history of all data changes made to the database since the last backup. Oracle will keep doing this.

Example of How Oracle Fills Redo Log files in NOARCHIVE Log Mode

The first group of online redo log files fill up with redo:

02

Oracle switches to the second group of online redo log files and fills them up with redo:

03

Oracle switches to the third group of online redo log files and fills them up with redo:

04

Oracle has no more online redo log groups to switch to so it goes back to the first one. It has no choice but to overwrite group # 1, rendering that redo data gone.

05

Database in ARCHIVE log mode

  1. The data dictionary is interrogated to determine the physical location of the datafiles, controlfiles.
  2. The tablespaces are put into hot backup mode – (Details of this are out of scope of this blog).
  3. The physical datafiles are copied from their location to the backup location while the database is up and running using an OS utility, i.e. cp.
  4. The tablespaces are taken out of hot backup mode – (Details of this are out of scope of this blog).
  5. The controlfile is backed up using a backup command.

Let’s take the same file deletion scenario as above. You restore the datafiles and controlfile from the backup location but before you open the database you issue a command to perform media recovery. Media recovery will open each archive log file and apply their changes to the restored datafiles sequentially until your specified criteria. Our criteria will be all archive logs/complete.

Since Oracle applied all the committed changes found in the archive log files created since the 1am backup and from the online redo log files, which were not deleted in this scenario, when you open open the database after the recovery, you will have all of the database changes since 1a.m. Your users will see what the database looked like up until the datafile deletion. This result is typically the desired outcome.

Example of How Oracle Fills Redo Log files in ARCHIVE Log Mode

The first group of online redo log files fill up with redo. However, this time when Oracle switches to the next group of online redo logs, it makes a copy of the redo which is in group # 1 and generates an archive log of that online redo log file. The archive log generated will get a sequence number, i.e. 100. This is done by the Oracle server background process LGWR. Now the first group is available to hold redo again.

06

Oracle does the same thing with the next group. The archive log generated will get a sequence number, i.e. 101. Now the second group is available to hold redo again.

07

The same thing with the next group. The archive log generated will get a sequence number, i.e. 102. Now the third group is available to hold redo again.

08

It is important to note that if you were to lose an archive log, let’s say 101, then you will have to use incomplete recovery and stop the recovery right then and there. You can’t skip over archive logs during recovery. They are sequential and hold a chronological history of database changes. Archive logs are extremely important and should be backed up frequently, multiple times a day and should be made redundant. Oracle allows you to specify many archive log destinations to ensure you have them.

Two Types of Media Recovery that can be Performed in ARCHIVELOG Mode

The two types of media recovery are incomplete and complete. Complete recovery is when you haven’t lost your online redo log files or any of the archive log files generated. Incomplete recovery is when you are missing archive log files or online redo log files.

Since you don’t have all of your redo you can only recover up to a certain point. There are three types of incomplete recovery you can perform:

  • SCN based – You instruct Oracle to recover your database to a certain SCN. This is specific point in the database’s life.
  • Time based – You instruct Oracle to recover your database to a specific date and time.
  • Cancel based – You instruct Oracle to stop recovering at this moment because you don’t have any more redo to apply. Open the database as is – “What it is, is what it is”.

Complete Recovery Example

In our scenario, we only lost the datafiles not the online redo log files. Because of this we can perform what is known as a complete recovery. A complete recovery is when you still have the online redo log files or their multiplexed members/copies available. This will allow you to not only apply all of the archive log files, effectively applying all database changes since the hot backup up until your last archive log but you will also have up to the latest committed transaction applied as well. That means if the last data change the application made and committed was at 11:59am then when you open the database, your database will look just like it did at the time of the datafiles being deleted. You would not have lost any committed transactions.

Example of complete database recovery:

09

Incomplete Recovery Example
If you lost all of your online redo log files because they were all deleted or you didn’t multiplex them then you will have to perform an incomplete recovery. An incomplete recovery will allow you to stop the recovery up until the point/the archive log that you know you have. For example if the last archive log was created at 11:45am and our media failure occurred at 12:00pm noon, then that means that after Oracle is done applying all of the redo from the archive log files, it will then open the database and now the application users will see data as of 15 minutes prior to the failure as opposed to up to 11:59am, right up to the failure. That is because the online redo log files are gone.

After an incomplete recovery you would open the database with the option RESETLOGS. This option tells Oracle to automatically recreate the online redo log files and start new. The only two ways to safeguard against losing any committed transactions would be to either multiplex/copy your online redo log files or implement Dataguard. You can see from this scenario how important online redo log files are to the recovery process of an Oracle database.

Example of incomplete database recovery:

10

In conclusion, I think it is pretty obvious how important database backup and recovery is. It is your number one job as a DBA to safeguard your employer’s data. If data is stored in an enterprise level database such as Oracle, you can be sure that it is considered to be very important.

A backup and recovery strategy should never be second thought in any Oracle shop: it is your lifeline.

 

Leave a Reply

Your email address will not be published. Required fields are marked *