Microsoft SQL Server Database File Structure (SQL Server 2008 R2)

Note: this article originally appeared as article #46 on ugts.org. I have since moved it to Blogger since it was the only resource that most people wanted from the website.

A database data file (MDF or NDF) in SQL Server is stored using the format defined by the SQL Server Storage Engine. Database files use a moderately simple format with a small set of data structures to store all the objects in an SQL Server database.

A database file at its simplest level is an array of 8KB pages - bytes 0x0-0x1FFF in the file belong to page ID = 0. Bytes 0x2000-0x3FFF belong to page ID = 1, etc... Page ID values start at 0 and count upward. Page IDs are 32 bit signed values, meaning that the largest possible database file in SQL server is 2^31 pages * 2^13 bytes per page = 2^44 bytes = 16TB. This is a limitation that has been present from SQL Server 2000 up to 2008 R2.

Pages are the fundamental unit of allocation for data stored to disk. Every object and structure is the database has a set of pages that it are allocated for it, and two objects cannot share the same page. Note that database log files (LDF) do not have pages - they have a much simpler structure - they are just a sequence of log records. Also note that database data and log files do not have to have MDF, LDF, and NDF file extensions for SQL server to be able to use them, but that is the convention: MDF for the primary data file, NDF for secondary data files, and LDF for log files.

Eight consecutive pages form an extent (64KB). An extent is the usual size for a disk I/O read or write operation, since larger reads and writes allow disk access to perform more optimally. If all 8 pages in an extent belong to the same object, the extent is said to be uniform. If they belong to different objects, then the extent is mixed.

A uniform extent is preferrred to a mixed extent for performance because the fewer the extents, the less disk I/O is required to read or write the object, and less metadata is needed to track the allocation of the object. However, uniform extents are not used when the total size of the object is less than a single extent, because that would waste far too much space if the database had a large number of very small objects. The first 8 pages for an object will be allocated individually from mixed extents, but after that, only uniform extents will be allocated for the object.

As noted, a database can have multiple files beyond just a single MDF file - the additional data files have an NDF file extension. The File ID is used to indicate which file in a list of database files is referenced, and it starts at 1 and counts up to 2^15-1 = 32767. This means that a database can have up to 32767 files, for a total limit of 512PB per database. FileID = 0 is an invalid value which typically means 'not used'

The combination of a PageID and a FileID is a 6 byte value called a page locator that uniquely identifies a page in a database. A page locator is written as FileID:PageID, such as 1:142 to indicate FileID = 1, PageID = 142. 0:0 typically means 'no page', or 'not used'. For the purposes of this article, a page locator of *:7 would mean page 7 in each database file.

An RID value (Row ID, also called a row locator, record locator, or Record ID) is a page locator plus a 2-byte slot index, for a total of 8 bytes. They are used to uniquely identify a table or index row within a page. The slot index starts counting at 0 up to the number of data rows on a page, and will be described in more detail. An RID is typically written in the form: FileID:PageID:Slot, such as 1:142:0 to indicate FileID = 1, PageID = 142, slot = 0.

Since SQL server is a transactional database system with logging and rollback capability, and it desigend to be maximally responsive at all times to the maximum number of users. To facilitate this, data is generally only moved around within a database and updated to the minimum extent necessary to get the job done quickly and to make it easy to undo the change, should that be necessary. This means that database files are never automatically shrunk or compacted - unused pages will sit around in a file until they are reallocated. If a data row is deleted from within a page, the space is simply deallocated but the data where the record used to be is not cleared. This means that the page locators and RID values do not have to be updated most of the time, and undo-ing an operation is usually as simple as flipping a few bits back to their previous state. It means that database structures don't move around very much in a database, and if they get corrupted, you can sometimes restore a backup file to compare page contents before and after the corruption to see what went wrong. It also means that databases can become fairly large and fragmented over time, and that scheduled maintenance is sometimes necessary to compact and defragment the database when it gets out of hand.

SQL Server provides a few undocumented DBCC commands - DBCC IND and DBCC PAGE - that allow you to view the internal structure of a database page or object. Alternatively, you can always take the database offline, detach the database files, and view them in a hex editor if you want to edit them directly, but that is both inconvenient, requires downtime, and lacks the parsing capabilities built-in to the DBCC commands.

To use these DBCC commands, you must first turn on trace flag 3604 (which redirects error log messages to the standard output) as follows:

DBCC TRACEON (3604)

DBCC PAGE ([dbname|DbID], [FileNum], [PageNum], [printOpt])

The printOpt parameter is a type of verbosity - it can be from 0-3, with 3 being the most verbose. However, if you try to use printOpt = 3 on a free (unused) page, DBCC PAGE will fail. The command will also obviously fail if you use a FileNum or PageNum outside the range of valid values allocated for the database files. To find out which PageID values belong to a database object, use the DBCC IND command:

DBCC IND ('[dbname]', '[objectname]', [indexid])

For example, if you have a table 'Beta' in database 'Alpha', to view the contents of the clustered index (IndexID = 1), you would use the following:

DBCC IND ('Alpha','Beta',1)

This will return a table of all the pages that belong to the selected index of the table Beta in the database Alpha, and you can then call DBCC PAGE on the PageID values to view them.

When you run the DBCC IND command, you will see a PageType column. There are 15 different known PageType values in SQL server, most of which can be broadly classified into three types - data, status, and options:

Unused - 0 - page is unused
Data - 1 - used for table data (leaf pages of clustered indexes or heap pages)
Index - 2 - used for index rows (non-clusterd index pages, and/or upper levels of indexes)
Text Mix - 3 - stores multiple LOB values and indexes for LOB B-trees
Text Tree - 4 - stores data for a single LOB value
not used (?) - 5/6
Sort - 7 - stores intermediate results during a sort operation
GAM - 8 - global allocation map, tracks extents which are completely unused - the first one is at *:2 in each file, and there is for every 4GB in a database file.
SGAM - 9 - shared GAM, tracks mixed extents with free space available - the first one is at *:3 in each file, and there is one for every 4GB in a database file.
IAM - 10 - index allocation map, tracks space used by a given 'allocation unit'.
PFS - 11 - page free space - tracks various page statistics - the first PFS page is at *:1 in each file.
not used (?) - 12
Boot - 13 - boot information page at 1:9 - stores global status for the database, similar to the file header.
Instance Header - 14 - present only in the master database at 1:10. Stores the various settings you see when getting properties on an SQL server instance.
File Header - 15 - the top level page at *:0 which describes each database file.
DCM - 16 - differential changed map - tracks which extents have changed since the last full backup in order to do a differential backup - the first is at *:6 in each file.
BCM - 17 - bulk changed map (also called the ML map) - tracks which extents have been involved in a bulk / minimally logged operation since the last full backup. Like the DCM page, this allows you to do a differential or incremental backup even though not all the operations since the last full backup have been logged. The first BCM page is at *:7 in each file.

Every page in the database that has been initialized will have a 96 byte header, and the remaining part of the page (8096 bytes) is for data content specific to the type of page. The 96 byte header has the following format (as determined by Mark Rasmussen and Paul Randal):

0x00: Header Version (m_headerVersion) - 1 byte - 0x01 for SQL Server up to 2008 R2
0x01: PageType (m_type) - 1 byte - as described above, this will be 0x01 for data pages, or 0x02 for index pages.
0x02: TypeFlagBits (m_typeFlagBits) - 1 byte - for PFS pages, this will be 1 in any of the pages in the interval have ghost records. For all other page types, this field is ignored.
0x03: Level (m_level) - 1 byte - for a B-Tree, this is the level in the tree, with 0 being the leaf nodes. For a heap (which is just a flat list), this value is ignored.
0x04: FlagBits (m_flagBits) - 2 bytes - various page flags: 0x200 means the page has a page checksum stored in the TornBits field. 0x100 means torn page protection is on, and has detected an error.
0x06: IndexID (m_indexId (AllocUnitId.idInd)) - 2 bytes - the idInd member on the allocation unit (similar to index_id on sysindexes, but not the same, see the section on allocation units for details)
0x08: PrevPage - 6 bytes - PageID:FileID of the previous page at the same level in this B-tree, 0:0 if this is the first page.
0x0E: PMinLen (pminlen) - 2 bytes - the size in bytes of the fixed length part of the data records on this page.
0x10: NextPage - 6 bytes - PageID:FileID of the next page at the same level in this B-tree, 0:0 if this is the last page.
0x16: SlotCount - 2 bytes - the number of entries in the slot array (though some of them may be deallocated).
0x18: ObjectID (m_objId (AllocUnitId.idObj)) - 4 bytes - in SQL 2000 and earlier, this held the ObjectID. In SQL 2005 and higher, this holds the idObj member of an allocation unit ID, which is not usually the same as sys.objects.object_id. For system tables, it this is the same, but for user defined tables, idObj continues to be sequentially assigned, whereas object_id is a random 32-bit number. See the section on allocation units for details.
0x1C: FreeCount (m_freeCnt) - 2 bytes - the total number of bytes of free space on this page, not necessarily contiguous.
0x1E: FreeData (m_freeData) - 2 bytes - the offset to the next available position to store row data on this page.
0x20: ThisPage (m_pageId) - 6 bytes - PageID:FileID of this page - not really necessary because the position in the file determines this, but helps to detect database corruption.
0x26: ReservedCount (m_reservedCnt) - 2 bytes - the number of bytes that have been reserved by open transactions to allow for rollback and to prevent that space for being used for any other purpose.
0x28: LSN (m_lsn) - 10 bytes - the LSN of the last log record that changed this page.
0x32: XactReserved (m_xactReserved) - 2 bytes - the amount that was last added to m_reservedCnt.
0x34: TransactionID - XdesID - 6 bytes - the ID of the last transaction that affected ReservedCount.
0x3A: GhostRecord (m_ghostRecCnt) - 2 bytes - the number of ghost records on this page.
0x3C: TornBits (m_tornBits) - 4 bytes - this will either hold a page or torn bits checksum, which is used to check for corrupted pages due to interrupted disk I/O operations. This is described in greater detail later on.
0x40: Reserved - 32 bytes - these don't appear to be used for anything, and are usually zero.

The advantages of having the common format for the page header for all page types are:

A single place to identify the type of page, and which database object owns it.
A standard way to define a linked list of pages using the PrevPage, ThisPage, and NextPage page locators, when one page is not enough to hold all the data.
A standard way to store and compare checksums to detect data corruption at the page level.
A standard way to indicate that records have been deallocated without having to delete them immediately, marking them for future garbage collection.
A standard way to track LSN values to rollback or rollforward changes due to transactions or database restores.

As shown in the header's AllocationID field, a page can only belong to a single allocation unit. An allocation unit is an (ObjectID, allocation type, partition) tuple, where allocation type is one of IN_ROW_DATA, LOB_DATA, or ROW_OVERFLOW_DATA. All data pages will have an IN_ROW_DATA allocation unit, and the other two are optional.

LOB_DATA holds chunks of values greater than 8KB in size. ROW_OVERFLOW_DATA holds values too large to be stored in-row and have to be stored in a separate page. Partitions are an enterprise level SQL Server feature allowing very large objects to be artificially split across boundaries for improved performance. They are discussed further on.

SQL Server 2008 R2 defines 25 different types of objects, including: table, stored procedure, view, trigger, synonym, function, etc.... Each database has system tables that contain the metadata about the objects in the database. In SQL Server 2000, you could query and view (and modify) the contents of system tables directly, but in SQL Server 2005, you aren't allowed to do this anymore, and you must view the information through the views that Microsoft supplies. Both the sys.objects and dbo.sysobjects views will show you all the objects defined in a database:

SELECT * FROM sys.objects

Every database object has a unique Object ID. Object IDs are unique 32 bit numbers assigned semi-randomly, except that system objects have ID values <= 100 in SQL Server 2008 R2 and earlier, with a few exceptions.

LOB means a Large OBject, which is anything over 8KB in size, such as varchar(max), text, varbinary(max). SLOB means Small LOB, which is a value not longer than 8000 bytes, but which must be stored on a different page via row overflow. CLOB means Character LOB (such as text and varchar), whereas BLOB means Binary LOB (such as image and varbinary).

Note that some pages in a database may not be initialized. In this case, the page will have garbage data both in the header and data portion of the page. Other pages may explicitly have a page type of 0. In this case, only the first two bytes of the page can be relied upon, and the rest of the data on the page may be zeroes or garbage.

The most commonly used page types in a database are Data, Index, Text Map, Text Tree, and Index Allocation Map (IAM). Let's start by looking at data and index pages.

Data and Index Pages

Table data is usually stored as a B-Tree ordered by the table's clustered index. If the table has no clustered index, then the data is stored as a simple unordered list of data pages. The leaf pages of a clustered index, and the pages in a heap hold the rows of the table. The upper level of a clustered index, and all levels of a non-clustered index store only the colums in the index, and are used to look up data rows by the index. Because of this, data pages are sometimes referred to as Heap or B-Tree pages (HoBT). Data and index pages share the same structure, with only minor variations.

The 8096 byte data content area of a data or index page is used to store an array of data/index records. At the end of page is a slot array of 2-byte values, each holding the offset to the start of the record. The slot array grows backwards as records are added.

The remaining space not counting the slot array is used for data records. Microsoft officially quotes that the size limit for a data record is 8060 bytes, instead of 8096 bytes. This has probably been done for safety to reserve extra space for header information should that ever be necessary. By limiting a data record now to 8060 bytes, the space can be used in a later version of SQL server without breaking any data records near that limit. Note also that the 8060 byte limit is just an upper limit - you won't ever get more than 8060 bytes without row overflow or LOB storage, and for a complex row with lots of columns and variable length fields, you will probably have less than 8060 bytes as the limit due to the NULL bitmap and the variable length column data lengths and offsets that take up extra space in the record.

Each row of data is structured as follows:

Status Byte A - 1 byte - a bit mask with the following information: Bit 0: not used. Bits 1-3: type of record: 0 = data, 1 = forward, 2 = forwarding stub, 3 = index, 4 = blob or row overflow data, 5 = ghost index record, 6 = ghost data record, 7 = not used. Bit 4: record has a NULL bitmap. Bit 5: record has variable length columns. Bit 6: record has versioning info. Bit 7: not used.
Status Byte B - 1 byte - Only present for data records - indicates if the record is a ghost forward record.
Offset past Fixed-Length Columns - 2 bytes - the offset to the end of the fixed length column data where the number of columns in the row is stored.
Fixed-Length Column data - data for fixed length columns only (not including varchar), densely packed with no padding between columns. If a given column has a NULL value, then the full space for the column is still used, but the content will be zeroed out according to the default value for that data type (usually zeroes, but for strings it will be a sequence of space characters). SQL server uses the space for the column regardless of whether or not the value is NULL so that it doesn't have to read the NULL bitmap to determine what offset to use to get to the column data, and so that the row does not need to be resized or moved if the NULL status on a fixed-length column changes. On the other hand, this does waste a fair amount of space for large datatypes such as datetime, bigint, char(N), etc...
Number of columns in this row - 2 bytes. It may seem odd to you to have to specify this, since the number of columns is defined at the table level, and not the row level, but this allows you to change the schema for a table without having to resize or move all of the existing rows which have been densely packed together. The new column is simply added to the end of the table, and existing rows do not need to be modified, resized, or moved - the column count shows that the older row just doesn't have the new columns.
NULL bitmap - this is a bitmap showing which columns have NULL values. Size = # of columns / 8, rounded up to the nearest whole number. The NULL bitmap is always present, even if the table has no NULLable columns, but if nothing is NULLable of the columns that are present for this row (see the previous field), then the NULL bitmap will be all zeroes. The first column stores a 0 for non-NULL or a 1 for NULL in the least significant bit, and so on for all the columns, even the columns which are not allowed to store NULL values. The bitmap is stored this way so that you can change a column from NULL to NOT NULL and back again without requiring that SQL server rewrite the NULL bitmap values of all the rows and possibly have to resize some rows if the size of the bitmap grows or shrinks in number of bytes. Note that unused bits in the bitmap can be set to either 0 or 1 depending on the version of SQL server and/or the data that may have been in the page prior to the columns changing from NULL to non-NULL. The value of these unused bits is irrelevant.
Number of variable length columns - 2 bytes. If the row has variable length data, then this stores the number of variable length columns.
End Offsets of Variable length columns - 2 bytes for each variable length column. This stores the offset of the first byte past the end of the column data, relative to the start of the record. If this two byte value has the high-bit set to 1, then the column is a complex column (such as a sparse vector), otherwise it is a simple column. Sparse vector columns are described in detail in the section on filtered indexes and sparse columns.
Variable length column data - the variable length columns which are stored in-row (not on other pages) are stored densely packed in sequence, without needing any null terminating sequences in-between the columns, because the lengths were stored just before the values. If the row has a 'Uniquifier' column (described later), then there will be a NULLable variable length column before all the other variable length columns in the list.
Padding - if the row size is less than 9 bytes, then the length of the row will be padded to be 9 bytes long. This is done because a forwarding record is at least 9 bytes, and SQL server must be able to replace a primary record with a forwarding record without having to restructure the page and the indices that point to the row.

Looking at this data structure, you can see that the minimum size for a row of data, not counting space used for indices, is 9 bytes, plus 2 bytes for the entry in the slot array. It would be 8 bytes if you could have a row with a single byte value, without needing the padding. But since the padding is needed, it makes no difference whether the row requires one byte or two bytes of data - it still uses 9 bytes by the storage engine. In practice, you rarely have a table smaller than 4 bytes, so this is not an issue.

Every row has at least 9 bytes of overhead space on a page, and most rows will also have variable length data, so most rows will have 11 bytes of overhead. In practice, there is usually a fair amount more overhead than this. In order from the greatest to least impact:

Indices - a table usually has one or two extra indices defined on it, and these indices use up space proportional to the sizes of the columns in these indices. Sometimes there can be so many indices on a table that more space is used to maintain the secondary indices than is used to store the table data!
Empty or Sparse Pages and Unused Space - the database files will have unallocated pages as the files grow, and partially filled pages as records are deleted. This free space can become quite substantial over time. Also, even if the data page is filled end to end with records, it is rarely completely full, because SQL server will not split the in-row data portion of a row between two data pages. Furthermore, in practice, most tables in a real production database will have a clustered index, and rows in a clustered index are forced to be listed in the order of that index. So even though it might be possible to re-arrange rows between pages so that less space would be wasted, SQL server will not store the rows out of order. And for tables without a clustered index (i.e. a heap), you have absolutely no guarantees about the order of rows, and SQL server may or may not try to re-order the rows to save space.
Page Header - the page header uses up 96 bytes out of the 8192 bytes on a page, which is 1.2% overhead.

Index pages have the same general format as data pages, except that they only contain the columns in the index, and either the columns in the clustered index or the RID of the data record the index row points to. They may contain included columns if the index has some of these. There are three variations in the format of an index page (clustered leaf pages are the data pages for the table rather than index pages):

Clustered non-leaf - the pages for the upper levels of a clustered index
Non-clustered leaf - the leaf pages for the non-clustered index
Non-clustered non-leaf - the upper levels of the non-clustered index

The variations of an index record compared to a data record are as follows:

No Status Byte B (-1 byte) - Index records don't need a status byte B, so this is omitted, except for the clustered index pages.
No Fixed Length Offset (-2 bytes) - Index records all have the same number of columns, except in cases where there is a 'UNIQUIFIER' (described below), and the optional uniquifier is placed in the variable length columns area. This is because (unlike data pages) if an extra column is added to the schema, the index is rebuilt immediately so that all the index rows have the new number of columns. Therefore the fixed length offset past the fixed length fields is not needed because all rows for an index need the same amount of allocated space. However, for the clustered index, the non-leaf level pages DO have a fixed length offset field.
No Padding (-1 byte) - Index records don't ever need to be replaced with forwarding records, so they don't need padding.
Minimum of 2 columns (+1 byte) - An index record needs 2 columns at a minimum (the index column and the primary key of the row being referenced), unlike a data record which could have only a single column. The minimum size would be reached with a single 1-byte index column (such as bit / tinyint / char(1) / binary(1)), and a single 1-byte primary key.

Clustered index pages at non leaf level only have the columns in the clustered index, because the index columns are the clustered index columns and there is no need to have them in there twice. Non-leaf level index pages also have a 6 byte page locator to point to index pages at the next level down. Included Columns (new for SQL 2005, described further on) are only present at leaf level. Note that the B-trees for indexes in SQL server are not exactly like typical B-trees because the 'separator' records at the higher levels are repeated at the lower levels, and the lowest (leaf) level has index records for all of the rows, including the separators. This simplifies the procedure for the lookup of index records, but wastes of a small amount of storage space to store the separators at all levels of the tree.

Note that the clustered index key columns are present also on the non-leaf pages of non-clustered indexes, not because it is needed there, but most likely because it simplifies the code that data structure, and does not require significantly more space to store, seeing that there are relatively few non-leaf index records.

The minimum size for an index record is 6 bytes (compared to 9 bytes for a data row) for an index record on a non-clustered index at leaf level, where the only columns present are a 1-byte clustered index and a 1-byte index column, with a status byte A, a number of columns 2-byte value, and a NULL bitmap byte.

Normally the index records will contain the clustered index columns to lookup the rows. But if the table has no clustered index, then the table is a heap, and RID values will be used instead to lookup rows. Since rows in a heap are not moved or ordered in any way, RID values stay fairly constant, though a heap can become heavily fragmented over time as records change size or are deleted.

Note that a clustered index does not have to be unique. If the index looks up records by a clustered index, and it is not unique, then SQL server has to add information to the data and index rows to make the clustered index values unique. This extra column that gets added is called a 'Uniqueifier', which is just a 32-bit int which counts from 1 upward and is stored as the first column in the variable length columns list. Because the uniquifier is in the variable length columns list, it can be omitted for the rows where it is not needed. When your table requires a uniquifier, it adds an overhead of two bytes to every data and index row in your table, because the length of the uniquifier field must be stored for every row, and for the rows where it is present, there is additional overhead of 4 bytes.

Note that while a clustered index doesn't have to be unique, a primary key must be unique and the columns in the primary key must not allow NULLs - SQL server will not permit you to create a primary key which has NULL columns or non-unique column tuple-values. And if you try to designate columns as part of the primary key in the designer, the UI will automatically uncheck the checkbox to allow NULLs for that column.

Note also that a table can have a clustered index or not, and it can have a primary key or not, and it can have both a clustered index and a primary key and they can be on the same index or different indexes. But there can be no more than one primary key for the table, and no more than one clustered index for a table. The clustered index is mainly concerned with the order of the storage of the rows in the data pages, and the primary key is mainly concerned with uniquely identifying the rows. You can have one without the other, or both, or neither, and you can have both and have them on separate indexes. Indexes will lookup always lookup records by the clustered index value (with optional uniquifier), unless the table has no clustered index, in which case the lookup is by RID to a heap.

Index Column Limitations

The maximum number of columns in an index (clustered or not) is 16, and the maximum size allowed for a set of index columns (clustered or not) is 900 bytes. If you try to insert or update a row so that any index exceeds this limit, the operation will fail. It is supposed this limit was made so that every page in the B-tree in the worst case of a 900 byte index and 900 byte clustered index could still hold at least 4 index records (record size 1800+ bytes out of 8096 available in a page), and the B-tree does not have too many levels to walk through to reach the leaf level 0. This is still true even in SQL Server 2005 with Included Columns (described below) because included columns are not included in non-leaf level records, and so at least 4 index records can fit on a page at non-leaf level.

Included Columns

Starting with SQL Server 2005, an index can be defined with included columns. You cannot edit the list of included columns through the GUI (the property will be grayed-out / read-only), but you can define them if you create the index using T-SQL. An included column is included in the index records at leaf level like any other index column, but it is not used to look up data - it is only present as an optimization so that queries don't need to go read the actual data page if all they need are columns that are already included in the index record. Furthermore, a computed column can be defined as an included column. As usual, this is a performance / disk space and memory trade-off. Indexes should be as small as possible, so it is not wise to add extra included columns unless the performance gains are very noticeable.

Computed Columns

Computed columns are dynamically computed from the values of other columns in the same table. A computed column can optionally be marked as PERSISTED, in which can the column's value is stored on the data page like any other column and is only re-computed when the dependent column values change. If a computed column is not persisted, then the column value is always calculated at run-time, which saves space, but somewhat at the cost of performance. Computed columns can only be used in an index in certain cases, either in the key, or as an included column. If a computed column is included in an index in any way, the value is persisted in the index records rather than being computed dynamically, since having to refer to the data page to calculate the column value dynamically would defeat the purpose of the index.

Filtered Indexes, Sparse Columns, and Complex Columns

Two new features in SQL Server 2008 that go hand-in-hand are sparse columns and filtered indexes. A filtered index is an index with a where clause added, so that only the data rows that satisfy the where clause are indexed. Filtered indexes are most often used to exclude NULL values from being indexed, when only a small fraction of the rows in the table have non-NULL values for the column that is being indexed. This can yield a substantial savings in space used by the index.

Sparse columns are columns that are only included in the row record when the column's value is non-NULL. This is slightly different than the way variable length columns are stored. With a variable length column, 2 bytes are always stored for each column to keep track of the length of that column, even if it is 0, and the NULL bitmap always has a bit allocated for the column to keep track of whether the column has a NULL value or not. In contrast to this, sparse columns use no extra space in the row record if the value is NULL, but they do have extra overhead when the value is not NULL. As mentioned before, sparse columns are stored in a new special structure in the data row called the sparse vector, which is stored within a complex column in the variable length columns list. Here is what the sparse vector contains:

Column header - 2 bytes - this will have the value 5 for a sparse vector.
Sparse column count - 2 bytes - the number of sparse columns in the vector.
Column ID set - 2 bytes per sparse column present - the column IDs of the sparse columns that are present (giving the limit of up to 2^15 columns per table)
End Offsets of Sparse Columns - 2 bytes per sparse column present - this is a list of the end offsets of each sparse column value, just like the variable length column offsets.
Sparse column data - the actual sparse column data, arranged just like the variable length column data.

So we see that the sparse vector has an overhead of 4 bytes per sparse column that is present, plus the actual sparse data, and no overhead if the sparse column is not present (it does not even appear in the NULL bitmap). This can be constrasted with NULLable variable length columns. Even if a variable length column is NULL, it still requires 2 bytes in the variable column section to keep track of the length of the column, and it takes a bit in the NULL bitmap. Note that SQL server does a slight optimization for variable length columns in that if you have 4 such columns, but only the first three columns have non-NULL values, then the 4th column will not be stored, and the column count will only be 3 for that record, but in most cases, nearly all of the variable length columns will require 2 bytes of overhead per column. One could argue that SQL server ought to know to exclude the NULL columns because it has the NULL bitmap, and 1/8 of a byte per column is usually good enough for minimizing overhead, but SQL server does not do this as of 2008 R2. If less than about 20% of the values in a column are non-NULL, then a sparse column is clearly the way to go to save space.

Complex columns in general start with a single byte for the column header to indicate what kind of column it is. 1 = LOB pointer, 2 = row-overflow pointer, 5 = sparse vector.

Foreign Keys and Indexes

Foreign key constraints in SQL Server are not indexes and do not automatically create indexes in the referenced tables. Now suppose that you have added a foreign key constraint to prevent deletion of a record if there is a reference to it in another table. When this constraint is in place, every time you do a delete of a record in the parent table, SQL server has to join to the referencing table to make sure there are no references to the record you are trying to delete. And if this foreign column has no index on it, then this join has to do a table scan as opposed to an index lookup, which can be incredibly slow. In general, you need to be wary of this with foreign key constraints and almost always make sure that the foreign columns have indexes created.

Heaps and Fragmentation

When a table does not have a clustered index, the rows in the data pages are stored as a heap, where the rows have no particular order. Rows are usually added to the end of the heap in the order that they are inserted, but may sometimes be inserted into free space in between existing rows. If a row is updated such that it has to grow (increase in variable length data), then the row is taken out of its current position and moved to the end of the heap, and the slot array is updated to show the new position of the row. If there is no space on the current page for the updated row, then the record becomes a forwarding stub to the actual record which is moved to a different page, so that the RID values in the indexes don't have to change (which would cause a lot of extra logging). If a row is deleted, then the deleted row is simply cleared from the slot array by writing a zero to the slot array position. Heaps do not get automatically compacted, and over time they can become very fragmented.

Indexes and the Fill Factor

Indexes are stored a B-trees, and when the index is initially created, a page will have a maximum number of index rows that it can hold. By default, SQL server will try to fill each page to capacity to minimize the number of pages needed overall and minimize the size of the index. However, a completely full index causes a large initial number of page splits as rows are updated or deleted. Specifying a fill factor of less than 100 percent causes SQL server to fill each page to less than capacity when the page is index is initially created or rebuilt. This may be good for speed in the short term for tables with lots of write activity. It is only of temporary value because eventually all index pages average around 75%.

Forwarding Stubs

As mentioned in the section on heaps, if a row in a heap grows too much to fit on its original page any longer, it will be replaced with a 9-byte forwarding stub. This stub is simply the Status Byte A (1 byte) which will be 0x04, followed by the new RID of the actual record (8 bytes). Note that tables with clustered indexes do not have or need forwarding stubs, since these tables reference records by the clustered index key rather than the RID value.

Allocation Units / Partitions

As mentioned before, each data page points to an allocation unit through the IndexID (idInd) and ObjectID (idObj). An allocation unit ID uniquely identifies a group of data pages for a heap or index. Allocation unit IDs are 64-bit values where the uppermost 16 bits are the idInd (IndexID) value, the middle 32 bits are the idObj (ObjectID) value, and the lower 16 bits appear to be a partition id (which is always 0 in non-enterprise editions of SQL server).

If you have multiple partitions on an index, then each partition (by design) gets its own separate storage for data pages, and (by design) each partition must be searched separately for records - which can be a speed optimization or penalty depending on how well the partition is defined for large tables. Because there is one allocation unit per partition in a partitioned index, allocation units are sometimes called partitions by SQL server.

Note that the idObj values are not the same as sys.objects.object_id values - idObj values are assigned sequentially start at 1, and refer to unique allocation units (which are unique by the schema object, index id, and type of data - in-row, lob, or overflow) but object_id values are random 31-bit numbers that refer to schema objects. For system objects, the idObj and object_id values are the same, and the idInd value holds the index_id for the system object (though not exactly the same). The idInd values are all 256 for allocation units for user-defined objects.

Since allocation unit IDs are 64-bit, they have enough space to hold 2^31 unique objects, 2^15 indexes, and 2^15 partitions, which is enough not to be limiting in any aspect.

System Page Types Using the Data Page Format

Since the data page format is versatile, it also used by all of the system page types, with the following differences:

System table structures are often hard-coded into SQL server rather than being defined in the database itself. Otherwise you would have a chicken and egg problem, where SQL server would have to know the table format before being able to read the table that stored the table format.
Many system record types do not contain a NULL bitmap. This is different than user-defined tables, where you have to extreme lengths to have the NULL bitmap omitted, such as having a table composed entirely of sparse columns.
System records are not limited to 8060 bytes per page. For example, PFS records are 8092 bytes - the maximum size permitted by the data content area minus the slot array at the end.
Since the table structure is hard coded, system data records do NOT contain a field for the number of columns in the row.
Not every system record on a system data page necessarily has the same structure. For example GAM, SGAM, and IAM pages have two record slots, and record 0 is a different schema than record 1. With user defined tables, every record on the page always uses the same schema, except that columns at the end of the list may be omitted via the number of columns field.
All of the system page types have the special ObjectID value 99. This does not appear in the sys.objects view, and indicates that the page is a system page.

GAM Intervals and GAM / SGAM / IAM / DCM / BCM Pages

SQL server tracks which pages and extents belong to which database objects / allocation units through GAM (Global Allocation Map) intervals. A GAM interval is a range of up to 63904 extents in a database file (3994MB, just short of 4GB, since an extent is 64KB in size). The first GAM covers extents 0-63903. The 2nd GAM covers extents 63904-127807, etc... Each GAM interval in a database file will have a single GAM, SGAM, DCM, and BCM page allocated for it to track space usage and status in that GAM interval. There will also be up to one IAM page per allocation unit in the GAM interval. The GAM, SGAM, IAM, DCM, and BCM pages are all 'synoptic' in the sense that they all have the same definition of a GAM interval in terms of the starting and ending extents.

Note that while you can artificially grow a database to any arbitrary size in pages (up to 16TB), GAM-type pages will not actually be created for an additional GAM interval until non-empty data pages (page type non-zero) are added to that GAM. This prevents the number of IAM pages from getting out of hand in a database in a pathological case where you have hundreds of thousands of allocation units. IAM pages are only created in the GAM if the GAM actually has data for that allocation unit.

As a corollary to this, IAM, data, and index pages are only ever allocated for a database object when there is data added to them. So if you create a table but the table has no rows, no data pages will be created for the table until rows are added - it will be strictly a schema-only change until then.

Each page of the 5 types uses the same format as a data page, and has two slots for two records of different schema. Slot 0 is the header record (94 bytes) and slot 1 is the bitmap record. Slot 0 has three fixed length columns:

40 bytes = unknown use
Start Page - 6 bytes - The PageID:FileID of the starting page for this GAM, usually 1:0 until the size of the database file grows over 4GB.
Single Page Allocation Array - 48 bytes - this is only used for IAM pages. The bytes are 0 for GAM and SGAM pages. This array is a list of 8 PageID:FileID page locators corresponding to the individual pages allocated to the object. As mentioned before, SQL server will prefer to allocate a uniform extent for an allocation unit, but to do that for very small tables would waste space. So, when an allocation unit is first created, only single pages from mixed extents are allocated until the first 8 pages are allocated. After that, uniform extents are allocated instead of individual pages. If not all of the 8 single page slots are used by an allocation unit, then the remaining slots will contain 0:0. The single page allocation array is only used for the first IAM page in a chain: later IAM pages will allocate only full extents. Note also that SQL Server does not go back and automatically defragment the allocation unit by merging the first 8 single pages into a uniform extent when the allocation unit gets large enough - they stay located where they are.

Slot 1 is the bitmap slot, with a single bitmap column of variable length (up to 7992 bytes), holding one bit per extent in the GAM. The bits are stored a byte array of little-endian bytes (the least significant bit in each byte corresponds to the first extent, and the most significant bit to the last extent). Looking at an actual IAM page bitmap will help to illustrate this. This IAM page belongs to an IN_ROW_DATA allocation unit, and in this example it has marked the extents that belong to this allocation unit. Running DBCC PAGE with printOpt = 3 showed the following output:

IAM: Extent Alloc Status Slot 1 @0x15B1C0C2

(1:0) - (1:536) = NOT ALLOCATED
(1:544) - = ALLOCATED
(1:552) - (1:608) = NOT ALLOCATED
(1:616) - (1:624) = ALLOCATED
(1:632) - (1:1016) = NOT ALLOCATED

In this example, the file has 1024 pages total (8MB), and the last extent in the file starts at 1:1016. This allocation unit has 3 uniform extents allocated to it at 1:544, 1:616, and 1:624. DBCC PAGE shows the output this way for brevity - it would be way too verbose to show each extent listed individually and a yes/no value indicating if the extent was allocated to this allocation unit or not. In this example, the corresponding bitmap would like like this in hexadecimal:

00000000 00000000 10600000 00000000

The above bitmap has 128 bits for 128 extents. Extents 68, 77, and 78 = 1 (allocated), and the rest are not allocated. Depending on the page type, a bit for a given extent has the following meaning:

GAM - 1 means that the extent is completely unallocated
SGAM - 1 means that the extent has been allocated as a mixed extent and has some pages unallocated.
IAM - 1 means that the extent has been allocated as a uniform extent to the allocation unit that the IAM page belongs to
DCM - 1 means that the page has changed since the last full backup (useful when doing a differential backup)
BCM - 1 means that the page has had bulk-logged (minimally-logged) operations done to it since the last full backup.

Note that bits in the IAM, GAM, and SGAM pages are all mutually exclusive - for a given extent, a bit should never be set (owned) by more than one mapping page. If a bit is set for the same extent for multiple mapping pages, that indicates database corruption.

A typical database can have hundreds of IAM pages, but only one GAM and SGAM page per 4GB of database file size. Given that the IAM page in the example used only 120 bytes out of 8k, one might ask why it that SQL server stores allocation data in bitmap format rather than as a linked list. It would seem that having hundreds of mostly empty IAM pages is wasteful.

To answer this, we must remember three things:

SQL Server requires IAM pages for separate allocation units to reside on separate pages
The IAM, GAM, and SGAM pages all have the same format
That a 2GB file with 32000 extents would still only need 1 IAM page per allocation unit, and because of the first restriction, a simple list would use little space per page, but would still require 1 page per allocation unit, and so no space would be saved.

Because no space would be saved, and keeping the same format between the IAM, GAM, and SGAM pages simplifies processing, and the IAM pages don't really use much overhead in the database file, there has been no need to optimize the format of an IAM page. This is the mostly likely reason that it still uses a bitmap and not a list.

Finally, note that for the DCM and BCM pages, data is maintained in these pages even if the database has been set to use the Simple recovery model.

PFS Pages

Page Free Space (PFS) pages are data pages with only one data record. This data record has only one fixed length column, no NULL bitmap, and no variable length columns. The single fiixed length column is an 8088 byte array, and it stores one byte per page to track a number of statistics for that page.

The first PFS page is at *:1 in each database file, and it stores the statistics for pages 0-8087 in that database file. There will be one PFS page for just about every 64MB of file size (8088 bytes * 8KB per page). A large database file will use a long chain of PFS pages, linked together using the LastPage and NextPage pointers in the 96 byte header. Note that the PFS pages are NOT synoptic with the other mapping pages. All of the other mapping type pages refer to GAM intervals, but the PFS page range is not any kind of multiple of a GAM interval (8088 bytes vs 7988 bytes) - the intervals are completely separate. Also note that because there is one byte per 8KB page, instead of only one bit per extent like most of the other map pages, there are roughly 64 times as many PFS pages in a large database as there GAM pages.

Each PFS status byte has the following bit fields:

Free Space - bits 0-2: 0 = completely empty (0_PCT_FULL), 1 = 1-50% full (50_PCT_FULL), 2 = 51-80% full (80_PCT_FULL), 3 = 80-95% full (95_PCT_FULL), 4 = 96-100% full (100_PCT_FULL). This field is only used for LOB and heap data pages, it is ignored for all other pages types. This is a speed optimization to be able to find pages with free space to hold LOB data or heap records during an INSERT or UPDATE.
Ghost Records - bit 3: 1 = page contains ghost records. This is a speed optimization to find ghost records rather than having to search the headers of all known data and index pages.
IAM Page - bit 4: 1 = page is an IAM page. IAM pages are allocated or deallocated in this way.
Mixed Page - bit 5: 1 = page belongs to a mixed extent.
Allocated - bit 6: 1 = page is allocated. Note that extents are allocated though the GAM and SGAM pages, but individual pages in a mixed extent are marked as allocated on the PFS page.
Unknown - bit 7: appears to be ignored and is always 0.

Log Sequence Numbers

LSN (Log Sequence Numbers) are stored internally as 10 byte values, and can be displayed in one of two ways - either as three fields in the form A:B:C, or as a 25-digit whole number (positive, no decimal places).

When an LSN value is stored as a 10-byte value, the storage structure is:

m_fSeqNo (A) - 4 bytes
m_blockOffset (B) - 4 bytes
m_slotId (C) - 2 bytes

When displayed as a 25-digit whole number, B and C are padded on the left with zeroes to make them 10 and 5 digits and hold all possible values for a 32 or 16 bit number like this:

ABBBBBBBBBBCCCCC

In the above, A is not left padded with zeroes, but B and C are. Writing an LSN in this way allows you to compare two LSN values with a numeric comparison to find out which one is smaller (and therefore which one happened first).

The meanings of the LSN fields are described here. The A value is the VLF file sequence number - every log record in a given VLF will have a particular m_fSeqNo, and this number increments as each new VLF is added to the log - it is the ID of the VLF. Note that because of this, a log file is limited to no more than 2^31 VLFs.

The B value is the block offset within the VLF, which translates to a byte offset within the VLF of 512 bytes * m_blockOffset to get to a particular block within the log file. The slot ID is the record index within the block, and it always counts from 1 upward.

Datetime Data Type

The datetime data type is a packed byte array which is composed of two integers - the number of days since 1900-01-01 (a signed integer value), and the number of clock ticks since midnight (where each tick is 1/300th of a second), as explored on this blog and this Microsoft article.

This gives the interesting result that a zero datetime value with all bytes zero is equal to 1900-01-01 at midnight. It also tells us that the datetime structure is a very inefficient way to store time (the datetime2 data type was created to address this concern), except that it is excellent at defaulting to a reasonable zero point, and that the date and time parts can be split apart very easily by SQL server.

Note that while it is capable of storing days up to the year plus or minus 58 million, it is limited by rule to only go between 1753-01-01 and 9999-12-31. And note that while the clock ticks part is a 32-bit number, in practice the highest value used will be 25919999. Since the datatime clock ticks are 1/300ths of a second, while they display accuracy to the millisecond, the will actually be rounded to the nearest 0, 3, 7, or 10 millisecond boundary in all conversions and comparisons.

Schema Objects / Type

Schema objects are objects with a permanent name which are addressable using the 4-part syntax server.database.schema.object. Since objects are uniquely identified by name regardless of type, you cannot have two objects with the same name, even if they are different types (view vs. procedure, for example), unless they belong to different schema. SQL Server 2008 defines 26 different object types:

Only 12 of these types originally existed in SQL 7.0.
3 types were newly added in SQL 2000 (all of which were for UDFs).
9 types were newly added to 2005 (mostly dealing with CLR integration, but also including synonyms and service queues).
2 new types appeared in SQL 2008 (plan guides and table types).

Every object type has a char(2) type code:

AF - AGGREGATE_FUNCTION (2005) - A user-defined aggregate function using CLR code.
C - CHECK_CONSTRAINT - A constraint on a single table to validate column values.
D - DEFAULT_CONSTRAINT - A short SQL procedure to run to generate a default value for a column when a new row is added, and the value for the column is not specified.
F - FOREIGN_KEY_CONSTRAINT - A constraint between two tables which enforces referential integrity.
FN - SQL_SCALAR_FUNCTION (2000) - A user-defined-function (UDF) which returns a single value.
FS - CLR_SCALAR_FUNCTION (2005) - CLR based scalar function
FT - CLR_TABLE_VALUED_FUNCTION (2005) - CLR based table valued function
IF - SQL_INLINE_TABLE_VALUED_FUNCTION (2000) - SQL inline table-valued function where the function is a single SQL statement. The advantage of this over a multi statement UDF is that the query optimizer can 'understand' how the inline function works and know how many rows of data to expect ahead of time so that the query optimizer knows what to do without having to use temporary results in a table variable and/or a RECOMPILE or a FORCE ORDER option.
IT - INTERNAL_TABLE (2005) - An automatically created persistent table similar to an index which holds results such as XML nodes, transactional data, service queue messages, etc... These tables will usually have a unique number appended to the name, and are transparent for the most part to your code, except that they take up space in the database, and need to be there to enable the functionality they provide.
P - SQL_STORED_PROCEDURE - A stored procedure which is capable of running any T-SQL, and can return an integer, output parameters, and multiple result sets.
PC - CLR_STORED_PROCEDURE (2005) - CLR based stored procedure
PG - PLAN_GUIDE (2008) - A query hint that can exist separately from the query that it is targeted to optimize, useful when you need to modify the query optimization on a vendor database where you are not allowed to change the objects directly (usually because they might be overwritten during the next upgrade of the product).
PK - PRIMARY_KEY_CONSTRAINT - A primary key on a table
R - RULE - an old-style stand-alone rule which has been deprecated in favor of CHECK_CONSTRAINT starting with SQL 2000.
RF - REPLICATION_FILTER_PROCEDURE (2005) - A replication-filter procedure
S - SYSTEM_TABLE - A system base table
SN - SYNONYM (2005) - An alias name for a database object
SQ - SERVICE_QUEUE (2005) - An asynchronous message queue
TA - CLR_TRIGGER (2005) - CLR based DML trigger
TF - SQL_TABLE_VALUED_FUNCTION (2000) - A user-defined-function (UDF) which returns a table.
TR - SQL_TRIGGER - DML trigger
TT - TYPE_TABLE (2008) - A user defined table type which can be passed as a parameter to a function or procedure
U - USER_TABLE - A user-defined table
UQ - UNIQUE_CONSTRAINT - Unique constraint on a set of columns
V - VIEW - A user-defined view
X - EXTENDED_STORED_PROCEDURE - An extended stored procedure

In addition to these types, SQL Server 2012 (Denali) will define a new object type called SEQUENCE, which makes generation of identity columns more adjustable by centralizing the logic in database object.

File Header Page

The first page in each database file is the file header page, and there is only one such page per file. Reading this page can be done either with the standard DBCC PAGE command, or with the DBCC FILEHEADER command which is also undocumented, see example below:

DBCC FILEHEADER ('[dbname]', [FileID])

DBCC FILEHEADER ('testDB', 1)

The file header page has a format identical to a data page, but the table is a system table whose schema seems to be hard-coded into SQL server, rather than defined in any other table. This data page has only a single row, and most of the columns are in the variable length columns section.

The page/tornBits checksum in the page header acts as a security checksum for this page - if you try to manually edit the header data in a binary editor and then attach the file in SSMS, the UI will tell you that the file is not a primary database file.

The first 16 bytes of the file header are fixed length (16 bytes). The remaining fields are all optional / variable length using the same structure as the variable length columns section of a data row. Therefore, rather than listing the offsets of each of these fields, the field index instead is given (there are 46 such fields in SQL 2008 R2, numbered 0x0 - 0x2d. To lookup the position of field N, read the offset at 0x12+N*2 and the offset just before it, to get the start to end position of the field in the file header.

The fixed length file header fields, followed by the offsets for the variable length fields are as following:

0x00: 2 bytes - value 0x0030
0x02: 2 bytes - value 0x0008
0x04: 4 bytes - value 0
0x08: 4 bytes - value 47
0x0C: 4 bytes - value 0
0x10: NumberFields - 2 bytes - count of number fields
0x12: FieldEndOffsets - 2*(NumFields) - offset to the end of each field in bytes relative to the start of the page. The last offset is the end of the overall file header structure

The variable length fields are (by column index):

0: BindingID - 16 byte GUID - used to make sure this file is really part of this database, according to Paul Randal (but how?)
1: ?? - 0 bytes
2: FileIdProp - 2 bytes - the FileID value for this file
3: FileGroupID - 2 bytes - not sure what this is, value usually 1
4: Size - 4 bytes - number of 8KB pages in this database file
5: MaxSize - 4 bytes - maximum number of pages allowed for this file (-1 for unlimited size).
6: Growth - 4 bytes - the number of pages to automatically grow the database file by, or a whole number percent value for auto-growth, depending on Status bit 0x100000.
7: Perf - 4 bytes - value 0
8: BackupLSN - 10 bytes
9: FirstUpdateLSN - 10 bytes
10: OldestRestoredLSN - 10 bytes
11: FirstNonLoggedLSN - 0 bytes
12: MinSize - 4 bytes
13: Status - 4 bytes - bitmask for state of the database: 0x2 = regular database file. 0x100000 = grow the database file by a percent rather than by a fixed number of pages.
14: UserShrinkSize - 4 bytes
15: SectorSize - 4 bytes - the disk sector size in bytes, usually 512 bytes.
16: MaxLSN ? - 10 bytes
17: MaxLSN structure - 28 bytes - MaxLSN (10 byte LSN, repeated), zero value (??), MaxLsnBranchID - (16 byte GUID)
18: FirstLSN - 10 bytes
19: CreateLSN - 10 bytes
20: DifferentialBaseLSN - 10 bytes
21: DifferentialBaseGuid - 16 byte GUID
22: FileOffsetLSN - 10 bytes
23: FileIDGuid - 16 byte GUID
24: RestoreStatus - 4 bytes
25: RestoreRedoStartLSN / RedoStartLSN - 10 bytes
26: ?? - 0 bytes
27: LogicalName - 2*num chars - the logical name of this database file
28: RestoreSourceGuid - 16 byte GUID
29-34: HighestUpdateLSN, ReplTxfTruncationLsn, TxfBackupLsn, TxfLogContainerSize, ??, ?? - 0 bytes
35: MaxLsnBranchID - 16 byte GUID
36-37: SecondaryRedoStartLSN, SecondaryDifferentialBaseLSN - 0 bytes
38: ReadOnlyLSN - 10 bytes
39: ReadWriteLSN - 10 bytes
40: ??, 28 bytes, value 0
41: RestoreDiffBaseLSN - 10 bytes
42: RestoreDifferentialBaseGuid - 16 byte GUID
43: RestorePathOriginLSN, RestorePathOriginGUID - 28 bytes (?)
44: 8 bytes
45: remaining bytes - padding?

Boot Page

Each database has a single boot page at 1:9, which has much the same format as the file header page - a data page with a single record. However, this record has no NULL bitmap or variable length column data - it is composed entirely of fixed length columns, which makes parsing it very easy. Here are the columns by offset relative to the start data record:

0x4: dbi_version - 2 bytes - the file version of the database. For example, this is 661 for SQL 2008 R2. In general, database files can be loaded by any version of SQL server higher than the version that they were first introduced. Restoring an earlier version of a database file will first convert it to the most current version that the storage engine uses. Thus, you cannot take a database file from a later version of SQL server and load it into an earlier version of SQL server. However, there are other ways to move data, such as replication and T-SQL statements.
0x6: dbi_createVersion - 2 bytes - the file version of the database when it was first created.
00x8: ?? - 28 bytes - value = 0
0x24: dbi_status - 4 bytes
0x28: dbi_nextid - 4 bytes
0x2C: dbi_crdate - 8 bytes - the date that the database was created (not the original creation date, but the last time it was restored).
0x34: dbi_dbname - 128 bytes - nchar(128) - the name of the database in Unicode, padded on the right with space characters.
0x134: ?? - 4 bytes, value = 6
0x138: dbi_dbid - 2 bytes - the database ID. Since this is only 2 bytes, there can be only 2^15 databases per SQL instance.
0x13A: 2 bytes, value = 0x64
0x13C: dbi_maxDbTimestamp - 4 bytes
00x140: ?? - 16 bytes - value = zero
0x150: dbi_checkptLSN - 10 bytes
00x15A: ?? - 2 bytes - value = 6
0x15C: dbi_differentialBaseLSN - 10 bytes
0x166: dbi_dbccFlags - 2 bytes
0x168: ?? - 24 bytes - value 0
0x180: ?? - 4 bytes, value 0x2682 = 9858
0x184: ?? - 4 bytes - value 0
0x188: dbi_collation - 4 bytes
00x18C: ?? - 12 bytes, value 0 with one 0x61 byte
0x198: dbi_familyGuid - 16 bytes
0x1A8: dbi_maxLogSpaceUsed - 4 bytes
00x1AC: ?? - 16 bytes - value 0
0x1BC: dbi_recoveryForkNameStack - 56 bytes - array of two 28-byte entry records = entry[0..1]. Each entry holds an LSN, two bytes of padding = 0, and a GUID value.
0x1F4: dbi_differentialBaseGuid - 16 bytes
0x204: dbi_firstSysIndexes - 6 bytes - the FileID:PageID page locator for the first sysindex page - this is the first data page of the sysallocunits (objectID = 7) table, which is the global table directory that lists all the allocation units in the database by their ObjectID, and theit first data page and IAM page. In other words, this is the table that is the entry point to describe all other tables and indexes in the database.
0x20A: dbi_createVersion - 2 bytes - repeated ?
0x20C: ?? - 12 bytes - value 0
0x218: dbi_versionChangeLSN - 10 bytes
00x222: ?? - 94 bytes - value 0
0x280: dbi_LogBackupChainOrigin - 10 bytes
0x28A: ?? - 26 bytes, value 0
0x2A4: dbi_modDate - 8 bytes
0x2AC: dbi_verPriv - 4 bytes
00x2B0: ?? - 4 bytes, value 0
0x2B4: dbi_svcBrokerGUID - 16 bytes
00x2C4: ?? - 28 bytes, value 0
0x2E0: dbi_AuIdNext - 8 bytes
00x2E8: (padding) - 700 bytes

Instance Header

Each SQL server instance has one instance header page at 1:10 in the first database file of the master database. Todo... continue analysis of this page...

Dynamic Management Views (DMVs) and System Base Tables

SQL Server 2005 and higher no longer allow direct access to view system base tables. Instead DMVs (system views) are supplied that query these tables and show a relevant view of these tables to users. However, DBCC PAGE can still be used even in SQL 2008 to view the contents of the system base tables directly.

Microsoft describes the purposes of the various system base tables here.

System Table: sysallocunits (ObjectID = 7)

The sysallocunits table is the entry point containing the metadata that describes all other tables in the database. The first page of this table is pointed to by the dbi_firstSysIndexes field on the boot page. The records in this table have 12 fixed length columns, a NULL bitmap, and a number of columns field. Most of the columns in this table can be seen through the DMVs sys.system_internals_allocation_units, sys.partitions, sys.dm_db_partition_stats, and sys.allocation_units. The names in parenthesis are the columns as they appear in the DMV. The structure of this table is as follows:

auid (allocation_unit_id / partition_id) - 8 bytes - the unique ID / primary key for this allocation unit. The top 16 bits of this ID is 0-4, 255 or 256. The mid 32 bits is the Allocation Unit's ObjectID (which is an auto-increment number), and the lowest 16 bits appear to always be 0.
type (type) - 1 byte - 1 = IN_ROW_DATA, 2 = LOB_DATA, 3 = ROW_OVERFLOW_DATA
ownerid (container_id / hobt_id) - 8 bytes - this is usually also an auid value, but sometimes not.
status - 4 bytes - this column is not shown directly in any of the DMVs.
fgid (filegroup_id) - 2 bytes
pgfirst (first_page) - 6 bytes - page locator of the first data page for this allocation unit. If this allocation unit has no data, then this will be 0:0, as will pgroot and pgfirstiam.
pgroot (root_page) - 6 bytes - page locator of the root page of the index tree, if this allocation units is for a B-tree.
pgfirstiam (first_iam_page) - 6 bytes - page locator of the first IAM page for this allocation unit.
pcused (used_pages) - 8 bytes - used page count - this is the number of data pages plus IAM and index pages (PageType IN (2,10))
pcdata (data_pages) - 8 bytes - data page count - the numbr of pages specifically for data only (PageType IN (1,3,4))
pcreserved (total_pages) - 8 bytes - reserved page count - this is the total number of pages used plus pages not yet used but reserved for future use. Reserving pages ahead of time is probably done to reduce locking time when more space needs to be allocated.
dbfragid - 4 bytes - this column is not shown in the DMV

System Table: syschobjs (ObjectID = 34)

The sysschobjs table is the underlying table for the sys.objects table. It has a NULL bitmap and one variable length column. Note that there is also a sys.system_objects table with object IDs less than 0, but the objects shown in that view are not in this table.

id (object_id) - 4 bytes - the unique ID for the object. This will be the same as the allocation unit's object ID for system objects, but otherwise it will be a random number between 100 and 2^31.
name (name) - the name of the object, Unicode, variable length. The (name, schema) must be unique among all objects in a database.
nsid (schema_id) - 4 bytes - the schema ID of this object.
nsclass - 1 byte - this is not shown in the DMV
status - 4 bytes - this is not shown in the DMV
type (type) - 2 bytes, char(2) - this is the type of the object
pid (parent_object_id) - 4 bytes - if this object belongs to a table, then this is the ObjectID of the table it belongs to.
pclass - 1 byte
intprop - 4 bytes
created (create_date) - 8 bytes, datetime - the time the object was first created.
modified (modify_date) - 8 bytes, datetime - the time the schema for this object was last modified.

System Table: syscolpars (ObjectID = 41)

The sys.syscolumns and sys.columns DMVs show data in the syscolpars table, which is a list of every column defined in any table, view, input parameter to a procedure or function, or output column of a table valued function in the database. The columns in parenthesis are what the sys.sycolumns view shows.

id (id) - 4 bytes - the ObjectID of the table or view that owns this object
number (number) - 2 bytes - for functions and procedures, this will be 1 for input parameters, it is 0 for output columns of table-valued functions and tables and views
colid (colid) - 4 bytes - the unique id of the column within this object, starting at 1 and counting upward in sequence. (id, colid, number) is unique among all columns in the database.
name (name) - variable length, nvarchar - the name of this column.
xtype (xtype) - 1 byte - an ID for the data type of this column. This references the system table sys.sysscalartypes.xtype.
utype (xusertype) - 4 bytes - usually equal to xtype, except for user defined types and tables. This references the system table sys.sysscalartypes.id
length (length) - 2 bytes - length of this column in bytes, -1 if this is a varchar(max) / text / image data type with no practical maximum length.
prec (prec) - 1 byte
scale (scale) - 1 byte
collationid (collationid) - 4 bytes
status (status) - 4 bytes
maxinrow - 2 bytes
xmlns - 4 bytes
dflt - 4 bytes
chk - 4 bytes
idtval - variable length?

Things to note: colid is a 16 bit number, meaning that a table can have no more than 2^15 columns, and a procedure or function can have no more than this many parameters (though in practice the limit is far below this many).

System Table: sysobjvalues (ObjectID = 60)

The sysobjvalues system base table is a general storage area for metadata about an object which is not set based (that is to say, scalar single valued attributes on an object). For object types such as functions, procedures, views, and default constraints, the most important such attribute is the T-SQL code which is the 'guts' of the object and defines how it runs.

Previously, these attributes used to be stored in the system table 'syscomments', and there is now a DMV called sys.syscomments which provides this same view for backwards compatibility. However, this DMV is only for compatibility with old code, and new code should use the DMV sys.sql_modules to get details about 'programmable' object types.

The sysobjvalues table has a NULL bitmap and two variable length columns:

valclass - 1 byte
objid (object_id) - 4 bytes
subobjid - 4 bytes
valnum - 4 bytes
value - 6 bytes - seems to have object specific data indicating how the imageval should be interpreted
imageval (definition) - image (up to 1GB). For non-CLR procedure, function, view, trigger, and constraint object types (C, D, FN, IF, P, PG, TF, TR, V), this is the T-SQL code of the object as varchar data.

One thing that we can see from this definition is that the body of a procedure or function is not practically limited in size, since imageval can be up to 1GB, and very complex expressions are possible in the T-SQL even of simple default and check constraint objects (though this would usually be bad for performance).

System Table: sysscalartypes (ObjectID = 50)

The sysscalartypes table holds a row for every built-in and user-defined data type. It can be read using the DMVs sys.systypes and sys.types. It has the following columns:

id - 4 bytes - the unique id for this built-in type or UDT.
schid - 4 bytes - the schema that owns this data type.
name - nvarchar - the name of the data type.
xtype - 1 byte - the same as the xtype values in the syscolpars table - equal to the id for built-in types.
length - 2 bytes - the length of the data type in bytes
prec - 1 byte
scale - 1 byte
collationid - 4 bytes
status - 4 bytes - status flags about the type. If 0x1 is set, the type does not allow NULLs (default is to allow NULLs)
created - datetime, 8 bytes
modified - datetime, 8 bytes
dflt - 4 bytes
chk - 4 bytes

System Table: sysidxstats (ObjectID = 54)

The sysidxstats system table has one row for every index or heap in the database, including indexes on tables returned with table valued functions and user defined table types. It is visible with through the DMV sys.indexes and sys.sysindexes. This table is clustered by (id, indid). This table also includes automatically created statistics on columns that don't have indexes - these statistics have automatically generated names of the form _WA_Sys_[ObjectID-as 8 digit hex number]_[random 8 digit hex number] - the query optimizer uses them to select the best excution plan when a join includes unindexed columns that only have statistics. The format of this table is:

id (object_id) - 4 bytes - the object_id of the table or view that this index belongs to
indid (index_id) - 4 bytes - the index_id (1 for the clustered index, larger numbers for non-clustered indexes)
name (name) - nvarchar - the name of the index. This will be programmatically generated for function generated table and user defined table types. It will be NULL for heaps.
status - 4 bytes - Note - this is NOT the same as the column sys.sysindexes.status. 0x10 = pad index turned on (is_padded).
intprop - 4 bytes
fillfact (fill_factor) - 1 byte - the fill factor for the index in percent (0-100), defaults to 0.
type (type) - 1 byte - 0 for heap, 1 for clustered index, 2 for non-clustered index
tinyprop - 1 byte
dataspace - 4 bytes - appears to be 1 (PRIMARY) for permanent tables, and 0 for return tables of functions and user defined table types (NO data space).
lobds - 4 bytes
rowset - 8 bytes - appears to be the sysallocunits.container_id that this index belongs, visible through the DMV sys.partitions.hobt_id.

Todo... analyze

System Table: sysiscols (ObjectID = 55)

The sysiscols is a table with only 22 bytes of fixed length data per row, and it defines all indexes and statistics in system and user tables, both clustered and non-clustered indexes. Heaps (index id = 0) are not included in this table. The data in this table is visible via the DMV sys.index_columns (indexes only, statistics columns are not included). This table is clustered by (idmajor, idminor, subid):

idmajor (object_id) - 4 bytes - the object_id of the object that the index is defined on
idminor (index_id) - 4 bytes - the index id or statistics id for each object_id
subid (index_column_id) - 4 bytes - when an index contains multiple columns, this is the order of the columns within the index. For statistics, this appears to always be 1
status - 4 bytes - bit mask: 0x1 appears to always be set, 0x2 for index, 0x4 for a descending index column (is_descending_key).
intprop (column_id) - 4 bytes - appears to the column (syscolpars.colid)
tinyprop1 - 1 byte - appears to be equal to the subid for an index, 0 for a statistic.
tinyprop2 - 1 byte - appears to always be 0

Page Verification

Every database has a database level option called PAGE_VERIFY, and it can have one of three possible values - NONE, TORN_PAGE_DETECTION, and CHECKSUM. However, there is one limitation - prior to SQL server 2008, the tempdb database can only use the option NONE (this was corrected in SQL server 2008, where tempdb can use any PAGE_VERIFY option). This option can be set in SSMS through Database, properties, Recovery, Page Verify. The setting can be queried through the sys.databases.page_verify_option_desc view column. It can be set in T-SQL through ALTER DATABASE [dbname] SET PAGE_VERIFY [value]. It is not known yet how the value is stored by the storage engine though, as it does not appear to be written in the file header page.

This database level option sets the checksum algorithm that will be used to verifying that read/writes to/from disk are not the source of corruption. When the setting is something other than NONE, SQL server will use the algorithm to compute a checksum when writing a page to disk, and will write two data fields to the 96-byte header of the page: FlagBits will have a 0x100 (TORN_PAGE_DETECTION) or 0x200 (CHECKSUM) bit set, and the TornBits field will store the checksum. If this page is later read from disk, the FlagBits value will tell SQL server how to verify the checksum. If the FlagBits is set to NONE, or the database option is set to NONE, the checksum comparison will be skipped. Otherwise, the algorithm selected by the page will be used to compute the checksum for the data on the disk and compare to the TornBits value. If the two don't match, then something bad happened between the write and the read, and data got corrupted. SQL server will assume that this corruption was not user generated and was due to a fault in the I/O subsystem, and will raise a critical error telling you that database corruption was detected.

The TORN_BITS_DETECTION algorithm is the weaker of the two, and it was all that was available in SQL server 2000. It computes a 2-bit checksum for each of the 16 512-byte sectors on the page, and joins them all together to get a 32-bit checksum. In contrast, the CHECKSUM algorithm which was added in SQL server 2005 takes a 32-bit checksum where all 32-bits depend on the full contents of the 8KB page.

The CHECKSUM option is the stronger of the two because it has only a 1 in 2^32 chance of failing to detect corruption, whereas TORN_BITS_DETECTION has at best a 1 in 4 chance of failing to detect corruption for a single sector. The CHECKSUM option is slightly more CPU intensive to compute, but in comparison to the delay required to read a page from disk, the overhead is minimal. The CHECKSUM option is the default with SQL server 2005 and higher.

Paul Randal posted an article around common misconceptions about page verification. Once you know how page verification actually works, it is easy to see why these are misconceptions:

Changing a database's PAGE_VERIFY option does not immediately switch all pages to use that option. A page is only updated to a particular verification algorithm when it gets written to disk. It would be nice if SSMS or T-SQL did this for you, but consider what would have to happen for this to be allowed: you would need to confirm that you intended to re-read and re-write all pages to disk, since that would at the very least take a fair amount of time to do and slow down a production database. Also, if any existing pages had corruption, the operation should skip over those pages and issue warnings about those pages - since they would need to be repaired, and the flags about corruption not removed.

Third, consider what PAGE_VERIFY does. It really only protects you from I/O corruption from the point that you turn it on going forward, and only where the corruption is caused by the I/O subsystem (it might not detect errors from bad memory or a bad CPU, for example, but if you have that, then you've got worse problems). It does not tell you anything about corruption that you may already have had in your database before turning the option on. All that page verification does is compare the page contents before and after a write/read cycle to make sure they haven't changed. To detect corruption in general, the DBCC CHECK* commands would have to be run, since they make sure that the database is logically consistent.

So, as a best practice, you should almost always immediately turn on page verification to the maximum possible level for all databases. But don't expect that it will catch corruption which may already be present in the database - it will not - it will only place a guard against future corruption due to I/O errors.

All things considered though, it would still be nice to have that option as a DBCC command to let you immediately 'change the guards' on all pages.
Changing a database from TORN_BITS_DETECTION to PAGE_CHECKSUM does not lose the protection already in place under the lower strength protection, but setting the option to NONE will, because SQL server will ignore the checksums already present on the pages, and will (it is supposed) write new pages without checksums turned on in the FlagBits. Each page remembers what kind of protection it has turned on. If you go from the lower to the higher strength, you will be begin to have a mix of pages with the lower and higher strength protection, with the higher protection added to a page each time it is written to disk. This is the best you can hope for at present, and this is what SQL server does for you.
The page checksum option is not a CRC - it cannot correct single bit errors - it was not designed to do that. Microsoft could have programmed it that way, but I/O corruption is not like network corruption. I/O errors usually cause large sections of data (entire 512-byte sectors) to be changed, and so a CRC with an error correction attempt would be useless.
Page verification does not detect corruption until a page is actually read into memory. It does not detect corruption as soon as the database is attached. After all, if you have a 2TB database file, should SQL server actually make you wait for 5 minutes while it does a page verification on every page in the database before it lets you attach the database? - probably not. And if it did, what would that do for you? I/O errors can happen on any read, not just the first one. If SQL server reads in all the pages, does the checksums, and then flushes them all out of memory again because you don't have 2TB of RAM, then the next time the page is read from memory just before it is used, the verification still has to be done again, because something could have gone wrong in the meantime. Instead, SQL server only does the check as the data is requested. Now, if you wanted to explicitly check all the checksums to verify they were correct, that would probably be handled by a DBCC command run as a special case, or as part of a scheduled maintenance task run after business hours.
The PAGE_VERIFY option for a database is not automatically raised to CHECKSUM when you attach or restore a database into SQL server 2005 or higher. It would be nice if SQL server did this for you, but the reasoning by the development team was that there might be rare cases where the current algorithm was chosen intentionally, and that you as the user might not want it changed, so they chose not to change it for you. However, if you don't have a good reason not to upgrade PAGE_VERIFY to CHECKSUM, do it immediately.

Search This Blog

Microsoft SQL Server Database Structure

Microsoft SQL Server Database File Structure (SQL Server 2008 R2)

Comments

Post a Comment