Saturday, September 24, 2016

Tuning Very Large SQL-Server Databases - Part 1 (Physical Storage Optimization)

This is the first in a series of three SQL-Server best practice tuning guides to help you achieve optimal performance, speed, and efficiency from your Very Large SQL-Server Databases.  Although these techniques will apply to Microsoft SQL Server, they will generally also apply to other relational database systems (RDMBS) like MySQL, Oracle, DB2, Postgres and others.

The series begins with a focus on optimizing physical data-storage requirements (i.e., space reduction), since it is perhaps the most important overall tuning consideration for large databases.

Tuning Very Large SQL-Server Databases — Part 1

SQL-Server Performance Optimization and SQL-Server Speed Optimization should always be a primary consideration during the database design phase, regardless of the size of the relational database — with the exception perhaps of "tiny" or "trivial" databases where future growth is guaranteed not to be an issue. And, very large databases (in excess of a Gigabyte of data, or millions of rows of data in any particular table) definitely require the SQL-Server database designer / administrator to consider the ramifications of proper (or improper) database design. Perhaps this seems obvious, but given my many years of experience and encounters with large corporate databases, what should be obvious is anything but obvious to many developers, designers, and administrators that implement these massive databases.

Most of this discussion applies to all modern Microsoft SQL-Server versions (SQL 2008, 2012, 2014, 2016) as well as versions as early as SQL-Server 7.0, 2000, 2005. The newest versions of SQL-Server, especially the full-featured Enterprise editions, may also offer you additional features of relevance like built-in data-compression. But, even fancy features like data-compression are no substitute for proper design.

Reduce Physical Storage Requirements

The first thing I always make sure of when designing databases, and when tuning large SQL-Server databases for performance, is that physical storage space is not being wasting with data-types that are overly precise or excessively large for their intended function, and that data is properly normalized. If you don't understand the concepts of normalization, please study up on it before reading further (or pretending you can do your job as a SQL-Server developer or DBA).

Keep in mind that this discussion is all to be considered in the context of very large databases, where there will be millions of rows of data in one or more tables. In fact, you may be working with tens of millions or hundreds of millions of rows of data. As such, every byte of storage saved on a per-row basis will potentially result in Megabytes saved (in reduced physical storage), or perhaps even gigabytes saved in the overall table and database size.

I will now present some very simple rules for reducing physical storage requirements and some (storage space) savings-opportunity examples to consider:
  • Do not use NVarchar or NChar (vs. Varchar and Char respectively) unless you need Unicode character support. Using NVarchar (or NCHAR) where a standard Varchar/Char will suffice will double the storage requirements for a column, since Unicode characters map to double-byte pairs (to accommodate all those foreign language characters and extended symbols, etc).

    Savings Opportunity: I have encountered NVarChar use on columns where there is an Alphanumeric "TransactionID" on millions of rows of data, but where that AlphaNumeric value is most certainly only containing the 10 digits and the standard 26 letters of the English alphabet. Instant space savings!
  • Use SmallDateTime instead of DateTime, unless you really need the precision. SmallDateTime fields are accurate to the nearest minute, whereas full DateTime types are accurate to 3.33 millisecond intervals. And, that extra precision comes with the price tag of double the storage — SmallDateTime requires 4 Bytes instead of 8 Bytes for DateTime.

    Savings Opportunities: if you see fields like "AccountingPeriodBeginDate", it should be obvious that an accounting period is going to start on a particular day boundary, and SmallDateTime will more than suffice. I also regularly see fields like "LastChangeDate" on tables in large databases where, upon inquiring, nobody can provide any reason why any accuracy beyond a particular minute is needed, but yet the fields are defined as full DateTime fields.
  • Use the smallest Integer data-types that will safely and comfortably handle all possible values in an integer-data column. SQL-Server provides a few datatypes for this, including bigint, int, smallint, and tinyint. Each has their storage limitations and storage-space requirements. TinyInt takes just one byte, but can only store values ranging from 0-255 (i.e., 2^8 values), whereas SmallInt is two bytes in length, and can store 2^16 values, Int is 4-bytes and holds 2^32 values, and BigInt is 8-bytes and holds 2^64 values (see the pattern?)

    Savings Opportunity: look for columns that have an obviously small number of values, but whose column-definitions are grossly oversized. E.g., if you see a field like (accounting) "Quarter" defined as an Int, TinyInt can easily hold the four possible values (Quarter 1, 2, 3, or 4) and saves 3bytes/row. Or, perhaps you'll see UserID defined as an Int where there's no way in the next 100 years the application will have over 10 or 20,000 users — this sure looks like a good situation to downsize to SmallInt and save half the space!
  • As with Integer types, the money data-types come in a large and small size — Money and SmallMoney respectively; the first of which takes 8-bytes to store, and the latter which takes only 4 bytes. SmallMoney values are limited to roughly +/-214,000.0000 compared to hundreds of trillions for Money.

    Savings Opportunity: if you have per-unit Costs, Charges, or the like stored in your database columns where the price is never going to even begin to approach that $214,000 value, use SmallMoney and cut your storage by half. E.g., if you are storing the price of items sold in your convenient-store locations, or even the price-per-gallon of gasoline you sell, this limitation should suffice for the foreseeable future.
  • Decimal and Numeric data-types provide a bit more complexity when analyzing the proper "best" size for the fields. See the SQL-Server Books Online to size appropriately, since the maximum total number of decimal digits that can be stored, both to the left and to the right of the decimal point, will determine the storage-bytes required. You want to size your columns in a way that they will always be large enough to hold the values you will place in them, without over-sizing them.
  • Variable-length fields where fixed-length should be used. Varchar fields require 2 extra bytes in addition to the the defined maximum field-width in order to store a value which holds the characters (used) count. Varchars are great for when there really is heavy variation in the actual length of data being stored. But, they do not make sense if you'll always be filling all bytes in the field (or even all bytes minus two).

    Savings Opportunity: perhaps one of the most ridiculous things you will encounter is a one or two character field that is defined as Varchar(1) or (2). This is one sure way to make a one-character field take 3-bytes of storage per row. I have seen this repeatedly with things like "Flag" fields or "Code" fields people design into table structures. I personally think the RDBMS should not allow a user to even define Variable-width columns of just one or two characters.

    Note: SQL-Server has one oddity to watch out for to ensure that Char fields are truly treated as fixed-length fields — as stated in the Books Online: "If SET ANSI_PADDING is OFF when either CREATE TABLE or ALTER TABLE is executed, a char column that is defined as NULL is handled as Varchar"... just so you are aware (note: I find this wacky to say the least - if I define something as Char, I really mean Char and not Varchar)
  • Look for obvious cases where proper data-normalization is being ignored. Quite often, you'll encounter massive data-repetition in tables with millions or tens/hundreds of millions of rows. The key to speed is to make that repetitive data take up as little space as possible (without employing some data-compression algorithm that is going to kill performance on save/retrieves - in case you were thinking about such a thing).

    Savings Opportunity: start by looking for just plainly blatant ignorant denormalization. I have seen so many massive tables where things like "LastUpdateUser" is a 10 or 20 character (if not Unicode character) field in databases where those last-update "Users" are a finite group of individuals that can be counted on both hands. Talk about a waste! If you encounter this type of situation, it's time to create a "UpdateUsers" table (or whatever) with an Integer primary-key (in this example, a TinyInt) and a field called "UserName" or whatever, and replace the "LastUpdateUser" on the massive-rows table with "LastUpdateUserID" that is a TinyInt column (foreign-key to the new UpdateUsers Table). This will save a ton of space! And, there is nearly zero cost in joining to such a small table later if you really need to get the name associated with the ID.

    Another Savings Opportunity: there is another technique I have employed to tune very large tables (in excess of 50 million rows/table) that may get a bit tougher to initially grasp. But, this technique can be a huge space and processing-time saver. It involves a bit of data-analysis, and thinking "out of the box". Basically, I will look for data-patterns that repeat, and whose unique (pattern) values are finite though not instantly apparent — especially when the patterns span multiple columns on the same table. I've routinely discovered that the unique values from a combination of few fields may only number in the thousands, even though the total table rows exceeds 50 or 100 million. When I find these, and establish the reason for the pattern (and ensure my logic will accommodate future conditions too), I will pull the unique field-combinations into a "lookup" table and create a small FK (SmallInt or Int quite often) to point to those combinations (and replace the column(s) whose data I have moved to a lookup table). Why? I'll get into this in the next section about why to do any of this space-tuning, but the bottom line is speed and efficiency. If you have a particular situation you need tuning help with, contact me.

If you begin with the above space-savings guidelines, chances are you'll find a way to make some significant storage-space reductions in your SQL-Server databases. And, you will have made the first steps towards performing some serious speed-optimizations in your database as well.

One note of caution when making any database changes: be sure you understand the ramifications beyond the scope of the database. Many of the changes I've outlined will be more or less transparent to other applications, depending on what (languages) the applications are developed in, and how the applications access the data (note: I always recommend that applications do not have direct access to the data-tables, but rather access data through stored procedures — for many reasons, one of which is to allow simplified maintenance to the underlying database tables without impacting other applications).

If you don't know what side-effects your changes could cause for Graphical User Interfaces (GUI applications) or reporting applications, and so on are, then be sure to involve someone that does. E.g., When changing Int fields to SmallInt, a Borland (aka CodeGear) Delphi based program most likely won't care, but C# data-bindings may have issues on an ASP.NET page or such. So, be careful. From what I have seen, Delphi as well as C# all seem fine with SmallDateTime in place of full DateTime fields in many cases, but it can also depend on how customized an application is.

Computing / Estimating Storage Requirements

To realize how much potential for storage-space savings there is, you will want to be able to quickly rough-out some approximations based on the columns in table(s), the number of bytes used per-column, and the number of rows per table.

In theory, you should be able to quickly estimate a Table's (physical storage) size just by knowing the number of rows, and multiplying that by the total of all columns' storage requirements. This can be put into a simple formula something like this (I didn't put ALL data types in here, but you should get the idea), where you just multiple the number of rows in a table by:
(NVarChar-field-widths-in-aggregate * 2 + #-of-Nvarchar-fields * 2) +
(VarChar-field-widths-in-aggregate + #-of-varchar-fields*2) +
(NChar-field-widths-in-aggregate * 2) +
(char-field-widths-in-aggregate) +
BigInts*8 + Int's*4 + SmallInts*2 + TinyInts*1 +
DateTimes*8 + SmallDateTimes*4 +
Money*8 + SmallMoney*4 +
(Total-bit-fields-DIV-8 + 1)
This calculation does not take into consideration Keys or Indexes, but it will give you the approximate data-space requirements. And, it's a good way to quickly estimate reductions in storage space too. For the data-types I didn't cover, see the Microsoft SQL-Server books-online, which give a description of how much each data-type takes for storage - see the "Data Types" (Transact SQL) section.

Continue with Part 2 of SQL-Server Database Tuning now.

by Mike Eberhart (originally published: May-2007; updated Sep-2016)

Continue to read this Software Development and Technology Blog for computer programming, software development, and technology Techniques, How-To's, Fixes, Reviews, and News — focused on Dart Language, SQL Server, Nvidia CUDA, VMware, TypeScript, SVG, other technology tips and how-to's, and my varied political and economic opinions.

No comments: