The Physical Design, Specify your database implementation before moving forward

After 4 months of articles about data modeling, we're getting close to a functional database. I started this data modeling series with "Data Modeling" (April 2000), which discussed how to gather project requirements. I followed up with "Process Modeling" (May 2000), which reviewed process modeling and demonstrated my own variation of data-flow diagramming to illustrate what happens as data moves through the system. Then in "Entity Modeling" (June 2000), I developed a concept model, or entity relationship diagram (ERD), of the database. Last month, in "Logical Modeling" (July 2000), I translated that ERD into a logical model, which is a closer representation of the evolving database. This month, I develop a physical design that takes us one more step toward a working database.

A physical design is a specification for database implementation. At this phase of development, you must know the database platform you're going to use—perhaps SQL Server 7.0 on Windows NT Server 4.0, Microsoft Access on the desktop, Oracle on the mainframe, or some other platform. To create a physical design, you pull together all the specifications and models you've created so far, then modify and optimize them for your target platform. For example, you need to modify all column properties, including data types, so that they're specific to your target environment. You can add extra columns that don't appear in the conceptual or logical models, such as flag columns and timestamp columns that facilitate data processing. You also need to size the database, analyze data volume and use, incorporate any replication or distribution plans, and select candidate columns for indexing. If you're thinking ahead, you'll also determine user roles (groups), logins, and security permissions; requirements for data retention (archiving plans); and failover and backup and recovery plans.

Creating a Physical Design

The physical design is a composite of models that, taken together, form a complete or near-complete specification for implementing a database. So what does a physical design look like? Part of it, the piece we ordinarily call the "physical model," looks like an expanded ERD, as Figure 1, page 62, shows. Other parts, such as the Data Volume Analysis model and the Security Matrix, also look familiar.

The physical model in Figure 1 is an expanded version of the conceptual model (or ERD) that I created a few months ago. I carefully reviewed the logical model that I built last month, presenting it to colleagues, critics, and customers for feedback. When I found discrepancies between the conceptual and the logical models or found that the logical model wouldn't support the requirements, I adjusted the conceptual model accordingly and forward-engineered to a new version of the logical model.

If you compare the ERD from June's "Entity Modeling" with the physical model in Figure 1, you'll notice a few differences. The biggest change is that the physical model doesn't have EMPLOYEE and AUTHOR tables, which are subtype entities of PERSON on the ERD.

I initially categorized authors as having attributes that referenced how they were to be paid (PayToOrderOf, PayToTaxID, PayToAddress) and employees as needing only social security numbers (SSNs). On further investigation, I found that some employees were paid by TaxID and some authors were paid by SSN. In addition, one employee wanted his paycheck made out to someone other than himself and sent to an address other than his home address. I learned that, in reality, virtually no differences exist between the attribute sets for AUTHOR and those for EMPLOYEE. Thus, in the physical model, I combined the AUTHOR and EMPLOYEE tables into the PERSON table and added a subtype discriminator of PersonType, which carries the value "A" if the person is an author, "E" if the person is an employee, or "AE" or "EA" if the person is both (exact usage will be up to the client).

In the physical model, I also added three columns to each table: LastUpdate (defaults to getdate()), ByWhom (defaults to system_user), and Timestamp (timestamp data type). The physical model specifies foreign key columns, unlike the conceptual model, which just implied foreign keys. In addition, for the physical model, I converted the portable data types from the ERD and the logical models to SQL Server-specific data types. Notice that some columns have an indexing indicator next to them, identifying them as good candidates for indexing.

The Data Volume Analysis Model

When you have your physical model in place, you're ready to use the Data Volume Analysis model to calculate the amount of physical storage your database will require.Figure 2 shows the Data Volume Analysis model, which looks like an abbreviated ERD except that it has no attributes, only record counts.

The record counts are estimates, based on the average number of records that you expect to occur and that you expect to store in the database. You should already have this information, either gleaned from your clients or from your own business experience. For example, you know that you have 50 employees and about 150 authors to manage. In the next 3 years, you expect to double the number of employees and quadruple the number of authors. That brings the total number of people records that you'll need to store in your database to about 700. You anticipate carrying information about 10 publishers in the PUBLISHER table, and that number should stay stable or increase only slightly over 3 years. Thus, the PERSON_PUBL table, which associates people and publishers, will contain 7000 records.

To estimate how much disk space you'll need for that many records, multiply each table's record count by the total size of the record in bytes. You can get the record size in bytes by adding up the field sizes in the physical model. For example, each record in the PERSON table is 400 bytes long. With SQL Server 7.0, you need to add at least 10 bytes per record to account for administrative overhead. You can plug the numbers into this little formula:

                              (RecSizeBytes * EstNoRecs) / BlockSize =  Num_Blocks_Storage

Using a block size of 8096 bytes for SQL Server 7.0 and rounding up to the next integer value (SQL Server doesn't let you store part of a record on a block), you'll need 36 blocks to store the data in the PERSON table. To give you a starting point for figuring out how much disk space you'll need (and whether you want to use file groups or striped sets to enhance performance), perform the following calculations for each table. To get the total space requirements for the user data, add the number of blocks needed for each table in the physical model. Add 10 percent of this total number of blocks for the system and database catalogs (start with a minimum of 25MB), and add 35 percent of the user data size for the transaction log. Then add space for indexes, various log files, snapshot and archive relations, and database backup files. This additional space might be as little as 20 percent of the user data size, or as much as 100 percent, depending on how many indexes you want to create, how many snapshot and archive relations you plan to set up, and so on. The final value will help you select the best physical storage devices and estimate the cost of both immediate and long-term data storage. By knowing roughly how much user data you'll have, you can also better estimate the type and cost of your backup and recovery schemes.

Data Use Analysis

The Use Path Analysis model, or Transaction Map, lets you visually map transaction processes and estimate read/write activity against the database. You already know what the processes are because you've developed a process model (the data-flow diagram—DFD—from May's "Process Modeling"). You take each major process from the DFD and create a Transaction Map from it, using the tables from the physical model to help you itemize the number of read and write operations per transaction. Figure 3 shows a Transaction Map for generating invoices and packing slips for publications shipped to stores. You start reading the map at the entry point, Publication. The company ships one PUBLICATION (name, volume, ISSN) to one STORE, resulting in one PUBLICATION_STORE shipping record. The output is one invoice/packing slip. The most meaningful numbers are at the bottom of the Transaction Map. Per transaction, you have three read operations and one write operation. An estimate of average per-hour activity is 100 transactions, whereas peak per-hour activity is 1000 transactions. (You know this information from the requirements analysis.) The total number of reads per average activity period is 300, and the total number of writes is 100. Peak periods put more load on the server, which will average 3000 reads and 1000 writes per period.

You need to create a Transaction Map for each major process. Analyze the numbers and the tables you're using as input. Compare predicted activity with the numbers from the Data Volume Analysis model to confirm plans for using file groups, striped sets, or some other file-placement approach. After building your Transaction Map models, you should have a handle on server load (peak-period read/write activity) and job mix (to avoid contention between two processes vying for the same input data). You should also know the kind of server hardware you need.

Authorization Plans, Et Cetera

Think you're done building the physical design? Not yet. No physical design is complete without at least a rudimentary plan for table permissions. "The Security Matrix" (March 2000) introduces the concept of authorization matrixes, which identify who belongs to which roles and which roles have particular permissions. This visual technique is easy to interpret and easy to maintain, and it documents your authorization plans. In our publishing company scenario, you can use the security matrix to specify, for example, that the InfoSystems and Developers roles have read, write, and references permissions on the PERSON table. You can give the Human Resources and Managers roles read/write permissions and Editorial Staff, Production, and MarketingSales roles read permissions only. Use the matrix to assign permissions for each table in the database.

You also still need to plan for data retention. How long are you going to keep data that you've captured in the database? You need to base your data retention plans on the information you gathered in the requirements analysis. Companies usually keep some tables indefinitely, archive other tables periodically, and delete other tables completely after a certain time period.

Before you implement your database, you need to devise a disaster-avoidance plan. Many people call this a disaster-recovery or failover plan, but I prefer to be proactive on the issue, so I call it a disaster-avoidance plan. Now, before database implementation, is when you need to put the basics in place. An integral part of disaster avoidance might be using a RAID array for your database or mounting your database on a clustered server. In addition, you want to plan your backup and recovery schemes and allocate budget, staff, and whatever else you need to make sure that these backup and recovery functions will be in place when you're ready to implement your design. Last, you need to incorporate any database replication or distribution processes into your plan so that you're prepared to implement replication or a distributed database solution.

The physical design is your last chance to check your database model against the business requirements before you push to a live database. Take the time to make sure you've addressed all your database's physical implementation needs and requirements before you start implementing the design.