Category Archives: Basics

Learning the Relational Model – Implementation of Relational Model (Part IV)

In this fourth part of the series we will briefly discuss how the relational model has been implemented in relational databases.

We are now familiar with the conceptual relational model. The conceptual model deals with unordered sets. In practical aspect, specifically in computer, we cannot physically represent an unordered set. So a practical implementation is a bit different from the relational model.

  • A named relation (relvar) is implemented as Table. A table has a predefined internal structure. The structure has variance in implementation by the different RDBMS vendors
  • An attribute is formed by a name, data type (can be different for different vendors), and constraints (enforcing domain integrity) and is called a column. The lists of columns form the heading of the relation (table).
  • A tuple is implemented as a row. Each row in a table also has fixed format (still vendor specific) and has some order (“… row has some order” is actually a misnomer. A row is usually identified by a row pointer or RID. Though the pointers or RIDs have some order, due to the implementations of data manipulation statements and data concurrency models, we cannot expect an order for row when we fetch data from a table)

Data is stored in a table and in respective columns as a series of rows.

How do you manipulate the data? How do you save and retrieve information from tables? We have already a seen a language to retrieve information from the relational model; i.e. relational algebra! But for the practical purposes, a new special purpose language is required to query the relational model. So IBM has developed a query language called SEQUEL (Structured English Query Language) based upon the relational algebra for their System R database system that is considered to be the first implementation of Codd’s relational model. This name was later changed to SQL (Structured Query Language).

SQL is still based on relational model and relational algebra (but not fully). However some features are added to the language. For example, based on set theory, the elements in a set are unique. When applying some combinations of relational operators on relations, the resulting relation may contain duplicate tuples. Avoiding duplicates is not practical since this may degrade performance of database applications. So another mathematical concept called Multiset (bag) is introduced to solve this problem. A multiset is very much similar to set; the difference is elements in a multiset may not necessarily be unique. The SQL language is then extend to support multisets and introduced more operators to manage the multisets and other real world functionalities. Some of them are;

  • Rename operator – to rename and attribute (column) name
  • Extended projection operator – new attributes are creating from existing attributes of relations. This can be arithmetic operations or applying function on attributes.
  • Aggregate operator (grouping operator)  –produces new relation by summarizing attributes. Usually this grouping is followed by applying aggregate functions on attributes such as sum, average, count, minimum, and maximum to get new insight of data.
  • Duplicate elimination (δ) – Returns relation by elimination all but exactly one copy of each tuple. SQL is using DISTINCT and UNION language operators to eliminate duplicates
  • Outer join – The outer join is defined in such a way that every tuple from either of the original relations appears in some way in the joined relation. The outer join operator addresses the fact that some tuples in a relation are not represented in a natural join involving the relation.  This is because there is no matching tuple in the relation being joined. This is called ‘dangling tuple’
  • Data modification operators – These are used for inserting, updating and deleting data.

Based on these basic and extended operators, the SQL language is standardized by ANSI and ISO and still expanding to add new language features. Some features include, Common Table Expressions, read-only INFORMATION_SCHEMA tables for metadata (data dictionary) querying, temporal data retrieval etc. Each RDBMS vendor is actually may or may not implement all the ANSI/ISO version of SQL and add their own features to the SQL language implementation for user’s ease-of-use. So this is up to the individual to select a particular RDBMS based on the requirement and the features.

There is no commercial RDBMS product that we can call as truly or fully relational model complaint. There are some implementation challenges that hinder the RDBMS product become fully relational. Hope I can write another post on this subject in the future.

One more important point is relational algebra and SQL is declarative; i.e. you are asking what you want from the data stored instead of how you want to retrieve the data stored. This is a big paradigm change for a procedural language developer. Thinking in sets and relation is important when working with RDBMS and SQL, though procedural elements are included in RDBMS by many vendors in the form of cursors and loops.

Series Summary

  • RDBMS products are built on top of strong mathematical foundations on set and predicate theory
  • By applying basic and extended relational operators we can retrieve many complex information from data stored in a relational model
  • SQL is used to query and manipulate data from RDBMS is based on strong mathematical language called relational algebra and extended to support practical applications
  • SQL is based on multiset (bag) theory and hence SQL is a multiset theoretic language
  • A multiset result can be converted to a set by applying duplicate elimination operators like DISTINCT or UNION. However performance may affect

The below table lists the relational algebra operators and SQL syntax similar to that

 No. Relational Operator SQL syntax
1 SELECT  SELECT * FROM Employee WHERE Dept_ID = 18
2 PROJECT  SELECT JoinDate, F_Name FROM Employee WHERE Dept_ID = 18
3 RENAME SELECT F_Name AS FirstName FROM Employee
4 UNION SELECT F_Name FROM Contract_Emp UNION ALL SELECT F_Name FROM Permenant_Emp
5 INTERSECT SELECT F_Name FROM Employee WHERE Skill = ‘C#’ INTERSECT SELECT F_Name FROM Employee WHERE Skill = ‘SQL’
6 DIFFERENCE/MINUS SELECT F_Name FROM Employee WHERE Skill = ‘C#’ EXCEPT SELECT F_Name FROM Employee WHERE Skill = ‘SQL’
7 PRODUCT SELECT Y.YearID, S.SeasonFROM Season S CROSS JOIN Year Y
8 JOIN SELECT E.Name, E.EmpID, E.DeptName, D.ManagerName FROM Employee E INNER JOIN Dept D ON D.DeptName = E.DeptName
9 DIVISON There is no SQL operator that translates to a DIVISION operator. However we can implement relational division using many SQL statements available. A full discussion of relational division is out of the scope of this article. Read Joe Celko’s excellent article on this subject here (https://www.simple-talk.com/sql/t-sql-programming/divided-we-stand-the-sql-of-relational-division/)
10 Extended Project SELECT F_Name + ‘ ‘ + L_Name AS EmpName, BasicPay * (12.0 / 100.0) AS TravelAllowance FROM Employee
11 Aggregate SELECT DeptName, AVG(Salary) AS AverageSalary FROM EmpSalary GROUP BY DeptName
12 Duplicate removal SELECT DISTINCT Dept_ID FROM Employee or SELECT EmpName FROM Permenant_Emp UNION SELECT EmpName FROM ManagementCouncil

Conclusion

If you learn the relational model and relational algebra and the operators, you can design good databases and make the queries simple and portable. When you execute a query, the RDBMS’ query optimizer is converting the SQL dialect to a relational algebra tree and retrieving the data from the underlying physical tables based on the algebra tree (though I’ve simply said it, the query optimization and creating efficient and simple algebra tree is a big topic and still researches are going on in this area). Hope you enjoyed the series.

Advertisements

Leave a comment

Filed under Basics, Theory

Learning the Relational Model – The Model (Part III)

In the previous two parts we have seen the concept of Set theory and predicate logic. In this third part we will discuss how these two theories combined to form the relational model.

The path to relational model

Here is a fictional conversation between two people met in the cafeteria of the office floor.

“Excuse me! I’m Fred. Do you work in Accounts?”

“No I’m in Purchase Dept.”

“I haven’t seen you before. I’m in Marketing”

“I’m Mark and I just joined last week on 16th as the manager”

“Oh! So how do you feel about the new job?”

… and the conversation continues…

There some data points hiding in the above conversation. There is a set of department like Accounts, Purchase, and Marketing etc. and there is a set of employee like Fred, Mark etc. Every employee is related to any one of the department from the set of departments. Each employee has a name and a data of join.

In natural language we make assertions about entities of interest by statements of fact—or, in logic, by propositions. For example, this is a proposition:

“The employee with ID number 215 is named Mark, works in Purchase department, and was hired on December 16th, 2013.”

Generalized forms of propositions are predicates. For example, this is a predicate :

P: The employee with ID number (Emp#) is named (Name), works in department (Dept#), and was joined on (Joindate).

The four terms in parentheses are placeholders or parameters that correspond to the four values in the preceding proposition. We can substitute values to the parameters. When you substitute parameters with specific values, a predicate reduces to an individual proposition. For the above predicate let me substitute the below parameter values respectively:

 (215; Mark; Purchase; December 16th, 2013)

You can immediately see that the above set of parameters form a tuple! (We have discussed about tuple in Part I). Let us call this set as Employee relation E. The relation E can be a set of tuple as below.

E =       {

            (215; Mark; Purchase; December 16th, 2013),

            (107; Fred; Marketing; April 08th, 2005),

            (183; Bob; Sales; October 10th, 2010),

            }

Each tuple’s values when replaced with the parameters in the above predicate satisfy the proposition or we can say the proposition is true for the above relation E. In other words a relation E expressing the predicate P would be the set of all sets of values which satisfy the predicate truthfully (in our enterprise or universe).

So how do you define your data model? Just describe a business problem, find predicates, and write them down— the data to the predicate parameters forms a set – you have your data model. Now we can relate the set theory and predicate logic to describe data.

 However there are still missing pieces to make the relational model. Is the 215 for the parameter Emp#, a numeric value or a literal text value? Is the Joindate December 16th, 2013 represents a date or a text? So a formal definition of type of each parameter is required now.

 Type and Domain

 The type that usually called as a data type is a classification identifying the various types of data like integer, real valued, string, date, Boolean (representing true or false) etc., the operations that can be done on values of that type; the meaning of the data; and the way values of that type can be stored. For example an int data type in a relational database system represents an exact numeric data type and this type supports addition, subtraction, multiplication, and division on the values of that type.

 In addition to data type, there can be a possible list of value applicable to support by the data type or the business requirement. A data domain refers to all the unique values which a data element may contain. For example the Emp# parameter can have values ranging from 1 to 500 for a medium organization. So type of Emp# is int and domain ranges from 1 to 500.

Type is the basic building block of relational model. Domain is usually confused with the term data type. To make the distinction let us consider an example of Age. The age can be represented using the data type int. However the domain of age for a company can have values ranging from 18, the minimum age for employment and 60, the retirement age.

The parameters Emp#, Name, Joineddate etc. we can call it attribute names. Each attribute name has a data type and a domain. A set of an attribute name, corresponding data type, and associated domain is called an attribute. So an attribute A for storing gender information can be represented by the below set:

 A = { (Gender, CHAR, {‘M’, ‘F’}) }

 In other words an attribute is an ordered set of attribute name, type and domain.

Each attribute is associated with a value. The value can be a scalar value (single value) and can have a complex values. For example the value of attribute Gender will be ‘M’.

A set of attribute is called a heading. So the heading for Employee E can be represented as the below set:

 HE =     {

             (Emp#, int, {1,2,3,4…..498,499,500}),

             (Name, char, {alphabets and space}),

             (Dept, char, {Purchase, Marketing, Sales, IT, Admin}),

             (Joindate, date, {01/01/2002 to 12/31/2022})

            }

Note that the header mathematically represents the predicate.

A set of n-Tuple is called a body: so the body B of employee can be represented by the below set:

B = {E1, E2, ….. En}

A relation consists of a heading and a body. The heading of the relation is also the heading of each of its tuples. A named relation is called a relational variable or relvar.

 Here is a pictorial representation of the relation named R, we have learned so far.

Image 

Strength of the relational model

 Several times I have mentioned that the relational model’s power lies in the mathematical foundations of set theory and predicate logic. From the granular level to higher levels this is visible.

 Attribute           = a set of attribute name, type and domain

Tuple                = a set of attributes and values

Body                = a set of tuples

Heading           = a set of attributes

Relation            = a set of heading and body

Based on predicate logic;

Relation            = AND (^) of tuples;

The employee with ID 215 is named Mark, works in department Purchase, was joined on December 16th, 2013 AND

The employee with ID 107 is named Fred, works in department Marketing, was joined on April 8th , 2005 AND

The employee with ID 283 is named Bob, works in department Sales, was joined on October 10th, 2010 AND

…………

This can also be extended to any level, viz. a tuple is the logical AND of all the attributes and values, an attribute is a logical AND of attribute name, type and domain etc.

 If each tuple represents an axiom (a statement of truth or a fact about our universe) we can combine these axioms with mathematical operators, and using logical rules of inference and we can derive new facts that are mathematically provable.

Extracting information from relational model 

E.F Codd introduced two different languages to query the relational model. One is relational algebra, which is a procedural language and the second one is relational calculus, which is a declarative language. In relational algebra a collection of relational operators are applied to the relations, resulting in another relation. The following are the basic relational operators 

Set Operators 

  1. Union (U) – it takes two relations, say, Bike B and Car C, both of which must have the same set of attributes. The union, B U C, consists of all the tuples in B and all the tuples in C and result in a relation Vehicle V. 
  1. Cartesian Product (X) – Cartesian product of A X B, result in a relation in which all possible ordered pairs (a,b) where a is a member of A and b is a member of B. 
  1. Intersection (∩) – The intersection, A ∩ B, consists of those tuples that are both in A and B. 
  1. Difference (-) – The difference, A – B, consists of those tuples in A which are not in B.

 Relational Operators

   5. Selection (σ) – (also called restrict) operator is used to choose a subset of tuples.

          σDept = Sales (E), will return a relation from Employee E in which Dept is Sales. 

  1. Projection (π) – when applied to a relation, the projection operator do the following
  • Removes all attributes from the relation that do not appear in the attribute list
  • Reorder the attribute based on the order in attribute list
  • Eliminate any duplicate rows 

     πEmp#, Name(Employee), will return a relation from Employee E, in which only the Emp# and Name available in that order 

  1. Join (|><|) – Join is a binary operator that combines Cartesian product and Selection operator in a single operation. The result of A |><| B is combination of tuples a from A and b from B where certain conditions are satisfied. 
  1. Division (/) – Extract rows whose column values match those in the second table, but only returns columns that don’t exist in the second table. Suitable for queries that include “for all”. Example, let A represents all the assignments, C represented completed assignments of students, then C / A represents the students who completed “all” the assignments.

 By combining these basic operators we can retrieve information from relational model.

Next

In the next part of this series we will how the above described concepts are actually implemented in relational database systems.

 

Leave a comment

Filed under Basics, Theory

Learning the Relational Model – Predicate Logic (Part II)

In the first part we have seen the concept of Set theory and relational theory. In this part we are going to learn about the concept of predicate logic.

Part II – Predicate Logic

What is logic?

Logic is a language for reasoning and is a collection of rules we use when doing reasoning. In logic we are interested in true or false of statements, and how the truth/falsehood of a statement can be determined from other statements. There are various types of logic such as logic of sentences (propositional logic), logic of objects (predicate logic), logic involving uncertainties, logic dealing with fuzziness, temporal logic etc.

Propositional logic is logic at the sentential level. The smallest unit we deal with in propositional logic is a sentence. Sentences considered in this logic are not arbitrary sentences but are the ones that are true or false. This kind of sentences is called proposition. For example, “Grass is green”, and “2 + 5 = 5” are propositions. The first proposition has the truth value of “true” and the second “false”. But “Close the door”, and “Is it hot outside?” are not propositions. Also “x is greater than 2”, where x is a variable representing a number, is not a proposition, because unless a specific value is given to x we cannot say whether it is true or false, nor do we know what x represents.

Let P be the proposition “It is snowing”, Q be the proposition “I will go the beach”, and R be the proposition “I have time”, then the English sentence, “I will go to the beach if it is not snowing” is restated as “If it is not snowing, I will go to the beach”. This can be represented as symbols as below:

¬P -> Q

The symbols are called connectives and below is the list of connectives and there meaning.

¬ – NOT

^ – AND

v – OR

-> – IF THEN (or IMPLY)

<->  – IF AND ONLY IFF

Similarly, “It is not snowing and I have time only if I will go to the beach” is restated as “If it is not snowing and I have time, then I will go to the beach”, and it is translated as (¬P ^ R)  -> Q.

Predicate Logic

 

The propositional logic is not powerful enough to represent all types of assertions that are used in computer science and mathematics, or to express certain types of relationship between propositions such as equivalence. For example, the assertion “x is greater than 1”, where x is a variable, is not a proposition because you cannot tell whether it is true or false unless you know the value of x. Thus the propositional logic cannot deal with such sentences. However, such assertions appear quite often in mathematics and we want to do inference on those assertions.

A predicate is a verb phrase template that describes a property of objects, or a relationship among objects represented by the variables.

Consider the following sentences:

  1. “John gives the book to Mary”
  2. “Jim gives a loaf of bread to Tom”
  3. “Jane gives a lecture to Mary”

We can derive a template here. The template is “……gives……to…..”. This template is called a predicate and it describes a relationship among three objects. The predicate can be written as G(x, y, z). The x, y, and z are the variables. So G(John, Book, Mary) represents “John gives the book to Mary and G(Jane, lecture, Mary) represents “Jane gives a lecture to Mary”.

I’m omitted some other aspects of predicate logic like, quantifiers, well-formed formula (wff) etc. since these are not relevant in our discussions on relational model at present.

Next

So we have covered two fundamental theories of relational model viz. set theory and predicate logic. In the next section we will merge these concepts to build the relational model

Leave a comment

Filed under Basics, Theory

Learning the Relational Model – Basic Set Theory (Part I)

This blog series on Relational Model is my attempt to understand the foundation of relational model of database. My intention is to learn and understand how Edgar F Codd designed the relational model, the basis of the model and how this is implemented in the relational database.

Introduction

Edgar F Codd built the foundation of relational model of database above two fundamental mathematical concepts viz. Set theory and predicate logic. The basic set theory is explained in this part of the series.

Sets and Relations

A set is a collection of distinct objects. For example a fruit basket is a collection of fruits and can be represented as B = {Apple, Orange, Grape}. Set of employees in a particular firm E = {Alice, Bob, Charlie}. There can be empty sets as well. For example an empty fruit basket, B = {}.

We know that the algebra of numbers include many operators like addition (+), subtraction (-), multiplication (X), division (/) etc. Likewise algebra of sets includes its own collection of useful operators. Like the operators of arithmetic, some of the set operators combine two sets and yield a set. These are the set operators.

  1. Union

Let A and B be sets with the same universe U. The union of A and B, denoted AB, is the set containing those elements of U that are either elements of A or elements of B (or elements of both). The union of {1, 2, 3} and {2, 3, 4} is the set {1, 2, 3, 4}

  1. Intersection

Let A and B be sets with the same universe U. The intersection of A and B, denoted A∩B, is the set containing those elements of U that are both elements of A and elements of B. The intersection of {1, 2, 3} and {2, 3, 4} is the set {2, 3}

  1. Set Difference

Let A and B be sets with the same universe U. The set difference of A and B, denoted A\B, is the set containing those elements of U that both elements of A and non-elements of B. The set difference {1,2,3} \ {2,3,4} is {1}

  1. Cartesian Product

Let A and B be sets with the same universe U. The Cartesian product of A and B, denoted by A X B, is the set whose members are all possible ordered pairs (a,b) where a is a member of A and b is a member of B. The Cartesian product of {1, 2} and {red, white} is {(1, red), (1, white), (2, red), (2, white)}.

  1. Power Set

Power set of a set A is the set whose members are all possible subsets of A. For example, the power set of {1, 2} is { {}, {1}, {2}, {1,2} }

 How can we apply this to data? What it says is everything can be collected to a set. That is a set of employees, a set of customers, a set of orders, a set of addresses, a set of products etc. If we can define an entity as a set of data, then we can also apply the set operators! For example, we can derive a set of vehicle by applying the union operator on a set of cars and set of bikes. We can find a set of both English speaking and Hindi speaking people by applying the intersect operator on the set of English speaking people and the set of Hindi speaking people. Since set operations are also associative and commutative we can combine the sets in many combinations and get different result sets (I am not going in detail to the associative and commutative properties of sets, since we have already studied this for numbers in high school classes).

Multiset

Based on the definition, set do not allow duplicates. However the idea of multiset has been introduced to allow more than one similar object (or duplicate object). Multiset is also called bag. Multiset and operations on multisets are very much similar to set. I will explain the practical importance of multiset later in this series.

From Set to Relation

A relation in mathematics is an association between various objects. Relation can also be represented by a set. A relation is a set of ordered pair and the pair has certain association. For example, let us consider a set of employee, E = {Alice, Bob, Charlie}, and a set of department D = {Sales, Purchase, Marketing}. So a relationship “is of department”, R can be represented as R = {(Bob, Marketing), (Alice, Sales), (Charlie, Sales)} and this can be read in English as ‘Bob is of department Marketing’, ‘Charlie is of department Sales’ etc. There are two elements in the result set, employee name and department name, a set of double. This can be triple, quadruple, quintuple, sextuple, septuple, etc. to n-tuple. This is generalized to a term called tuple. So in the above relation (Bob, Marketing) is a tuple and (Alice, Sales) is another tuple and so (Charlie, Sales).

Next

In the next part of the series we will see the predicate logic, the next fundamental concept of relational model.

 

Leave a comment

Filed under Basics, Theory

How to decide the precision and scale of a decimal column – Part 3

In this third and final part of the series, I am discussing about the arithmatic operations and rounding (though briefly) on decimals.

When doing arithmetic operations on decimal numbers, you should be careful since the resulting decimal number will have different precision and scale.

For example, execute the below script

DECLARE @D1 AS DECIMAL (3, 1) = 99.9
DECLARE @D2 AS DECIMAL (3, 1) = 99.9
SELECT @D1 + @D2, CAST(@D1 + @D2 AS VARBINARY)

Image

The addition of two decimal numbers having a precision 3 resulted in a decimal number having precision 4 as evident from the hexadecimal conversion of the result. The precision and scale will change dramatically, if the operation involved is multiplication or division as below.

DECLARE @D1 AS DECIMAL (3, 1) = 99.9
DECLARE @D2 AS DECIMAL (3, 1) = 99.9

SELECT @D1 * @D2 AS MULTIPLICATION, CAST(@D1 * @D2 AS VARBINARY)
SELECT @D1 / @D2 AS DIVISION, CAST(@D1 / @D2 AS VARBINARY)

Image

Try changing the precision and scale of @D1, and @D2 variables (without changing the assigned value), and check the resultant precision and scale.

So far, you don’t have any issue.  Suppose ARITHABORT or ANSI_WARNINGS settings are ON. In that case if you try to assign this result to a variable with precision 3, you will get arithmetic overflow error.

DECLARE @D1 AS DECIMAL (3, 1) = 99.9
DECLARE @D2 AS DECIMAL (3, 1) = 99.9
DECLARE @D3 AS DECIMAL (3, 1)

SELECT @D1 + @D2, CAST(@D1 + @D2 AS VARBINARY)
— This will cause Arithmetic overflow error, If
— SET ANSI_WARNINGS ON, — or
— SET SET ARITHABORT ON
SELECT @D3 = @D1 + @D2

How to avoid this overflow error? First method is just set the above settings OFF. However this is not recommended.

The second method is assigning the result to a variable with properly defined precision and scale and use explicit cast. This is the recommended method. For example,

DECLARE @D1 AS DECIMAL (3, 1) = 99.9
DECLARE @D2 AS DECIMAL (3, 1) = 99.9
DECLARE @AddResult AS DECIMAL (4, 1)
DECLARE @MulResult AS DECIMAL (7, 2)

SELECT
@AddResult = CAST (@D1 + @D2 AS DECIMAL (4, 1))
,@MulResult = CAST (@D1 * @D2 AS DECIMAL (7, 2))
SELECT @AddResult AS ‘ADD’, @MulResult AS ‘MUL’

So how do you decide the precision and scale of the result for each arithmetic operation? Microsoft provides the below formula in Books Online:

Operation Result precision Result scale
e1 + e2 max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
e1 – e2 max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
e1 * e2 p1 + p2 + 1 s1 + s2
e1 / e2 p1 – s1 + s2 + max(6, s1 + p2 + 1) max(6, s1 + p2 + 1)
e1 {UNION| EXCEPT

| INTERSECT

} e2

max(s1, s2) + max(p1-s1, p2-s2) max(s1, s2)

e1 is expression1 and e2 is expression2; p1 and s1 are precision and scale of e1 respectively, and p2 and s2 are precision and scale of e2 respectively. Using these formulas we can easily calculate the resulting precision and scale.

One more example let us say a product has a unit price value stored as DECMAL (9, 4) and a fractional quantity purchased as DECIMAL (5, 2). What will be the precision and scale to store the total amount (i.e. total amount = unit price X quantity). Apply the formula from the above table for e1 * e2; the result precision will be 9 + 5 + 1 = 15, and scale will be 6. However if you need the total amount in four decimal places you need to round this accordingly. Let us verify this.

DECLARE @UnitPrice AS DECIMAL (9, 4) = 452.1250
DECLARE @Quantity AS DECIMAL (5, 2) = 3.15
DECLARE @TotalAmount AS DECIMAL (15, 4)

SELECT @TotalAmount = CAST(@UnitPrice * @Quantity AS DECIMAL (15, 4))

Please note that these operations require rounding of decimal numbers without losing the precision. SQL Server will do this rounding internally based on the algorithms mentioned in Part 1.

Conclusion

In this series (Part 1, Part 2 & Part 3), I have explained how a decimal number is represented and how arithmetic operations are impacting the precision and scale of the resultant decimal number. I hope, a clear understanding on these details will help you to design better and error-free codes.

Leave a comment

Filed under Basics

How to decide the precision and scale of a decimal column – Part 2

In part 1 of this series here , we have discussed how a decimal number can be Imageexpressed in terms of coefficient, exponent, and sign. We have also discussed how the precision and scale mapped to the coefficient and exponent.

Now let us take a look at how the decimal is physically represented in SQL Server. For that purpose I am creating a table and insert some rows to it.

CREATE TABLE DecimalTest
(
ColID VARCHAR(10),
ColDecimal DECIMAL (5, 4)
)
GO

INSERT INTO DecimalTest VALUES (‘ROW1’, 1.2345)
INSERT INTO DecimalTest VALUES (‘ROW2’, -1.2345)

To see how the decimal is represented in a data page; we are going to use two undocumented DBCC commands, IND and PAGE. The DBCC IND will give the details of file number and page number of the table we are interested. DBCC PAGE will show the actual content of the page (I’m not going in details to the various options of IND and PAGE since these are out of scope of the current discussion).

DBCC IND (‘TestDB’, ‘DecimalTest’, -1)
GO
The output of the command is given below

 

Image

We are concerned about the PageType 1 and corresponding to this, the PageFID is 1 and PagePID is 302. This is our page to investigate. Run the below DBCC PAGE command.
DBCC TRACEON(3604);
GO
DBCC PAGE (‘TestDB’, 1, 302, 1)
GO
DBCC TRACEOFF(3604);
GO
The part of the output we are interested is given below. The data inside the blue rectangle is the decimal values for each row. This is represented in hexadecimal format.

Image

Each two digit represents a byte in hexadecimal format. So, the decimal value is stored here in five bytes. The left most first byte is the sign byte. 01 indicates the decimal value is positive and 00 indicates the decimal value is negative. The remaining four bytes (39 30 00 00) are the actual integer value (coefficient). This is a byte-swapped format and so to get the actual hexadecimal representation we need to swap each byte like: 00 00 30 39. When this is converted to decimal equivalent we get 12345. So the first row stores +12345 and second row stores -12345. To find out the actual decimal value stored we need to know the scale. However this is not stored in the data page. This is because the precision and scale is a metadata and can be easily get by querying the system metadata tables.

SELECT c.name, c.precision, c.scale
FROM sys.columns c
INNER JOIN
sys.types t
ON t.user_type_id = c.user_type_id
WHERE
t.name = ‘decimal’
AND c.object_id = OBJECT_ID(N’DecimalTest’)

Image

So the decimal value stored in each row can be calculated by multiplying the value stored in the row by 10-4.

In SQL Server the bytes required to store each decimal value (coefficient) can vary based on the precision. The Books Online summarized this in the following table.

Precision

Storage bytes 

1 – 9

5

10-19

9

20-28

13

29-38

17

 What about decimal variables? Let us check that next.

DECLARE @D1 AS DECIMAL (5, 4)
DECLARE @D2 AS DECIMAL (5, 4)

SET @D1 = 1.2345
SET @D2 = -1.2345

SELECT CAST(@D1 AS VARBINARY) AS PosDecimal
,CAST(@D2 AS VARBINARY) AS NegDecimal

Image

This is very much similar to the row representation in the data page. However the precision and scale is also encoded to the variable data itself as indicated by the blue rectangle.

In the next part we will discuss the arithmetic operations carried out on decimal numbers

Leave a comment

Filed under Basics

How to decide the precision and scale of a decimal column – Part 1

There are so many confusions arise when we talk about decimal data type among

Imagenovice developers.  In SQL Server forums also I have seen many persons asking questions related to decimal types. The confusion is actually because of interpreting the values of precision and scale defined for the decimal data type. For example DECIMAL (5, 4) can be easily interpreted by them as 00000.0000, that is, five zeros before the decimal point and four zeros after the decimal point.

 

Actually, this is not the case. This three part series explain the decimal data type, the precision and scale and how this is used in T-SQL.

Basics

A good understanding of decimal arithmetic will help you to make good decisions when designing columns with decimal data types. This post is mostly based on IEEE 754 standard for floating point arithmetic.

Any number (a finite number) can be represented by three parameters as follows,

  1. Sign – in which 0 indicates a positive number and 1 indicates a negative number
  2. Coefficient – is an integer number that can be zero or positive, and
  3. Exponent – is a signed integer that represents a power of ten that has to be multiplied by the coefficient

An abstract representation of a decimal number

Let me arbitrarily take an integer number 123450. Dividing the number by 10 yields the decimal number 12345.0. Similarly, dividing the number by 100 yields the decimal number 1234.50. We can formulate this as,

123450 / 10     = 123450 / 101 = 12345.0

123450 / 100   = 123450 / 102 = 1234.50

123450 / 1000             = 123450 / 103 = 123.450

We can also write this as;

123450 X 10-1 = 12345.0

123450 X 10-2 = 1234.50

123450 X 10-3 = 123.450, etc.

The integer number 123450 is the coefficient and the numbers -1, -2, -3 etc are called the exponents. So to represent the decimal number123.450, we need the coefficient 123450 and the exponent -3. And also a sign bit 0 for representing a positive decimal. So a numerical value of a finite number can be represented by:

(-1) sign X coefficient X 10exponent

And also there are 6 significant digits in the coefficient. Many abstract arithmetic operations are defined on the decimal number such as abs, add, subtract, multiply, divide, min, max to name a few. One problem arises during the arithmetic operation on decimal number. The abstract model does not specify a limit to the value of coefficient. However a computer system represents data in a precise format (bytes). So during the arithmetic operations like multiply and divide, the number of significant digits in a resulting coefficient may increase. Now we are forced to round the resultant decimal to fit in to the desired significant digits.

Various algorithms are used by the computer system, if the result from a decimal arithmetic operation needs to be rounded. IEEE 754 defines five algorithms for rounding, that are, round-down, round-half-up, round-half-even, round-ceiling, round-floor. I’m not going in detail to the rounding algorithms here.

SQL Server representation

In the previous section, we have seen how a decimal number can be represented by an abstract model provided by IEEE 754 standard. The ISO/ANSI SQL standard also defines exact numeric data types, based on that model and this is what SQL Server has been implemented.

The domain (the range of values) of decimal data type in SQL is defined by two terms: the precision and the scale. Precision (P) is a positive integer that determines the number of significant digits. This is same as the significant digit of the coefficient in the abstract decimal model. For example a precision of 1 means only one digit can be stored as the coefficient, that is, any single digit from -9 to +9. The following table summarizes the precision:

Precision          Coefficient        Coefficient

                        (Range From)   (Range To)

1                      -9                     +9

2                      -99                   +99

3                      -999                 +999

4                      -9999               +9999

5                      -99999             +99999

6                      -999999           +999999

The scale (S) is a non-negative integer in SQL. To get the exact decimal value, the coefficient value must be multiplied by 10-S. Did you notice anything similar to the abstract model of decimal with this? Yes, the scale is the exponent. Note that a scale of 0 represents an integer value in SQL.

The scale (S) is a non-negative integer in SQL. To get the exact decimal value, the coefficient value must be multiplied by 10-S. Did you notice anything similar to the abstract model of decimal with this? Yes, the scale is the exponent. Note that a scale of 0 represents an integer value in SQL.

For example, DECIMAL (5, 1) represents precision 5, and scale 1 and that indicate the number has five digits in total in which one digit is after the decimal point. Scales (S) can be range from 0 to precision. The below table summarizes various scales for a precision of 5 as an example:

Precision : 5, Coefficient: 99999

Scale    10-S      CoefficientX10-S           CoefficientX10-S

                        (Range From)               (Range To)

0          10-0      -999999                       +999999

1          10-1      -99999.9                      +99999.9

2          10-2      -9999.99                      +9999.99

3          10-3      -999.999                      +999.999

4          10-4      -99.9999                      +99.9999

5          10-5      -9.99999                      +9.99999

So the DECIMAL (5, 4) in the conversion at the start of this post turns out to be in the range of -9.9999 and +9.9999. So we cannot assign a value of 100.0000 to this.

In the next part we will see how SQL Server internally represents decimal numbers.

Leave a comment

Filed under Basics