Translate

Normalizing Data in 3NF Form

Deepesh Joseph
February 2009

The main purpose of normalization is to reduce data redundancy and avoid inconsistent data. Normalization leads to separation of unrelated entities into separate entities. In effect, normalization leads to clean database design. Since we do not store redundant data, we save storage space and save resources to maintain (update, delete) redundant data.

But, there are instances where we do not need to fully normalize data. The example provided in question is an excellent example to explain why do we allow de-normalized data. Looking at the customer address data, it is desirable to design city, state, country and postal codes as separate entities since they could be represented by thier own unique identifiers (state_id, country_id, zip_id) and that multiple customers may belong to same country, state, city and zip. Suppose we did design these as separate entities and try to retrieve data for the following problem --

"Generate report of all customer belonging to 'US' and who reside in 'FLORIDA's 'TAMPA' city in '33601' postal code."

The query would be something like --

"SELECT
c.customer_first_name, c.customer_middle_name, c.customer_last_name
FROM
customer c, customer_address ca, address_city act, address_zip az, address_state as, adress_country ac
WHERE
c.customer_id=ca.customer_id and c.city_id=act.city_id AND c.zip_id=az.zip_id
AND c.state_id = as.state_id and c.country_id = ac.country_id AND ac.country_name = 'US'
AND as.state_name = 'FLORIDA' AND act.city_name = 'TAMPA' AND az.zip_code = '33601'"
Notice the joins (c.customer_id=ca.customer_id and c.city_id=act.city_id etc) required in the SQL to retrieve the required information. SQLs joins are considered to be very expensive when there is huge amount of data, say we have a tera byte of data within customer table. The four additional joins is going to be very expensive and will lead to unacceptable system response time.

If we de-normalize data and allow country, state, city and zip data to reside within customer_address table, then we could rewrite the above query as --

"SELECT
c.customer_first_name, c.customer_middle_name, c.customer_last_name
FROM
customer c, customer_address ca
WHERE
c.customer_id=ca.customer_id AND ca.country_name = 'US'
AND ca.state_name = 'FLORIDA' AND ca.city_name = 'TAMPA' AND ca.zip_code = '33601'"

After de-normalization, the query would run much faster. So, in effect, normalizing data into 3NF form is not always practical.
First