DWH

What is BI?

Business Intelligence refers to a set of methods and techniques that are used by organizations for tactical and strategic decision making. It leverages methods and technologies that focus on counts, statistics and business objectives to improve business performance.

The objective of Business Intelligence is to better understand customers and improve customer service, make the supply and distribution chain more efficient, and to identify and address business problems and opportunities quickly.Warehouse is used for high level data analysis purpose. It is used for predictions, time series analysis, financial Analysis, what -if simulations etc. Basically it is used for better decision making.

What is a Data Warehouse? 
Data Warehouse is a "Subject-Oriented, Integrated, Time-Variant Nonvolatile collection of data in support of decision making".
In terms of design data warehouse and data mart are almost the same.
In general a Data Warehouse is used on an enterprise level and a Data Marts is used on a business division/department level.  
Subject Oriented:
Data that gives information about a particular subject instead of about a company's ongoing operations.
Integrated:
Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant:
All data in the data warehouse is identified with a particular time period.
Non-volatile:
Data is stable in a data warehouse. More data is added but data is never removed. 
What is a datamart? 
Datamart is usually sponsored at the department level and developed with a specific  details or subject in mind, a Data Mart is a  subset of data warehouse with a focused objective.  
What is the difference between a data warehouse and a data mart? 
 In terms of design data warehouse and data mart are almost the same.
In general a Data Warehouse is used on an enterprise level and a Data Marts is used on a business division/department level.
A data mart only contains data specific to a particular subject areas.  

Difference between data mart and data warehouse
Data Mart
Data Warehouse
Data mart is usually sponsored at the department level and developed with a specific issue or subject in mind, a data mart is a data warehouse with a focused objective.
Data warehouse is a “Subject-Oriented, Integrated, Time-Variant, Nonvolatile collection of data in support of decision making”.
A data mart is used on a business division/ department level.
A data warehouse is used on an enterprise level
A Data Mart is a subset of data from a Data Warehouse. Data Marts are built for specific user groups.
A Data Warehouse is simply an integrated consolidation of data from a variety of sources that is specially designed to support strategic and tactical decision making.
By providing decision makers with only a subset of data from the Data Warehouse, Privacy, Performance and Clarity Objectives can be attained.
The main objective of Data Warehouse is to provide an integrated environment and coherent picture of the business at a point in time.
What is fact less fact table? 
A fact table that  contains  only    primary keys from the dimension tables,  and that  do not contain any measures   that type of fact table is called fact less fact table . 
What is a Schema? 
Graphical Representation of the datastructure.
First Phase in implementation of Universe        
What are the most important features of a data warehouse? 
DRILL DOWN, DRILL ACROSS, Graphs, PI charts, dashboards and TIME HANDLING.To be able to drill down/drill across is the most basic requirement of an end user in a data warehouse. Drilling down most directly addresses the natural end-user need to see more detail in an result. Drill down should be as generic as possible becuase there is absolutely no good way to predict users drill-down path.        
What does it mean by grain of the star schema? 
In Data warehousing grain refers to the level of detail available in a given fact table as well as to the level of detail provided by a star schema.
It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.      
 What is a star schema? 
 Star schema is a data warehouse schema where there is only one  "fact table" and many denormalized dimension tables.
Fact table contains primary keys from all the dimension tables and other numeric columns columns of additive, numeric facts.
What is a snowflake schema? 
Unlike Star-Schema, Snowflake  schema contain normalized dimension tables in a tree like structure with many nesting levels.
Snowflake schema is easier to maintain but queries require more joins.
What is the difference between snow flake and star schema?
Star Schema
Snow Flake Schema
The star schema is the simplest data warehouse scheme.
Snowflake schema is a more complex data warehouse model than a star schema.
In star schema each of the dimensions is represented in a single table .It should not have any hierarchies between dims.
In snow flake schema at least one hierarchy should exists between dimension tables.
It contains a fact table surrounded by dimension tables. If the dimensions are de-normalized, we say it is a star schema design.
It contains a fact table surrounded by dimension tables. If a dimension is normalized, we say it is a snow flaked design.
In star schema only one join establishes the relationship between the fact table and any one of the dimension tables.
In snow flake schema since there is relationship between the dimensions tables it has to do many joins to fetch the data.
A star schema optimizes the performance by keeping queries simple and providing fast response time. All the information about the each level is stored in one row.
Snowflake schemas normalize dimensions to eliminated redundancy. The result is more complex queries and reduced query performance.
It is called a star schema because the diagram resembles a star.
It is called a snowflake schema because the diagram resembles a snowflake.

What is Fact and Dimension?
A "fact" is a numeric value that a business wishes to count or sum.  A "dimension" is essentially an entry point for getting at the facts. Dimensions are things of interest to the business.
A set of level properties that describe a specific aspect of a business, used for analyzing the factual measures.
What is Fact Table?
A Fact Table in a dimensional model consists of one or more numeric facts of importance to a business.  Examples of facts are as follows:
·        the number of products sold
·        the value of products sold
·        the number of products produced
·        the number of service calls received
What is Factless Fact Table?
Factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events or coverage information.
Common examples of factless fact tables include:

·         Identifying product promotion events (to determine promoted products that didn’t sell)

·         Tracking student attendance or registration events

·         Tracking insurance-related accident events

Types of facts?
There are three types of facts:

·         Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

·         Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

·         Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

What is Granularity?
Principle: create fact tables with the most granular data possible to support analysis of the business process.
In Data warehousing grain refers to the level of detail available in a given fact table as well as to the level of detail provided by a star schema.
It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.
 Facts: Facts must be consistent with the grain.all facts are at a uniform grain.

·         Watch for facts of mixed granularity

·         Total sales for day & montly total

Dimensions: each dimension associated with fact table must take on a single value for each fact row.

·         Each dimension attribute must take on one value.

·         Outriggers are the exception, not the rule.

What is slowly Changing Dimension? 
Slowly changing dimensions refers to the change in dimensional attributes over time.
An example of slowly changing dimension is a Resource dimension where attributes of a particular employee  change over time like their designation changes or dept changes etc.
What is Conformed Dimension?
Conformed Dimensions (CD): these dimensions are something that is built once in your model and can be reused multiple times with different fact tables.   For example, consider a model containing multiple fact tables, representing different data marts.  Now look for a dimension that is common to these facts tables.  In this example let’s consider that the product dimension is common and hence can be reused by creating short cuts and joining the different fact tables.Some of the examples are time dimension, customer dimensions, product dimension.
What is Junk Dimension?
A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes. A good example would be a trade fact in a company that brokers equity trades.
When you consolidate lots of small dimensions and instead of having 100s of small dimensions, that will have few records in them, cluttering your database with these mini ‘identifier’ tables, all records from all these small dimension tables are loaded into ONE dimension table and we call this dimension table Junk dimension table.  (Since we are storing all the junk in this one table) For example: a company might have handful of manufacture plants, handful of order types, and so on, so forth, and we can consolidate them in one dimension table called junked dimension table
It’s a dimension table which is used to keep junk attributes
What is De Generated Dimension?
An item that is in the fact table but is stripped off of its description, because the description belongs in dimension table, is referred to as Degenerated Dimension.  Since it looks like dimension, but is really in fact table and has been degenerated of its description, hence is called degenerated dimension..
Degenerated Dimension: a dimension which is located in fact table known as Degenerated dimension
Dimensional Model:
A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimensional tables and fact tables. Dimensional table records information on each dimension, and fact table records all the "fact", or measures.
Data modeling
There are three levels of data modeling. They are conceptual, logical, and physical. This section will explain the difference among the three, the order with which each one is created, and how to go from one level to the other.
Conceptual Data Model
Features of conceptual data model include:

·         Includes the important entities and the relationships among them.

·         No attribute is specified.

·         No primary key is specified.

At this level, the data modeler attempts to identify the highest-level relationships among the different entities.
Logical Data Model
Features of logical data model include:

·         Includes all entities and relationships among them.

·         All attributes for each entity are specified.

·         The primary key for each entity specified.

·         Foreign keys (keys identifying the relationship between different entities) are      specified.

·         Normalization occurs at this level.

At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how they will be physically implemented in the database.
In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable).
The steps for designing the logical data model are as follows:
1.     Identify all entities.
2.     Specify primary keys for all entities.
3.     Find the relationships between different entities.
4.     Find all attributes for each entity.
5.     Resolve many-to-many relationships.
6.     Normalization.
Physical Data Model
Features of physical data model include:

·         Specification all tables and columns.

·         Foreign keys are used to identify relationships between tables.

·         Demoralization may occur based on user requirements.

·         Physical considerations may cause the physical data model to be quite different from the logical data model.

At this level, the data modeler will specify how the logical data model will be realized in the database schema.
The steps for physical data model design are as follows:
1.     Convert entities into tables.
2.     Convert relationships into foreign keys.
3.     Convert attributes into columns.
1.     Http://www.learndatamodeling.com/dm_standard.htm
2.     Modeling is an efficient and effective way to represent the organization’s needs; It provides information in a graphical way to the members of an organization to understand and communicate the business rules and processes. Business Modeling and Data Modeling are the two important types of modeling.
The differences between a logical data model and physical data model is shown below.
Logical vs Physical Data Modeling
Logical Data Model
Physical Data Model
Represents business information and defines business rules
Represents the physical implementation of the model in a database.
Entity
Table
Attribute
Column
Primary Key
Primary Key Constraint
Alternate Key
Unique Constraint or Unique Index
Inversion Key Entry
Non Unique Index
Rule
Check Constraint, Default Value
Relationship
Foreign Key
Definition
Comment
Type 1 Slowly Changing Dimension
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Christina
Illinois
After Christina moved from Illinois to California, the new information replaces the new record, and we have the following table:
Customer Key
Name
State
1001
Christina
California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages:
-      All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
-      Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
Type 2 Slowly Changing Dimension
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The newe record gets its own primary key.
In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Christina
Illinois
After Christina moved from Illinois to California, we add the new information as a new row into the table:
Customer Key
Name
State
1001
Christina
Illinois
1005
Christina
California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
Type3 Slowly Changing Dimension In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.
In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Christina
Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

·         Customer Key

·         Name

·         Original State

·         Current State

·         Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):
Customer Key
Name
Original State
Current State
Effective Date
1001
Christina
Illinois
California
15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.


Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time.
What is Staging area why we need it in DWH?
If target and source databases are different and target table volume is high it contains some millions of records in this scenario without staging table we need to design your informatica using look up to find out whether the record exists or not in the target table since target has huge volumes so its costly to create cache it will hit the performance.
 
If we create staging tables in the target database we can   simply do outer join in the source qualifier to determine insert/update this approach will give you good performance.
 
It will avoid full table scan to determine insert/updates on target.
And also we can create index on staging tables since these tables were designed for specific application it will not impact to any other schemas/users.
 
While processing flat files to data warehousing we can perform cleansing. 
Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate. During data cleansing, records are checked for accuracy and consistency.
 
·         Since it is one-to-one mapping from ODS to staging we do truncate and reload.
·        We can create indexes in the staging state, to perform our source qualifier best.
·        If we have the staging area no need to relay on the informatics transformation to known whether the record exists or not.
Data cleansing
Weeding out unnecessary or unwanted things (characters and spaces etc) from incoming data to make it more meaningful and informative
Data merging
Data can be gathered from heterogeneous systems and put together
Data scrubbing
Data scrubbing is the process of fixing or eliminating individual pieces of data that are incorrect, incomplete or duplicated before the data is passed to end user.
Data scrubbing is aimed at more than eliminating errors and redundancy. The goal is also to bring consistency to various data sets that may have been created with different, incompatible business rules.

ODS (Operational Data Sources):
My understanding of ODS is, its a replica of OLTP system and so the need of this, is to reduce the burden on production system (OLTP) while fetching data for loading targets. Hence its a mandate Requirement for every Warehouse.
So every day do we transfer data to ODS from OLTP to keep it up to date?
OLTP is a sensitive database they should not allow multiple select statements it may impact the performance as well as if something goes wrong while fetching data from OLTP to data warehouse it will directly impact the business.
ODS is the replication of OLTP.
ODS is usually getting refreshed through some oracle jobs.enables management to gain a consistent picture of the business.
What is a surrogate key? 
   A surrogate key is a substitution for the natural primary key. It is  a unique identifier or number ( normally created by a database sequence generator ) for each record of a dimension table that can be used for the primary key to the table. 
A surrogate key  is useful because  natural keys may change.        
 What is the difference between a primary key and a surrogate key?
A primary key is a special constraint on a column or set of columns. A primary key constraint ensures that the column(s) so designated have no NULL values, and that every value is unique. Physically, a primary key is implemented by the database system using a unique index, and all the columns in the primary key must have been declared NOT NULL. A table may have only one primary key, but it may be composite (consist of more than one column).

A surrogate key is any column or set of columns that can be declared as the primary key instead of a "real" or natural key. Sometimes there can be several natural keys that could be declared as the primary key, and these are all called candidate keys. So a surrogate is a candidate key. A table could actually have more than one surrogate key, although this would be unusual. The most common type of surrogate key is an incrementing integer, such as an auto increment column in mysql, or a sequence in Oracle, or an identity column in SQL Server.

1 comment: