Databases

Relational Database

A database model that facilitates relationships between different data points.

Info

Referential Integrity

Referential integrity is a property of data stating that all its references are valid.
In the context of relational databases, it requires that if a value of one attribute of a relation references a value of another attribute, then the referenced value must exist.

Definitions

Basics

Entity: A real-world object (expr: rectangular box)
- Strong Entity: When all the tuples can be identified using the primary key
- Weak Entity: When the entity doesn't have sufficient attributes to form a primary key
Attribute: Characterstic of an entity (a field, a column)
Relationship: Association between two or more entities (expr: diamond)
Cardinality: Number of times an entity of an entity set participates in a relationship set
- One-to-One
- One-to-Many
- Many-to-One
- Many-to-Many
Specialization: Top-down approach in which one entity is divided into two or more sub-entities based on its characterstics
Generalization: Bottom-up approach in which common properties are combined to form one entity
Aggregation: Abstraction process through which relationships are represented as higher-level entity sets
Participation Constraint: Specifies the maximum or minimum number of relationship instances in which any entity can participate
- Total Participation: Each entity of an entity set participates in at least one relationship
- Partial Participation: Some entity of the entity set may not participates in any relationship

Keys

Candidate Key: The minimal set of attributes that can determine a tuple uniquely
Super Key: The set of attributes that can determine a tuple uniquely.
- A candidate key is always a super key but not vice-versa
Primary Key & Alternate Key: Among various candidate keys, one is taken as Primary Key and others are Alternate Keys
Foreign Key: A set of attributes in a table that is used to refer to the primary key or alternative key of the same or another table

Functional Dependency

It is a constraint that specifies the association/relationship between a set of attributes. - ie. one set can accurately determine the value of another set

Normalization

Used to eliminate the following anomalies:
- Insertion Anomaly
- Deletion Anomaly
- Updation Anomaly
- Join Anomaly
To achieve high integrity in the db and make it more maintainable:
- 1 NF: if a relation does not contain any composite attribute (attribute which is the combination of other attributes). High redundancy.
- 2 NF: if a relation is in 1 NF and it doesn't contain any partial dependency.
  - Seperate data that is only partially dependent on the primary key into its own entity
- 3 NF: if a relations is in 2 NF and it doesn't contain any transitive dependency.
  - Remove data from entities that is not directly dependent on the primary key.
- Boyce-Codd Normal Form: if the LHS of every FD is super key.

Partial Dependency

Partial dependency is a concept in database normalization, specifically in the context of the second normal form (2NF). A partial dependency occurs when a non-prime attribute (an attribute that is not part of any candidate key) is functionally dependent on part of a candidate key rather than the whole key.

To illustrate this, let's consider a table that is not in 2NF due to partial dependencies. Suppose we have a CourseEnrollment table that keeps track of which students are enrolled in which courses, including some additional information about the courses.

CourseEnrollment Table

StudentID	CourseID	CourseName	Instructor	Grade
101	CSE101	Computer Science	Dr. Smith	A
102	CSE101	Computer Science	Dr. Smith	B
101	MAT202	Mathematics	Dr. Jones	A

In this table:

The primary key is the combination of StudentID and CourseID.
CourseName and Instructor are attributes of the course, while Grade is an attribute of the student's performance in the course.

Identifying Partial Dependencies

StudentID + CourseID -> CourseName
StudentID + CourseID -> Instructor
StudentID + CourseID -> Grade

Here, CourseName and Instructor are functionally dependent on only CourseID, which is part of the composite primary key. This means there is a partial dependency, as these attributes do not depend on the whole primary key (StudentID + CourseID).

Resolving Partial Dependencies

To remove partial dependencies and achieve 2NF, we need to separate the table into two tables: one for courses and one for enrollments.

Courses Table

CourseID	CourseName	Instructor
CSE101	Computer Science	Dr. Smith
MAT202	Mathematics	Dr. Jones

CourseEnrollment Table

StudentID	CourseID	Grade
101	CSE101	A
102	CSE101	B
101	MAT202	A

In these tables:

Courses table contains the CourseID, CourseName, and Instructor, where CourseID is the primary key.
CourseEnrollment table contains the StudentID, CourseID, and Grade, where the combination of StudentID and CourseID is the primary key.

By splitting the original table into these two tables, we've removed the partial dependencies. Now, CourseName and Instructor are fully dependent on the primary key of the Courses table (CourseID), and Grade is fully dependent on the primary key of the CourseEnrollment table (StudentID + CourseID). This ensures that the database design adheres to the second normal form (2NF).

Database Design Goals

Prime goal is to:

To have zero redundancy in the system
Loss-less joins
Dependency Preservation

Advantages of DBMS over traditional file system

Downs of file system:

Difficulty in accessing data
Data isolation - multiple files and formats
Integrity problems
Atomicity of updates
Security problems
Concurrent access by multiple users

DBMS advantages:

Duplicity of data does not happen as data is stored into a single database and in one place
Database can easily be shared with multiple users simultaneously
Backup and Recovery facility is there
Integrity Constraints are enforced
We can change the structure of data without affecting the stucture of any of the application programs

File Structure

logical relation between records and it defines how file records are mapped into disk blocks (memory).
A db is a collection of files, each file is a collection of records and each record contains a sequence of fields.

Info

Blocking Factor is the average number of records per block.

Q/A

1) Difference between primary key and unique constraints.

Primary key cannot have NULL value while unique constraints can have NULL values.
There is only 1 primary key in the table while there can be multiple unique constraints.

2) What is database normalization?

It is a process of analyzing the given relation schemas based on their functional dependencies and primary keys to achieve the following desirable properties:
- Minimizing Redundancy
- Minimizing the Insertion, Deletion, And Update Anomalies
Relation schemas that do not meet the properties are decomposed into smaller relation schemas that could meet desirable properties.
It avoids duplicate data and helps in dividing the large db into smaller tables and linking them using relationships.

3) Difference between HAVING and WHERE clause.

HAVING is used to specify a condition for a group or an aggregate function used in SELECT statement.
WHERE clause selects before grouping while HAVING clause selects rows after grouping.
WHERE cannot contain aggregate functions.

4) What is a Join?

Used to combine data from two or more tables, based on a common field b/w them.
Types of Joins:
- INNER JOIN
- LEFT JOIN
- RIGHT JOIN
- FULL JOIN

5) What is an identity?

A column that automatically generates numeric values.
A start and increment value can be set, but most DBs leave these at 1.

6) What is a view?

A virtual table based on the result-set of an SQL statement.
Create it using:

CREATE VIEW view_name AS
SELECT column_name
FROM table_name
WHERE condition

Can represent subset of data
Take very little space to store; as the db contains only the definition of view not the actual data
Can provide extra security

7)What is a trigger?

A code associated with insert, update or delete operations.
The code is executed automatically whenever the associated query is executed on a table.

8) What is a stored procedure?

Like a function that contains a set of operations compiled together.
Can be reused again and again and also support parameters.

Info

Trigger cannot be called directly like stored procedure. They are only associated with queries.

9) What is a transaction?

A Database Transaction is a set of database operations that must be treated as a whole, which means either all operations are executed or none of them. An example can be a bank transaction from one account to another account.

10) What are indexes?

A database index is a data structure that improves the speed of data retrieval operations on a database.
These indexes need extra space on the disk, but they allow faster search according to different frequently searched values.

11) What is a Live Lock?

Live lock in a database context refers to a situation where a database system is unable to make progress on certain transactions because the transactions are continually being restarted (in a cycle).
Example:
- Consider two transactions, T1 and T2, that need access to the same set of resources (e.g., rows in a table).
- T1 locks resource A and needs resource B.
- T2 locks resource B and needs resource A.
- Both transactions detect potential conflict and back off or roll back.
- They both retry and repeat the same pattern without making progress.
SOLUTION: Exponential Backoff

12)Difference b/w DROP, TRUNCATE and DELETE commands.

DROP deletes the table; cannot be rolled back.
TRUNCATE: only the data is deleted; structure is preserved; can be rolled back
DELETE: same as TRUNCATE but slower; it can delete 1 or more specific rows from the table

13) Difference b/w UNION and UNION ALL.

UNION: removes duplicate rows and picks the rows which are distinct after combining the data from the tables
UNION ALL: doesnot remove the duplicate rows, it just picks all the data from the tables

14) What is Correlated Subquery?

A Subquery is also known as a nested query i.e. a query written inside some query. When a Subquery is executed for each of the rows of the outer query then it is termed as a Correlated Subquery.

15) What are the different levels of abstraction in the DBMS?

3 levels:
- Physical Level: Lowest level which states how data should be stored in the db
- Logical Level: which states the type of data and relationship among the data that is stored in the db
- View Level: Highest level of abstraction which shows/states only a part of the db

16) What integrity rules exist in the DBMS?

2 major rules:
- Entity Integrity: Value of a primary key cannot be NULL
- Referential Integrity: either the value of a Foreign key is a NULL value or it should be the primary key of any other relation.

17) Command for temporary tables.

CREATE GLOBAL TEMPORARY TABLE

18) Main goal of RAID technology.

Redundant Array of Independent Disks
RAID is a method of combining several hard disk drives into one logical unit (two or more disks grouped together to appear as a single device to the host system).
RAID technology was developed to address the fault-tolerance and performance limitations of conventional disk storage. It can offer fault tolerance and higher throughput levels than a single hard drive or group of independent hard drives.

19) What’s the difference between materialized and dynamic view?

Materialized views are disk-based and are updated periodically based upon the query definition.
Dynamic views are virtual only and run the query definition each time they are accessed.

20) In a table, rows are repeated (or duplicate rows). Write a query to find the repeated row.

SELECT name, section FROM table
GROUP BY name, section
HAVING COUNT(*) > 1;

21) Query to find the 2nd highest salary of an employee.

SELECT max(salary) FROM employees 
WHERE salary <> (SELECT max(salary) FROM employees);

22) In a table are two columns: Student and Marks. Find all the students, whose marks are greater than average marks.

SELECT student, marks
FROM table
WHERE marks > SELECT AVG(marks) FROM table;