Unit 1 - Notes

INT306

Unit 1: Introduction to Database

1. Purpose of Database Systems

A Database Management System (DBMS) is a software application that interacts with users, applications, and the database itself to capture and analyze data. The primary purpose of a DBMS is to provide a systematic and efficient method of defining, storing, retrieving, and manipulating data.

Limitations of File Processing Systems (Pre-DBMS)

Before DBMS, organizations relied on file processing systems where data was stored in discrete files. This approach had several major drawbacks:

  • Data Redundancy and Inconsistency: The same data is stored in multiple places (e.g., student address in both the library file and the registrar file). Updates might occur in one file but not another, leading to inconsistency.
  • Difficulty in Accessing Data: Writing a new program to carry out each new task is inconvenient. There is no standardized query language.
  • Data Isolation: Data is scattered in various files, often in different formats, making it difficult to write new application programs to retrieve appropriate data.
  • Integrity Problems: Constraints (e.g., account balance > 0) are buried in program code rather than explicitly stated. Adding new constraints is difficult.
  • Atomicity Problems: Systems are vulnerable to failure. Partial updates (e.g., money deducted from Sender but not credited to Receiver during a crash) leave the database in an inconsistent state.
  • Concurrent Access Anomalies: Multiple users updating data simultaneously can result in incorrect data if not managed properly.
  • Security Problems: Not every user should have access to all data. Enforcing security constraints in file systems is ad-hoc and difficult.

Benefits of DBMS

  • Control of Redundancy: Centralized control minimizes duplication.
  • Data Integrity: Enforces constraints to ensure data validity.
  • Data Security: Restricts unauthorized access.
  • Backup and Recovery: specialized subsystems handle crashes and restoration.

2. Components of DBMS

A DBMS is a complex system composed of five major components often referred to as the "DBMS Environment":

  1. Hardware:

    • The physical electronic devices (computers, storage devices like hard disks, I/O devices) that store and run the software.
    • Ranges from a single PC to a network of mainframes.
  2. Software:

    • The DBMS itself (e.g., MySQL, Oracle, PostgreSQL).
    • Operating System (OS).
    • Network software (if the DBMS is distributed).
    • Application programs (C++, Java, Python scripts) accessing the data.
  3. Data:

    • The most important component. It includes:
      • Operational Data: The actual entities and attributes (e.g., Student names, Grades).
      • Metadata: Data about data (schema, constraints, structure), stored in the Data Dictionary.
  4. Procedures:

    • The instructions and rules that govern the design and use of the database.
    • Includes instructions for logging in, starting/stopping the DBMS, making backups, and handling failures.
  5. People:

    • Database Administrators (DBA): Responsible for authorizing access, monitoring use, and acquiring resources.
    • Application Programmers: Write programs that interact with the database.
    • End Users: Naive users (interact via forms), Sophisticated users (write SQL queries), and Specialized users (write specialized database applications).

3. Applications of DBMS

Databases are ubiquitous in modern technology. Key application areas include:

  • Banking: For customer information, accounts, loans, and banking transactions.
  • Airlines: For reservations, schedule information, and assigning crew to flights.
  • Universities: For student information, course registrations, and grades.
  • Telecommunications: For keeping records of calls made, generating monthly bills, and maintaining balances for prepaid calling cards.
  • Finance: For storing information about holdings, sales, and purchases of financial instruments (stocks/bonds).
  • Sales/E-commerce: For customer, product, and purchase information.
  • Manufacturing: For management of the supply chain and tracking production of items in factories (Inventory management).
  • Human Resources: For information about employees, salaries, payroll taxes, and benefits.

4. Three-Tier DBMS Architecture (ANSI-SPARC)

This architecture divides the system into three levels of abstraction to hide the complexity of data storage from users.

1. Physical Level (Internal Level)

  • Lowest level: Describes how the data is actually stored.
  • Deals with complex low-level data structures, allocation of storage space, data compression, and encryption techniques.
  • Managed by the Database Administrator.

2. Logical Level (Conceptual Level)

  • Middle level: Describes what data is stored in the database and what relationships exist among those data.
  • Describes the entire database in terms of a small number of relatively simple structures (tables, entities).
  • Hides the details of physical storage structures.
  • Used by application programmers and DBAs.

3. View Level (External Level)

  • Highest level: Describes only part of the entire database.
  • Simplifies the interaction with the system for end-users.
  • Provides different views for different users (e.g., a payroll employee sees salary data, but a project manager sees only work hours).
  • Ensures security by hiding sensitive data.

5. Data Independence

Data independence is the ability to modify the schema definition in one level without affecting the schema definition in the next higher level.

1. Physical Data Independence

  • The capacity to change the internal schema (Physical Level) without having to change the conceptual schema (Logical Level).
  • Example: Changing the storage device from a Hard Drive to an SSD, or changing the hashing algorithm used for storage, should not require changing the logical table structure.
  • Ease: Relatively easy to achieve.

2. Logical Data Independence

  • The capacity to change the conceptual schema without having to change the external schemas or application programs.
  • Example: Adding a new column to a table (e.g., "Date of Birth") should not break an application that only reads "Name" and "ID".
  • Ease: More difficult to achieve than physical independence.

6. Database Schema and Instance

Database Schema

  • The overall design or structure of the database.
  • It is static; it changes very infrequently (only during design updates).
  • Analogy: Similar to variable declarations and type definitions in a programming language.
  • Types:
    • Physical Schema: Database design at the physical level.
    • Logical Schema: Database design at the logical level.

Database Instance

  • The collection of information/data stored in the database at a particular moment.
  • It is dynamic; it changes every time data is inserted, updated, or deleted.
  • Analogy: Similar to the value of a variable at a specific instant during program execution.

7. Data Modeling

A Data Model is a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints.

Categories of Data Models

  1. Relational Model: Uses tables to represent data and relationships. Most widely used (SQL).
  2. Entity-Relationship (ER) Model: Uses a collection of basic objects (entities) and relationships. Primarily used for database design.
  3. Object-Based Data Model: Extends the ER model with concepts of encapsulation, methods, and object identity.
  4. Semistructured Data Model: Permits the specification of data where individual data items of the same type may have different sets of attributes (e.g., JSON, XML).
  5. Legacy Models:
    • Hierarchical Model: Tree-like structure (Parent-Child).
    • Network Model: Graph-like structure.

8. Entity-Relationship (ER) Model

The ER model is a high-level conceptual data model. It represents the real world as Entities and the Relationships between them.

Key Components

A. Entity and Entity Set

  • Entity: A "thing" or object in the real world that is distinguishable from other objects (e.g., a specific student, a specific car).
  • Entity Set: A set of entities of the same type (e.g., all students).
  • Representation: Rectangle.

B. Attributes

Properties used to describe an entity.

  • Simple: Cannot be divided (e.g., Age).
  • Composite: Can be divided into sub-parts (e.g., Name First, Middle, Last).
  • Single-valued: One value for a specific entity (e.g., Student ID).
  • Multi-valued: Multiple values for an entity (e.g., Phone Numbers). Representation: Double Oval.
  • Derived: Calculated from other attributes (e.g., Age derived from DOB). Representation: Dashed Oval.
  • Key Attribute: Uniquely identifies an entity (e.g., SSN). Representation: Underlined text in an oval.

C. Relationship and Relationship Set

  • Relationship: An association among several entities (e.g., Student Enrolled in Course).
  • Representation: Diamond.

Mapping Cardinalities

Expresses the number of entities to which another entity can be associated via a relationship set.

  1. One-to-One (1:1): An entity in A is associated with at most one entity in B (e.g., Driver License to Person).
  2. One-to-Many (1:N): An entity in A is associated with any number of entities in B (e.g., Mother to Children).
  3. Many-to-One (N:1): Many entities in A are associated with one entity in B (e.g., Students to College).
  4. Many-to-Many (N:N): Entities in A are associated with many in B and vice versa (e.g., Students to Courses).

Strong vs. Weak Entity

  • Strong Entity: Has a primary key. Represented by a single rectangle.
  • Weak Entity: Does not have a primary key of its own; depends on a strong entity (owner). Represented by a double rectangle.

9. Relational Model

The Relational Model uses a collection of tables to represent both data and the relationships among those data. It is the theoretical basis for SQL.

Basic Terminology

  • Relation: A table with columns and rows.
  • Attribute: A named column of a relation (e.g., Name, ID).
  • Domain: The set of permitted values for each attribute (e.g., integer, string).
  • Tuple: A row in a relation (represents a single record).
  • Degree: The number of attributes (columns) in a relation.
  • Cardinality: The number of tuples (rows) in a relation.

Relational Integrity Constraints

Rules to ensure data accuracy and consistency.

  1. Domain Constraints:

    • Values appearing in a column must be drawn from the domain associated with that column.
    • Example: Age must be an Integer; Name must be a String.
  2. Key Constraints:

    • Super Key: A set of one or more attributes that collectively identify a tuple uniquely.
    • Candidate Key: A minimal super key (no unnecessary attributes).
    • Primary Key: A candidate key chosen by the database designer as the principal means of identifying tuples. It cannot be NULL.
  3. Entity Integrity Constraint:

    • The Primary Key cannot be Null. Each row must be identifiable.
  4. Referential Integrity Constraint:

    • Specified between two relations (tables).
    • A Foreign Key in Table A must match a Primary Key in Table B, or be Null.
    • Purpose: Ensures relationships between tables remain consistent (prevents "dangling" pointers).