Unit 6 - Notes

INT306 7 min read

Unit 6: NoSQLDatabases

1. Introduction to NoSQL Databases

NoSQL (Not Only SQL) databases are non-tabular databases that store data differently than relational tables. They come in a variety of types based on their data model, primarily document, key-value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of data and high user loads.

2. SQL vs NoSQL

Understanding the distinction between traditional Relational Database Management Systems (SQL) and NoSQL is fundamental to modern database architecture.

Feature SQL (Relational Databases) NoSQL (Non-Relational Databases)
Data Model Relational (Tables, Rows, Columns). Document, Key-Value, Graph, Wide-Column.
Schema Rigid/Predefined. Must define tables and columns before inserting data. Flexible/Dynamic. Documents in the same collection can have different structures.
Scaling Vertical Scaling (Scaling up by adding more CPU, RAM to a single server). Horizontal Scaling (Scaling out by adding more servers/nodes to a cluster).
Transactions Strict ACID (Atomicity, Consistency, Isolation, Durability) properties. Generally follows BASE (Basically Available, Soft state, Eventual consistency), though many now support ACID.
Queries Standardized using SQL (Structured Query Language). Unstructured or specialized query languages (e.g., MQL for MongoDB).
Best Used For Complex queries, multi-row transactions, legacy systems. Hierarchical data, rapid development, highly scalable web applications, real-time analytics.

3. JSON Databases and JSON Representation

JSON Databases

A JSON (JavaScript Object Notation) database is a type of NoSQL document database that stores data in JSON or JSON-like formats (like BSON in MongoDB). JSON databases allow developers to store data in the same format they use in their application code, eliminating the need for complex Object-Relational Mapping (ORM) layers.

JSON Representation of Part of a Dataset

JSON data is represented as key-value pairs. Values can be strings, numbers, booleans, arrays, or even nested JSON objects.

Example Dataset Representation:
Imagine an e-commerce dataset containing a user and their order history. In a relational DB, this would require three tables (Users, Orders, OrderItems). In a JSON database, it is represented as a single, nested hierarchical document:

JSON
{
  "_id": "user_98765",
  "name": "Jane Doe",
  "email": "jane.doe@example.com",
  "isActive": true,
  "address": {
    "street": "123 Tech Lane",
    "city": "Silicon Valley",
    "zipCode": "94000"
  },
  "orders": [
    {
      "orderId": "ORD-001",
      "date": "2023-10-15T08:30:00Z",
      "totalAmount": 1250.50,
      "items": [
        { "product": "Laptop", "quantity": 1, "price": 1200.00 },
        { "product": "Wireless Mouse", "quantity": 1, "price": 50.50 }
      ]
    }
  ]
}

4. Introduction to MongoDB

MongoDB is a highly popular open-source NoSQL document database. It stores data in flexible, JSON-like documents called BSON (Binary JSON), meaning fields can vary from document to document and data structure can be changed over time.

Key Features:

  • Document-Oriented: Data is stored in BSON.
  • High Performance: Provides high performance data persistence with embedded data models reducing I/O database operations.
  • Rich Query Language: Supports CRUD operations, data aggregation, text search, and geospatial queries.
  • High Availability: Provides replica sets for automatic failover and data redundancy.
  • Horizontal Scalability: Uses sharding to distribute data across multiple machines.

5. Structure of MongoDB

MongoDB structure maps closely to traditional SQL concepts but uses different terminology to reflect its flexible nature.

  • Database: A physical container for collections. (Equivalent to SQL Database).
  • Collection: A group of MongoDB documents. It does not enforce a schema. (Equivalent to an SQL Table).
  • Document: A set of key-value pairs. Documents have dynamic schema. (Equivalent to an SQL Row).
  • Field: A key-value pair in a document. (Equivalent to an SQL Column).
  • _id (Primary Key): Every document requires an _id field that acts as a primary key. If not provided, MongoDB automatically generates a unique 12-byte ObjectId.

6. Working with MongoDB

Interacting with MongoDB primarily involves CRUD (Create, Read, Update, Delete) operations. Below are standard operations using the Mongo shell.

Create (Insert)

JAVASCRIPT
// Inserting a single document into the 'employees' collection
db.employees.insertOne({
    name: "John Smith",
    department: "Engineering",
    salary: 85000
});

// Inserting multiple documents
db.employees.insertMany([
    { name: "Alice Johnson", department: "HR", salary: 60000 },
    { name: "Bob Brown", department: "Engineering", salary: 90000 }
]);

Read (Find)

JAVASCRIPT
// Retrieve all documents
db.employees.find();

// Retrieve documents with a specific condition
db.employees.find({ department: "Engineering" });

// Find with projection (return only name and salary, exclude _id)
db.employees.find({ department: "Engineering" }, { name: 1, salary: 1, _id: 0 });

Update

JAVASCRIPT
// Update a single document
db.employees.updateOne(
    { name: "John Smith" }, // Filter
    { $set: { salary: 95000 } } // Update action
);

// Update multiple documents
db.employees.updateMany(
    { department: "Engineering" },
    { $inc: { salary: 5000 } } // Increase salary by 5000
);

Delete

JAVASCRIPT
// Delete a single document
db.employees.deleteOne({ name: "Alice Johnson" });

// Delete all documents matching a condition
db.employees.deleteMany({ department: "HR" });

7. Index Creation & Performance Comparison using EXPLAIN

Index Creation

Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan (scanning every document in a collection) to find matches. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form.

JAVASCRIPT
// Create an ascending index on the 'department' field
db.employees.createIndex({ department: 1 });

// Create a compound index on 'department' and 'salary' (descending)
db.employees.createIndex({ department: 1, salary: -1 });

Performance Comparison using EXPLAIN

The .explain() method provides detailed execution statistics for a query. By using the "executionStats" verbosity mode, we can compare query performance before and after creating an index.

Step 1: Query WITHOUT Index

JAVASCRIPT
db.employees.find({ salary: { $gt: 80000 } }).explain("executionStats");

  • Look at: executionStats.totalDocsExamined
  • Result: The database scans all 1,000,000 documents in the collection to find 100 matches. totalDocsExamined = 1,000,000. executionTimeMillis might be high (e.g., 400ms). The stage will show COLLSCAN.

Step 2: Create Index

JAVASCRIPT
db.employees.createIndex({ salary: 1 });

Step 3: Query WITH Index

JAVASCRIPT
db.employees.find({ salary: { $gt: 80000 } }).explain("executionStats");

  • Look at: executionStats.totalDocsExamined
  • Result: The database uses the index. totalDocsExamined = 100. executionTimeMillis drops drastically (e.g., 2ms). The stage will show IXSCAN (Index Scan) followed by FETCH.

8. Introduction to DynamoDB

Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale.

Key Concepts of DynamoDB:

  • Tables, Items, and Attributes: Similar to Collections, Documents, and Fields in MongoDB.
  • Partition Key: A simple primary key composed of one attribute. DynamoDB uses an internal hash function on this key to determine physical data storage.
  • Sort Key: Combined with a partition key to create a composite primary key. It allows storing multiple items with the same partition key but sorted by the sort key.
  • Provisioned vs. On-Demand Capacity: You can provision exact read/write capacity units per second, or choose on-demand where AWS automatically scales and bills based on actual requests.
  • Global Secondary Indexes (GSIs): Allows querying the table using attributes other than the primary key.

9. Serverless Cloud Databases

A Serverless Cloud Database abstracts all database infrastructure management (provisioning, patching, scaling, and maintenance) away from the developer.

Key Characteristics:

  • Auto-Scaling: Automatically scales compute and storage resources up or down based on application workload. It can even scale to zero when not in use.
  • Pay-as-you-go Pricing: You are billed only for the exact compute time and storage used, rather than a pre-provisioned server instance.
  • High Availability & Fault Tolerance: Inherently built-in across multiple availability zones by the cloud provider.

Examples:

  • Amazon DynamoDB On-Demand
  • Amazon Aurora Serverless (Relational, but serverless)
  • MongoDB Atlas Serverless
  • Google Cloud Firestore

10. Vector Databases

Vector databases are an emerging class of databases specifically designed to store, manage, and search vector embeddings (high-dimensional mathematical representations of data like text, images, or audio).

Why do we need them?
Traditional databases search for exact keyword matches (lexical search). Vector databases enable semantic search—finding data that means the same thing, even if the keywords differ. They are the backbone of modern Generative AI and Large Language Models (LLMs).

Core Concepts:

  • Vector Embeddings: A machine learning model (like OpenAI's embedding model) converts data into an array of floats (e.g., [0.12, -0.45, 0.89, ...]). Similar items are located close to each other in vector space.
  • Similarity Search / Distance Metrics: Queries are converted into vectors, and the database finds the closest stored vectors using metrics like Cosine Similarity, Euclidean Distance, or Dot Product.
  • Approximate Nearest Neighbor (ANN): Because calculating exact distances across millions of vectors is slow, vector DBs use ANN algorithms (like HNSW - Hierarchical Navigable Small World) to find highly probable matches in milliseconds.

Use Cases:

  • Retrieval-Augmented Generation (RAG) for AI Chatbots.
  • Image and Audio similarity search.
  • Recommendation engines.

Examples: Pinecone, Milvus, Weaviate, Qdrant, and vector extensions to existing databases like pgvector for PostgreSQL.