Unit 6 - Notes

INT306

Unit 6: NoSQL Databases

1. Introduction: SQL vs NoSQL

The transition from RDBMS (Relational Database Management Systems) to NoSQL (Not Only SQL) represents a shift from rigid, schema-based storage to flexible, scalable data management designed for modern web-scale applications.

Key Differences

Feature	SQL (Relational)	NoSQL (Non-Relational)
Data Structure	Table-based with Rows and Columns.	Document, Key-Value, Wide-Column, or Graph-based.
Schema	Pre-defined (Static). Schema must be altered before inserting new data types.	Dynamic (Schemaless). Fields can be added on the fly; documents in the same collection can differ.
Scalability	Vertical Scaling (Scale Up): Increasing RAM/CPU of a single server.	Horizontal Scaling (Scale Out): Adding more servers (Sharding) to distribute load.
Relationships	Uses JOINs to connect tables.	Data is usually denormalized (embedded) or linked via references; JOIN support is limited (e.g., MongoDB's $lookup aggregation stage) or absent.
Transactions	ACID (Atomicity, Consistency, Isolation, Durability) compliance is standard.	Often follows BASE (Basically Available, Soft state, Eventual consistency), though many now support ACID.
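
The "Relationships" row can be made concrete with a small sketch in plain JavaScript. The documents, field names (such as address_id), and the resolveAddress helper are hypothetical and exist only for illustration; no database is involved.

```javascript
// Embedded (denormalized): the address lives inside the user document.
const embeddedUser = {
  _id: 1,
  name: "Alice",
  address: { city: "NYC", zip: "10001" },
};

// Referenced: the user stores only an id; the address is a separate document.
const addresses = [{ _id: 100, city: "NYC", zip: "10001" }];
const referencedUser = { _id: 1, name: "Alice", address_id: 100 };

// With references, the application (or a join-like stage) needs a second lookup.
function resolveAddress(user, addressCollection) {
  return addressCollection.find((a) => a._id === user.address_id) ?? null;
}

const cityEmbedded = embeddedUser.address.city; // one read
const cityReferenced = resolveAddress(referencedUser, addresses).city; // two reads
console.log(cityEmbedded, cityReferenced);
```

Embedding favors fast reads of data that is always fetched together; referencing avoids duplicating data that is shared between many documents.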

[Figure: split-screen comparison diagram of SQL / Relational and NoSQL]

2. Introduction to MongoDB & Structure

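MongoDB is one of the most widely used document-oriented NoSQL databases. Its Community Server source code is publicly available (under the Server Side Public License, SSPL), and it provides high performance, high availability through replication, and horizontal scaling through sharding.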

Core Concepts and Hierarchy

Database: A physical container for collections. A single MongoDB server can hold multiple databases.
Collection: A group of MongoDB documents. This is the equivalent of an RDBMS "Table". It does not enforce a schema by default, although optional validation rules can be configured.
Document: A set of key-value pairs. This is the equivalent of an RDBMS "Row". Documents utilize BSON (Binary JSON) format.
Field: A key-value pair in a document. Equivalent to a "Column".

BSON (Binary JSON)

While MongoDB allows developers to work with JSON, it stores data internally as BSON.

Efficiency: BSON is designed to be efficient for storage and scanning speed.
Data Types: BSON supports types not found in standard JSON, such as Date, ObjectId (the default type of the _id primary key), and raw Binary data.
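
A minimal sketch, in plain JavaScript without any MongoDB driver, of why the extra BSON types matter. The first part shows a Date losing its type in a JSON round trip. The second part decodes the creation time from an ObjectId written as hex, relying on the fact that the first 4 bytes hold seconds since the Unix epoch. The helper name objectIdTimestamp is hypothetical; the id value is the one used in the JSON example later in these notes.

```javascript
// Plain JSON has no date type: a Date becomes a string and stays one after parsing.
const original = { joined: new Date("2023-10-15T14:30:00Z") };
const roundTripped = JSON.parse(JSON.stringify(original));
console.log(typeof roundTripped.joined); // "string"

// An ObjectId is 12 bytes, usually written as 24 hex characters.
// The first 4 bytes (8 hex characters) encode the creation time in seconds.
function objectIdTimestamp(hexId) {
  if (!/^[0-9a-fA-F]{24}$/.test(hexId)) {
    throw new Error("expected a 24-character hex string");
  }
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const createdAt = objectIdTimestamp("507f1f77bcf86cd799439011");
console.log(createdAt.toISOString());
```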

Architecture Features

Replica Sets: Multiple copies of data on different servers to ensure high availability and redundancy. If the primary node fails, the remaining members hold an election and a secondary node automatically becomes the new primary.
Sharding: The process of storing data records across multiple machines. It is MongoDB's approach to meeting the demands of data growth (Horizontal Scaling).
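
The idea behind sharding can be sketched as routing each document to a server based on a hash of its shard key. This is a deliberately simplified illustration in plain JavaScript: the hash (an FNV-1a style function), the shardFor helper, and the modulo routing are assumptions made for the example. MongoDB itself splits the shard key space into ranges (chunks) and routes queries through a router process, so treat this as the concept only.

```javascript
// Tiny deterministic string hash (FNV-1a style, 32-bit). Illustration only.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Map a shard key to one of `shardCount` servers.
function shardFor(shardKey, shardCount) {
  return fnv1a(String(shardKey)) % shardCount;
}

const shardCount = 3;
const shards = Array.from({ length: shardCount }, () => []);
const usernames = ["alice", "bob", "charlie", "dave", "erin", "frank"];
for (const username of usernames) {
  shards[shardFor(username, shardCount)].push(username);
}
console.log(shards);
```

Because the same key always hashes to the same shard, a lookup by shard key only needs to contact one server.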

[Figure: block diagram of the MongoDB storage hierarchy]

3. DynamoDB & Serverless Cloud Databases

Amazon DynamoDB

DynamoDB is a fully managed, proprietary NoSQL database service provided by Amazon Web Services (AWS).

Type: Key-Value and Document store.
Architecture: It runs on AWS infrastructure and automatically distributes data and traffic over a sufficient number of servers.
Performance: It is designed for single-digit millisecond latency at any scale. Data is stored exclusively on SSDs.

Serverless Cloud Databases

"Serverless" does not mean there are no servers; it means the developer does not have to manage them.

Auto-scaling: The database automatically scales up or down based on request volume (throughput).
Pay-per-use: You are charged based on usage (for example, the read/write capacity units consumed in DynamoDB), not for a fixed server size.
Zero Administration: No need to patch OS, install software, or configure replication manually.
Examples: AWS DynamoDB, Google Cloud Firestore, Azure Cosmos DB.
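
To make "capacity units consumed" concrete, here is a rough sketch of how DynamoDB sizes a single request. It assumes the commonly documented rules: a read unit covers up to 4 KB (an eventually consistent read costs half), and a write unit covers up to 1 KB. The function names are hypothetical and no prices are included; check the current AWS documentation before relying on these numbers.

```javascript
const READ_UNIT_BYTES = 4 * 1024; // one read unit covers up to 4 KB
const WRITE_UNIT_BYTES = 1024;    // one write unit covers up to 1 KB

// Units consumed by reading one item of the given size.
function readUnits(itemBytes, { stronglyConsistent = true } = {}) {
  const blocks = Math.ceil(itemBytes / READ_UNIT_BYTES);
  return stronglyConsistent ? blocks : blocks / 2;
}

// Units consumed by writing one item of the given size.
function writeUnits(itemBytes) {
  return Math.ceil(itemBytes / WRITE_UNIT_BYTES);
}

const sixKb = 6 * 1024;
console.log(readUnits(sixKb));                                // 2
console.log(readUnits(sixKb, { stronglyConsistent: false })); // 1
console.log(writeUnits(2560));                                // 3
```

The practical consequence is that small items are cheaper to read and write, which is one reason item size matters when modeling data for a pay-per-use store.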

4. JSON Databases & Representation

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write and easy for machines to parse and generate.

JSON Syntax Rules

Data is in name/value pairs ("name": "value").
Data is separated by commas.
Curly braces {} hold objects.
Square brackets [] hold arrays.
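
A quick way to check these rules is to feed a few strings to JSON.parse, which is strict about them. The sample strings and the isValidJson helper below are made up for illustration.

```javascript
// Returns true when the text is syntactically valid JSON.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

const samples = {
  valid: '{"name": "value", "tags": ["a", "b"]}',
  singleQuotes: "{'name': 'value'}",   // names and strings need double quotes
  unquotedKey: '{name: "value"}',      // names must be quoted
  trailingComma: '{"name": "value",}', // commas separate, they do not terminate
};

for (const [label, text] of Object.entries(samples)) {
  console.log(label, isValidJson(text));
}
```

Only the first sample parses; the other three are accepted as JavaScript object literals (or nearly so) but are rejected as JSON.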

JSON Representation of a Dataset

Below is a representation of an E-commerce dataset. Note the Embedding (Denormalization) where the address and orders are stored inside the user document, rather than in separate tables.

JSON

{
  "_id": "507f1f77bcf86cd799439011",
  "username": "john_doe",
  "email": "john@example.com",
  "is_active": true,
  "roles": ["customer", "subscriber"],
  "contact_details": {
    "phone": "555-0199",
    "address": {
      "street": "123 Main St",
      "city": "Metropolis",
      "zip": "10012"
    }
  },
  "recent_orders": [
    {
      "order_id": "ORD-999",
      "total": 45.50,
      "items": [
        {"product": "Wireless Mouse", "qty": 1},
        {"product": "Battery Pack", "qty": 2}
      ],
      "status": "delivered"
    }
  ],
  "joined_date": "2023-10-15T14:30:00Z"
}
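
Because everything is embedded, an application can answer several questions from this one document without a join. The sketch below uses an abbreviated copy of the document above as a plain JavaScript object; the variable names are chosen for the example.

```javascript
// Abbreviated copy of the user document shown above.
const user = {
  username: "john_doe",
  contact_details: {
    phone: "555-0199",
    address: { street: "123 Main St", city: "Metropolis", zip: "10012" },
  },
  recent_orders: [
    {
      order_id: "ORD-999",
      total: 45.5,
      items: [
        { product: "Wireless Mouse", qty: 1 },
        { product: "Battery Pack", qty: 2 },
      ],
      status: "delivered",
    },
  ],
};

// Nested objects are reached by chaining field names.
const city = user.contact_details.address.city;

// Arrays of embedded documents can be flattened and aggregated in place.
const totalQty = user.recent_orders
  .flatMap((order) => order.items)
  .reduce((sum, item) => sum + item.qty, 0);

const totalSpent = user.recent_orders.reduce((sum, order) => sum + order.total, 0);

console.log(city, totalQty, totalSpent);
```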

5. Working with MongoDB (CRUD Operations)

Interaction with MongoDB is primarily done through the MongoDB Shell (mongosh) or drivers (Python, Node.js, Java).

1. Create (Insert)

Adding new documents to a collection.

JAVASCRIPT

// Insert a single document
db.users.insertOne({ name: "Alice", age: 25, city: "NYC" });

// Insert multiple documents
db.users.insertMany([
   { name: "Bob", age: 30 },
   { name: "Charlie", age: 35 }
]);

2. Read (Query)

Retrieving data using find().

JAVASCRIPT

// Find all documents
db.users.find();

// Find with filter (WHERE name = 'Alice')
db.users.find({ name: "Alice" });

// Comparison Operators: Find age > 26
// $gt = Greater Than, $lt = Less Than, $eq = Equal
db.users.find({ age: { $gt: 26 } });
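
To see what a filter document means, the following sketch evaluates the same style of filter against an in-memory array in plain JavaScript. It is not how MongoDB is implemented, it handles only top-level fields, and it supports only the three operators listed above; the matches function is a hypothetical helper.

```javascript
const comparators = {
  $eq: (a, b) => a === b,
  $gt: (a, b) => a > b,
  $lt: (a, b) => a < b,
};

// Returns true when `doc` satisfies every condition in `filter`.
function matches(doc, filter) {
  return Object.entries(filter).every(([field, condition]) => {
    const value = doc[field];
    const isOperatorObject =
      condition !== null && typeof condition === "object" && !Array.isArray(condition);
    if (!isOperatorObject) return value === condition; // { name: "Alice" } is shorthand for $eq
    return Object.entries(condition).every(([op, operand]) => {
      const compare = comparators[op];
      if (!compare) throw new Error(`unsupported operator: ${op}`);
      return value !== undefined && compare(value, operand);
    });
  });
}

const users = [
  { name: "Alice", age: 25, city: "NYC" },
  { name: "Bob", age: 30 },
  { name: "Charlie", age: 35 },
];

const olderThan26 = users.filter((u) => matches(u, { age: { $gt: 26 } }));
const between = users.filter((u) => matches(u, { age: { $gt: 26, $lt: 35 } }));
console.log(olderThan26.map((u) => u.name), between.map((u) => u.name));
```

Multiple operators on one field, as in the second query, are combined with a logical AND.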

3. Update

Modifying existing documents.

JAVASCRIPT

// Update the first matching document
// Uses Atomic Operators like $set to modify specific fields
db.users.updateOne(
   { name: "Bob" },        // Filter
   { $set: { age: 31 } }   // Update Action
);
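
The key property of $set is that it changes only the named fields and leaves the rest of the document alone. A small in-memory sketch of that behavior follows; applySet is a hypothetical helper, it covers plain fields and dot-separated paths only, and it copies the document rather than updating it in place as the server would.

```javascript
// Returns a copy of `doc` with the fields in `setSpec` applied.
function applySet(doc, setSpec) {
  const result = JSON.parse(JSON.stringify(doc)); // simple deep copy for plain data
  for (const [path, value] of Object.entries(setSpec)) {
    const keys = path.split(".");
    let target = result;
    for (const key of keys.slice(0, -1)) {
      // Create intermediate objects when the path does not exist yet.
      if (typeof target[key] !== "object" || target[key] === null) target[key] = {};
      target = target[key];
    }
    target[keys[keys.length - 1]] = value;
  }
  return result;
}

const before = { name: "Bob", age: 30, address: { city: "NYC" } };
const after = applySet(before, { age: 31, "address.zip": "10001" });
console.log(after);
```

Here name and address.city survive untouched, while age is overwritten and address.zip is added.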

4. Delete

Removing documents.

JAVASCRIPT

// Delete all users named Charlie
db.users.deleteMany({ name: "Charlie" });

6. Index Creation & Performance Comparison using EXPLAIN

In NoSQL, just like in SQL, indexes are crucial for performance. Without an index, MongoDB must perform a Collection Scan (scan every document) to find query matches.

Creating an Index

Indexes are created on specific fields to support queries.

JAVASCRIPT

// Create an index on the 'username' field (1 for ascending order)
db.users.createIndex({ username: 1 });

The `EXPLAIN` Command

The explain() method provides details on the execution plan of a query. It tells you whether the query used an index or scanned the whole collection.

Syntax:

JAVASCRIPT

db.users.find({ username: "john_doe" }).explain("executionStats");
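
The result of explain() is a nested document, and the interesting parts are the stages of the winning plan and the counters under executionStats. The sketch below walks a hand-written sample shaped like that output. The sample values are illustrative, the summarize and planStages helpers are hypothetical, and the exact shape of real output varies between server versions, so this is a reading aid rather than a parser.

```javascript
// Hand-written sample shaped like explain("executionStats") output.
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "username_1" },
    },
  },
  executionStats: {
    nReturned: 1,
    executionTimeMillis: 2,
    totalKeysExamined: 1,
    totalDocsExamined: 1,
  },
};

// Collect stage names from the outermost stage down through inputStage.
function planStages(plan) {
  const stages = [];
  for (let node = plan; node; node = node.inputStage) stages.push(node.stage);
  return stages;
}

function summarize(explainOutput) {
  const stages = planStages(explainOutput.queryPlanner.winningPlan);
  const { nReturned, totalDocsExamined, executionTimeMillis } = explainOutput.executionStats;
  return {
    stages,
    usesIndex: stages.includes("IXSCAN"),
    nReturned,
    totalDocsExamined,
    executionTimeMillis,
  };
}

const summary = summarize(sampleExplain);
console.log(summary);
```

A healthy query typically shows totalDocsExamined close to nReturned; a large gap between the two suggests a missing or unsuitable index.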

Performance Comparison

Metric	Without Index (COLLSCAN)	With Index (IXSCAN)
Stage	`COLLSCAN` (Collection Scan)	`IXSCAN` (Index Scan)
totalDocsExamined	High (e.g., 1,000,000 if 1M docs exist)	Low (e.g., 1 - specific match)
nReturned	1	1
executionTimeMillis	High (e.g., 500ms)	Low (e.g., 2ms)

Interpretation:

COLLSCAN: The engine had to examine every document in the collection to check whether it matched. Very slow for large datasets.
IXSCAN: The engine used the B-Tree index to jump directly to the record. Very fast.
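
The difference in documents examined can be reproduced with a small in-memory simulation. This is only an analogy in plain JavaScript: the collection is an array, and a Map stands in for the index. A real B-Tree also keeps keys sorted, which is what makes range queries and sorting efficient, and the Map does not model that. All names below are made up for the example.

```javascript
const docs = Array.from({ length: 100000 }, (_, i) => ({ _id: i, username: `user_${i}` }));

// COLLSCAN analogue: look at every document.
function collectionScan(collection, username) {
  let docsExamined = 0;
  const results = [];
  for (const doc of collection) {
    docsExamined++;
    if (doc.username === username) results.push(doc);
  }
  return { results, docsExamined };
}

// Index analogue: field value -> positions of matching documents.
function buildIndex(collection, field) {
  const index = new Map();
  collection.forEach((doc, position) => {
    const positions = index.get(doc[field]) ?? [];
    positions.push(position);
    index.set(doc[field], positions);
  });
  return index;
}

// IXSCAN analogue: fetch only the documents the index points to.
function indexScan(collection, index, value) {
  const positions = index.get(value) ?? [];
  const results = positions.map((p) => collection[p]);
  return { results, docsExamined: results.length };
}

const usernameIndex = buildIndex(docs, "username");
const scanned = collectionScan(docs, "user_99999");
const indexed = indexScan(docs, usernameIndex, "user_99999");
console.log(scanned.docsExamined, indexed.docsExamined);
```

Both approaches return the same document; the index simply avoids touching the other 99,999. The cost is extra storage and extra work on every write to keep the index current.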

[Figure: two-part (top and bottom) performance comparison diagram]

7. Vector Databases

Vector databases have emerged as a critical component of the modern AI/ML stack, distinct from standard NoSQL document stores.

Concept

Traditional databases query for exact matches (e.g., "Find user where ID = 5"). Vector databases are designed to store and query Vector Embeddings.

Embeddings: High-dimensional arrays of numbers (vectors) generated by machine-learning models (for example, BERT or the embedding models that accompany LLMs such as GPT-4) that represent the semantic meaning of text, images, or audio.

How it Works

Vectorization: Raw data (text/image) is converted into a vector (e.g., [0.1, -0.5, 0.8, ...]).
Storage: The database stores this vector.
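Similarity Search: When a user queries, the query is converted to a vector. The database searches for vectors that are mathematically "closest" to the query vector using distance metrics such as Cosine Similarity or Euclidean Distance, usually with an approximate nearest-neighbor (ANN) index so that it does not have to compare against every stored vector.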
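
The three steps above can be sketched with a brute-force search in plain JavaScript. The three-dimensional vectors are made up by hand and stand in for real embeddings, which typically have hundreds or thousands of dimensions; a production vector database would also avoid comparing the query against every stored vector. Function and variable names are chosen for the example.

```javascript
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function norm(a) {
  return Math.sqrt(dot(a, a));
}

// 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a, b) {
  return dot(a, b) / (norm(a) * norm(b));
}

// Straight-line distance; smaller means closer.
function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

// "Storage": each item keeps its vector next to its payload.
const items = [
  { label: "chair", vector: [0.9, 0.1, 0.0] },
  { label: "sofa", vector: [0.8, 0.3, 0.1] },
  { label: "banana", vector: [0.0, 0.2, 0.9] },
];

// "Similarity search": score every item and keep the k best.
function nearest(queryVector, collection, k) {
  return collection
    .map((item) => ({ label: item.label, score: cosineSimilarity(queryVector, item.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Pretend this is the embedding of "something to sit on".
const query = [1.0, 0.1, 0.0];
const top2 = nearest(query, items, 2);
console.log(top2.map((r) => r.label));
```

With these made-up vectors the furniture items rank ahead of the fruit, which is the behavior semantic search relies on.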

Use Cases

Semantic Search: Searching for "Something to sit on" returns "Chair" even if the word "sit" isn't in the product description.
Recommendation Systems: Finding items with similar feature vectors.
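GenAI Context (RAG, Retrieval-Augmented Generation): Retrieving relevant documents at query time and supplying them to LLMs (Large Language Models) as context, which acts as a form of long-term memory.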

Popular Vector Databases

Pinecone (Native Vector DB)
Milvus (Open Source)
MongoDB Atlas Vector Search (NoSQL + Vector capabilities)
