Unit6 - Subjective Questions
INT306 • Practice Questions with Detailed Answers
Differentiate between SQL (Relational) and NoSQL databases. Discuss their schema, scalability, and transaction models.
The differences between SQL and NoSQL databases can be categorized as follows:
1. Schema Architecture
- SQL: Follows a rigid, structured schema. Data must follow a defined structure (tables, columns, data types) before insertion. Changes to the schema are difficult once data is populated.
- NoSQL: Uses a dynamic or schema-less architecture. Data can be stored without a predefined structure, allowing for flexible modifications and varied data formats within the same collection.
2. Scalability
- SQL: Primarily supports Vertical Scaling (scaling up). This involves increasing the CPU, RAM, or SSD of a single server to handle more load.
- NoSQL: Designed for Horizontal Scaling (scaling out). This involves adding more servers to a distributed network (sharding) to handle increased traffic and data volume.
3. Transaction Model
- SQL: Follows ACID properties (Atomicity, Consistency, Isolation, Durability) strictly, ensuring high data integrity.
- NoSQL: Often follows the BASE model (Basically Available, Soft state, Eventual consistency), prioritizing availability and partition tolerance over immediate consistency (though many, like MongoDB, now support ACID transactions).
4. Query Language
- SQL: Uses Structured Query Language (SQL) which is standardized.
- NoSQL: Uses various data models (Document, Key-Value, Graph) and often proprietary APIs or JSON-based query methods.
Explain the structure of MongoDB. How are data organized in terms of Databases, Collections, and Documents?
MongoDB uses a hierarchy of data organization that differs from traditional relational databases. The structure consists of:
1. Database
- A physical container for collections. A single MongoDB server can hold multiple databases.
- Each database gets its own set of files on the file system.
2. Collection
- A collection is a group of MongoDB documents. It is the equivalent of an RDBMS table.
- Collections exist within a single database.
- Collections do not enforce a schema; documents within a collection can have different fields (polymorphism), though they usually share a similar purpose.
3. Document
- A document is a set of key-value pairs. It is the equivalent of an RDBMS row.
- MongoDB documents are stored in BSON (Binary JSON) format.
- Documents support dynamic schemas, meaning one document might have a field that another document in the same collection lacks.
Hierarchy Visual:
Server Databases Collections Documents (BSON).
What is DynamoDB? Describe its core components and primary key structure.
DynamoDB is a fully managed, serverless, key-value NoSQL database service provided by Amazon Web Services (AWS). It is designed to run high-performance applications at any scale.
Core Components:
- Tables: Similar to other databases, DynamoDB stores data in tables.
- Items: Each table contains zero or more items (similar to rows/records). An item is a group of attributes that is uniquely identifiable.
- Attributes: Each item is composed of one or more attributes (similar to columns/fields).
Primary Key Structure:
DynamoDB requires a primary key to uniquely identify each item. There are two types:
- Partition Key (Simple Primary Key): Composed of one attribute. DynamoDB uses the partition key's value as input to an internal hash function to determine where to store the item physically.
- Partition Key and Sort Key (Composite Primary Key): Composed of two attributes. The first is the Partition Key and the second is the Sort Key. All items with the same partition key are stored together and sorted by the sort key value.
Define Serverless Cloud Databases. What are the advantages of using a serverless approach for NoSQL databases?
Serverless Cloud Databases are database services where the cloud provider manages the underlying infrastructure, including capacity planning, installation, configuration, and patching. The developer does not provision or manage servers.
Key Characteristics:
- Auto-scaling: The database automatically scales storage and compute resources up or down based on request volume.
- Pay-per-use: Users are billed based on actual consumption (e.g., read/write units, storage used) rather than pre-allocated capacity.
Advantages:
- Reduced Operational Overhead: No need to manage OS patches, server maintenance, or manual scaling.
- Cost Efficiency: Eliminates the cost of idle resources. You pay only for what you use.
- High Availability: Built-in fault tolerance and multi-region replication are usually handled by the provider (e.g., AWS DynamoDB, Azure Cosmos DB).
- Elasticity: Handles sudden spikes in traffic seamlessly without manual intervention.
Discuss the concept of JSON Databases. Why is JSON the preferred format for NoSQL data representation?
JSON (JavaScript Object Notation) Databases store data in a format derived from JavaScript object syntax. Unlike the tabular format of SQL, JSON databases store data in semi-structured documents.
Why JSON is preferred for NoSQL:
- Flexibility (Schema-less): JSON allows nested structures (arrays and objects within objects), enabling complex data representation in a single record without needing joins.
- Language Independence: While derived from JavaScript, JSON is language-agnostic. Parsers exist for virtually every programming language (Python, Java, C#, etc.).
- Web Native: JSON is the standard format for API communication (REST/GraphQL). Storing data in JSON (or BSON) eliminates the need for complex Object-Relational Mapping (ORM) translations.
- Human Readable: The text-based format is easy for developers to read and debug.
- Lightweight: It has a lower overhead compared to XML, making data transmission faster.
Create a JSON representation for a dataset describing a 'University Course'.
The dataset should include:
- Course Name and Code
- Credits (integer)
- Professor details (Name, ID)
- Enrolled Students (List of objects with Student Name and ID)
Below is the JSON representation of the 'University Course' dataset:
{
"courseCode": "CS101",
"courseName": "Introduction to NoSQL Databases",
"credits": 4,
"department": "Computer Science",
"professor": {
"facultyId": "PROF_552",
"name": "Dr. Alan Turing",
"email": "alan.turing@university.edu"
},
"enrolledStudents": [
{
"studentId": "STU_001",
"name": "John Doe",
"grade": "A"
},
{
"studentId": "STU_002",
"name": "Jane Smith",
"grade": "B+"
}
],
"isActive": true
}
Explanation:
- Root Object: The curly braces
{}define the document. - Scalar Fields:
courseCode,creditsstore simple values. - Embedded Document:
professoris an object nested within the main document. - Array of Objects:
enrolledStudentsis an array[]containing multiple student objects, demonstrating the one-to-many relationship embedded in a single document.
Explain the importance of Indexing in MongoDB. How does it affect the performance of read and write operations?
Indexing in MongoDB is a data structure that holds a portion of the data set in an easy-to-traverse form. Without indexes, MongoDB must perform a collection scan, meaning it scans every document in a collection to select those that match the query statement.
Importance:
- Efficiency: Indexes drastically reduce the amount of work required to satisfy a query by limiting the number of documents the database inspects.
- Sorting: Indexes can be used to efficiently sort results without requiring massive memory usage.
Performance Impact:
- Read Operations (Select): Performance improves significantly. An indexed query maps directly to the location of the document, changing the time complexity from to .
- Write Operations (Insert/Update/Delete): Performance decreases slightly. Every time a document is inserted or updated, the associated indexes must also be updated. Therefore, having too many indexes can slow down write-heavy applications.
Describe the explain() method in MongoDB. How is it used to compare performance between an indexed and an unindexed query?
The explain() method in MongoDB provides information on the execution plan of a query. It returns details about the query path, whether an index was used, and execution statistics.
Syntax:
db.collection.find({ query }).explain("executionStats")
Performance Comparison Parameters:
When analyzing the output of explain(), we look at specific fields in the executionStats section to compare performance:
- totalDocsExamined:
- Unindexed: This number will equal the total number of documents in the collection (Collection Scan).
- Indexed: This number will equal the number of documents that actually match the query (IXSCAN).
- executionTimeMillis:
- Unindexed: High execution time due to scanning the full disk.
- Indexed: Low execution time as the B-Tree index is traversed.
- stage:
- Unindexed: Shows
COLLSCAN. - Indexed: Shows
IXSCAN.
- Unindexed: Shows
Conclusion: By observing the drastic drop in totalDocsExamined and executionTimeMillis after creating an index, explain() validates the performance optimization.
What are Vector Databases? Explain the concept of 'Embeddings' and how Vector Databases differ from traditional scalar databases.
Vector Databases are specialized databases designed to store, manage, and index high-dimensional vectors (lists of numbers). They are crucial for AI and Machine Learning applications, specifically for similarity searches.
Concept of Embeddings:
- Unstructured data (text, images, audio) is converted into numerical vectors using machine learning models (like Neural Networks). These vectors are called Embeddings.
- Embeddings capture the semantic meaning of the data. For example, the vector for "King" minus "Man" plus "Woman" might result in a vector close to "Queen".
Difference from Scalar Databases:
- Data Type: Scalar databases store exact values (Strings, Integers). Vector databases store arrays of floating-point numbers (e.g.,
[0.12, -0.98, 0.44, ...]). - Search Mechanism:
- Scalar: Uses exact matching or keyword matching (e.g.,
WHERE id = 5). - Vector: Uses Approximate Nearest Neighbor (ANN) search. It calculates the geometric distance (Euclidean, Cosine Similarity) between vectors to find items that are semantically similar, not just textually identical.
- Scalar: Uses exact matching or keyword matching (e.g.,
- Use Case: Scalar is for transactional apps; Vector is for recommendation engines, image search, and LLM (Large Language Model) memory.
Demonstrate Working with MongoDB by writing syntax for the following CRUD operations:
- Inserting a single document.
- Finding specific documents.
- Updating a field.
- Deleting a document.
1. Create (Insert)
To insert a single document into a collection named users:
javascript
db.users.insertOne({
name: "Alice",
age: 25,
city: "New York"
})
2. Read (Find)
To find all users who are 25 years old:
javascript
db.users.find({ age: 25 })
To find all documents (equivalent to SELECT *):
javascript
db.users.find({})
3. Update
To update the city of the user named "Alice" (uses the $set operator to avoid overwriting the whole document):
javascript
db.users.updateOne(
{ name: "Alice" }, // Filter
{ $set: { city: "Los Angeles" } } // Update Action
)
4. Delete
To delete the user named "Alice":
javascript
db.users.deleteOne({ name: "Alice" })
Compare BSON and JSON. Why does MongoDB use BSON for storage instead of standard JSON?
JSON (JavaScript Object Notation) is a text-based format that is human-readable. BSON (Binary JSON) is the binary-encoded serialization of JSON-like documents.
Comparison:
| Feature | JSON | BSON |
|---|---|---|
| Format | Text-based (String) | Binary-encoded |
| Readability | Human-readable | Machine-readable (efficient) |
| Data Types | Limited (String, Number, Boolean, Array, Object) | Richer (Date, Binary, Regular Expression, Integer vs Float) |
| Size | Can be bulky due to repeated field names in text | Optimized for space (though sometimes larger due to length prefixes) |
Why MongoDB uses BSON:
- Traversability: BSON contains length prefixes, allowing MongoDB to skip over fields to find a specific byte efficiently without scanning the whole document.
- Data Types: JSON does not distinguish between integers and floating-point numbers, nor does it support Dates natively (they are stored as strings). BSON supports these types directly, which is crucial for database precision.
- Speed: Binary encoding allows for faster encoding and decoding (serialization/deserialization) by the network and drivers.
Explain the concept of Sharding in MongoDB and how it relates to horizontal scaling.
Sharding is the method MongoDB uses to distribute data across multiple machines. It is the implementation of Horizontal Scaling.
Why Sharding is needed:
When datasets become extremely large or throughput becomes high, a single server may lack sufficient storage or CPU capacity. Vertical scaling (upgrading the server) has a physical limit and becomes expensive.
How Sharding Works:
- Shards: A shard is a single MongoDB instance (or replica set) that holds a subset of the total data. All shards together hold the complete dataset.
- Shard Key: MongoDB distributes data based on a shard key (a field in the document).
- Chunks: Data is divided into chunks based on ranges of the shard key.
- Balancer: MongoDB automatically moves chunks between shards to ensure an even load distribution.
Relationship to Horizontal Scaling:
Sharding allows the database to scale out indefinitely. As data grows, administrators simply add more servers (shards) to the cluster, and MongoDB redistributes the data.
In the context of Working with MongoDB, explain the use of Comparison Query Operators with examples (lt, $in).
Query operators in MongoDB provide flexibility to match specific conditions beyond simple equality.
1. $gt (Greater Than)
Selects documents where the value of a field is greater than a specified value.
Example: Find products costing more than 100.
javascript
db.products.find({ price: { $gt: 100 } })
2. $lt (Less Than)
Selects documents where the value is less than a specified value.
Example: Find users younger than 18.
javascript
db.users.find({ age: { $lt: 18 } })
3. $in (In Array)
Selects documents where the value of a field equals any value in a specified array.
Example: Find orders with status 'Shipped' or 'Delivered'.
javascript
db.orders.find({ status: { $in: ["Shipped", "Delivered"] } })
These operators can be combined. For example, to find a price between 50 and 100:
javascript
db.products.find({ price: { lt: 100 } })
Describe the Architecture of Vector Search. How does a query vector find its nearest neighbors in a large dataset?
Vector search architecture relies on mathematical operations to find similarity rather than exact matches.
Process:
- Vectorization: Both the raw data (stored in the DB) and the user's search query are converted into high-dimensional vectors (Embeddings) using the same machine learning model.
- Indexing (ANN - Approximate Nearest Neighbor): To avoid comparing the query vector with every stored vector (which is computationally expensive, ), vector databases use specialized indexes like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index).
- Distance Calculation: The database calculates the distance between the query vector and the candidate vectors using metrics like:
- Euclidean Distance (): Straight-line distance.
- Cosine Similarity: Measures the cosine of the angle between two vectors (focuses on direction/semantic similarity rather than magnitude).
- Dot Product: Projection of one vector onto another.
- Ranking: The results are ranked by proximity (shortest distance or highest similarity score) and returned to the user.
Discuss the Data Modeling Strategy in NoSQL: Embedding vs Referencing. When should you use one over the other?
In NoSQL (specifically Document stores like MongoDB), there are two ways to model relationships between data:
1. Embedding (Denormalization)
- Concept: Storing related data within a single document (nested objects or arrays).
- Example: A
Userdocument contains anaddressobject inside it. - Pros: Improved read performance. One query retrieves the parent and all related data (no joins needed).
- When to use: Use when the data represents a "contains" relationship (one-to-few), and the embedded data is not frequently updated in isolation or referenced by many other documents.
2. Referencing (Normalization)
- Concept: Storing related data in separate documents and linking them via unique IDs (similar to Foreign Keys in SQL).
- Example: A
Reviewdocument stores aproduct_idwhich points to aProductdocument. - Pros: Reduces data duplication. Ideal for large hierarchical data.
- When to use: Use when the relationship is "one-to-many" (where 'many' is very large) or "many-to-many", or when data is frequently updated and you want to avoid consistency issues across duplicates.
What are the limitations of NoSQL Databases compared to traditional SQL databases?
While NoSQL offers scalability and flexibility, it has limitations:
- Lack of Standardization: Unlike SQL, which is a standard language, NoSQL databases vary wildly (MongoDB, Cassandra, DynamoDB). Learning one does not guarantee knowledge of another.
- Consistency Issues: Many NoSQL databases follow the Eventual Consistency model (BASE) to achieve high availability. This makes them unsuitable for applications requiring strict real-time consistency (e.g., banking ledgers).
- Limited Query Capabilities: Complex joins and ad-hoc aggregations are often difficult or computationally expensive in NoSQL compared to the powerful
JOINandGROUP BYcapabilities of SQL. - Data Integrity: Most NoSQL databases rely on the application code to enforce integrity constraints (like foreign keys or data types), whereas SQL databases enforce this at the database level.
- Tooling and Community: While growing, the maturity of business intelligence (BI) and analytics tools for NoSQL is often lower than for SQL systems.
Explain the concept of Compound Indexes in MongoDB. How does the order of fields in the index definition affect query performance?
Compound Indexes are indexes that support queries on multiple fields.
Definition:
Example: db.collection.createIndex({ "lastname": 1, "age": -1 })
This creates an index where data is sorted first by lastname in ascending order (1), and then, within each lastname, by age in descending order (-1).
Impact of Field Order (Prefixes):
The order of fields matters significantly. An index on { A: 1, B: 1 } can support queries on:
- Field
Aalone. - Field
AAND FieldBtogether.
It cannot efficiently support queries on Field B alone. This is known as the Index Prefix rule. The query optimizer can only use the index if the query filters match the starting (prefix) fields of the index.
Sorting: The sort direction (1 vs -1) matters in compound indexes if you are sorting on multiple fields in different directions. For single-field indexes, direction doesn't affect selection.
How does DynamoDB manage Throughput Capacity? Explain Read Capacity Units (RCU) and Write Capacity Units (WCU).
DynamoDB manages performance and billing through provisioned throughput capacity, defined by RCUs and WCUs.
1. Read Capacity Unit (RCU):
- One RCU represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size.
- Math: If you need to read 10 items per second, and each item is 2KB.
- Strongly Consistent: .
- Eventually Consistent: .
2. Write Capacity Unit (WCU):
- One WCU represents one write per second for an item up to 1 KB in size.
- Math: If you need to write 5 items per second, and each item is 1KB, you need 5 WCUs. If items are 2KB, you need WCUs.
Modes:
- Provisioned Mode: You specify the RCU/WCU in advance (good for predictable traffic).
- On-Demand Mode: DynamoDB automatically scales RCU/WCU to match traffic (good for unpredictable traffic).
Derive the JSON structure for a Social Media Post that includes comments and likes. Explain why a Document store is suitable for this use case.
JSON Structure:
{
"postId": "POST_998877",
"author": "user_123",
"content": "Enjoying the new database course!",
"timestamp": "2023-10-27T10:00:00Z",
"tags": ["#learning", "#nosql", "#tech"],
"stats": {
"views": 150,
"shares": 10
},
"likes": [
{ "userId": "user_444", "timestamp": "..." },
{ "userId": "user_555", "timestamp": "..." }
],
"comments": [
{
"commentId": "c1",
"author": "user_999",
"text": "Great post!",
"replies": []
}
]
}
Suitability:
- Polymorphism: Different posts might have different structures (e.g., a photo post vs. a text post). JSON handles this natively.
- Locality: When rendering a feed, an application typically needs the post content, the first few comments, and the like count simultaneously. Storing them in a single document (Embedding) means only one database read is required per post, ensuring high performance for the news feed.
Analyze the role of Projection in MongoDB queries. Write a query to find all documents in a 'student' collection but return only the student's name and grade, excluding the _id.
Role of Projection:
Projection determines which fields are returned in the matching documents. By default, MongoDB returns all fields in a document. Projection allows you to limit the amount of data transferred over the network, which reduces latency and memory usage, especially if documents contain large fields (like embedded images or large text blocks).
Syntax:
db.collection.find( { query }, { projection } )
Query Example:
Assume we want all students (empty query {}).
We want to include name and grade (set to 1).
We want to exclude _id (set to 0). Note: _id is included by default unless explicitly excluded.
javascript
db.student.find(
{},
{
name: 1,
grade: 1,
_id: 0
}
)