Key Points about NoSQL:
- NoSQL databases are not built primarily on tables, and generally do not use SQL for data manipulation.
- NoSQL database systems are often highly optimized for retrieval and appending operations and often offer little functionality beyond record storage (e.g. key–value stores).
- They may not provide full ACID (atomicity, consistency, isolation, durability) guarantees.
- NoSQL database management systems are useful when working with huge quantities of data whose nature does not require a relational model.
- The data can be structured, but NoSQL is used when what really matters is the ability to store and retrieve great quantities of data, not the relationships between the elements.
- They don’t follow a fixed schema.
- NoSQL has a distributed, fault-tolerant architecture.
- The database typically scales horizontally and is used for managing large amounts of data, when performance and real-time behaviour are more important than consistency.
- They can be categorized into the following categories: key–value stores, BigTable implementations, document store databases, and graph databases.
Relational Database vs. NoSQL
No Query Language:
SQL is a standard query language used in relational databases. Cassandra has no query language. It does have an API that you access through its RPC serialization mechanism, Thrift.
No Referential Integrity:
Cassandra has no concept of referential integrity, and therefore has no concept of joins. In a relational database, you could specify foreign keys in a table to reference the primary key of a record in another table. But Cassandra does not enforce this. It is still a common design requirement to store IDs related to other entities in your tables, but operations such as cascading deletes are not available.
Secondary Indexes:
Here's why secondary indexes are a feature: say that you want to find the unique ID for a hotel property. In a relational database, you might use a query like this:
SELECT hotelID FROM Hotel WHERE name = 'Clarion Midtown';
This is the query you'd have to use if you knew the name of the hotel you were looking for but not the unique ID. When handed a query like this, a relational database will perform a full table scan, inspecting each row's name column to find the value you're looking for. But this can become very slow once your table grows very large. The relational answer to this is to create an index on the name column, which acts as a copy of the data that the relational database can look up very quickly. Because hotelID already has a unique primary key constraint, it is automatically indexed, and that is the primary index; creating another index on the name column would constitute a secondary index, which Cassandra does not currently support.
To achieve the same thing in Cassandra, you create a second column family that holds the lookup data. You create one column family to store the hotel names, and map them to their IDs. The second column family acts as an explicit secondary index.
Note:
Support for secondary indexes is currently being added to Cassandra 0.7. This allows you to create indexes on column values. So, if you want to see all the users who live in a given city, for example, secondary index support will save you from doing it from scratch.
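In CQL terms, that example would look something like the sketch below (the users table, its columns, and the sample city are hypothetical):

-- Hypothetical users table with a secondary index on a column value.
CREATE TABLE users (
    user_id text PRIMARY KEY,
    name text,
    city text
);

CREATE INDEX users_city_idx ON users (city);

-- The index supports lookups by city without a manual lookup table:
SELECT * FROM users WHERE city = 'Austin';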
Sorting Is a Design Decision
In an RDBMS, you can easily change the order in which records are returned to you by using ORDER BY in your query. The default sort order is not configurable; by default, records are returned in the order in which they are written. If you want to change the order, you just modify your query, and you can sort by any list of columns. In Cassandra, however, sorting is treated differently; it is a design decision. Column family definitions include a CompareWith element, which dictates the order in which your columns will be sorted on reads, but this is not configurable per query.
Where an RDBMS constrains you to sorting based on the data type stored in the column, Cassandra only stores byte arrays, so that approach doesn't make sense. What you can do, however, is sort as if the column were one of several different types (ASCII, long integer, TimeUUID, lexicographic, etc.). You can also use your own pluggable comparator for sorting if you wish.
Otherwise, there is no support for ORDER BY and GROUP BY statements in Cassandra as there is in SQL. There is a query type called a SliceRange, which is similar to ORDER BY in that it allows a reversal. By default, Cassandra sorts the data as soon as you store it in the database, and it remains sorted. This gives you an enormous performance boost.
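In CQL 3 terms, this design-time decision shows up as a table's clustering order; the sketch below assumes a hypothetical hotel_reviews table:

-- Sorting is fixed when the table is defined, not per query: within
-- each hotel_id partition, rows are stored newest-first.
CREATE TABLE hotel_reviews (
    hotel_id text,
    review_time timestamp,
    review text,
    PRIMARY KEY (hotel_id, review_time)
) WITH CLUSTERING ORDER BY (review_time DESC);

-- Reads come back already sorted; reversing the stored order (much
-- like a SliceRange reversal) is the only per-query choice:
SELECT review_time, review FROM hotel_reviews WHERE hotel_id = 'h1001';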
Denormalization
In relational database design, we are often taught the importance of normalization. This is not an advantage when working with Cassandra because it performs best when the data model is denormalized. It is often the case that companies end up denormalizing data in a relational database. There are two common reasons for this. One is performance. Companies simply can't get the performance they need when they have to do so many joins on years' worth of data, so they denormalize along the lines of known queries. This ends up working, but goes against the grain of how relational databases are intended to be designed, and ultimately makes one question whether using a relational database is the best approach in these circumstances.
A second reason that relational databases get denormalized on purpose is a business document structure that requires retention. That is, you have an enclosing table that refers to a lot of external tables whose data could change over time, but you need to preserve the enclosing document as a snapshot in history. The common example here is with invoices. You already have Customer and Product tables, and you'd think that you could just make an invoice that refers to those tables. But this should never be done in practice. Customer or price information could change, and then you would lose the integrity of the Invoice document as it was on the invoice date, which could violate audits, reports, or laws, and cause other problems.
When you distribute data over many machines, doing joins at read time is expensive in the general case (compared to what can be done on a single host), as you might have to join over data that is not stored on the same physical host. That is why Cassandra has always encouraged denormalization instead of joins.
Not using joins, however, has the drawback of making some simple patterns less elegant. Consider the case where you want to allow users to have multiple email addresses. In a relational database, the canonical way to do that would be to create an email_addresses table with a many-to-one relationship to users, which implies a join. So in Cassandra, you would traditionally denormalize that as multiple columns: email1, email2, etc. While this is usually fine from a performance standpoint (because both adding new columns and having columns without values is virtually free in Cassandra), it is tedious to use, not very natural, and has a few drawbacks, like forcing you to do a read before adding a new email address (to know which column name to use).
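A minimal sketch of this denormalized pattern in CQL 3 (the users table, its columns, and the sample values are hypothetical):

-- Emails denormalized into numbered columns instead of a joined table.
CREATE TABLE users (
    user_id text PRIMARY KEY,
    email1 text,
    email2 text
);

-- Adding another address means first reading the row to find a free
-- slot (the read-before-write drawback), then writing into it:
SELECT email1, email2 FROM users WHERE user_id = 'jdoe';
UPDATE users SET email2 = 'jdoe@example.org' WHERE user_id = 'jdoe';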
Cassandra Introduction:
PROS
The main problems that a NoSQL database aims to solve typically revolve around issues of scale. When data no longer fits on a single MySQL server, or when a single machine can no longer handle the query load, some strategy for sharding and replication is required.
The pitch behind most NoSQL databases like Cassandra is that because they were designed from the ground up to be distributed and to handle large data volumes, they can provide some combination of the following benefits that a simple installation of MySQL or Postgres can't easily offer:
- Automatic sharding of data. New data gets automatically assigned to the appropriate node.
- Automatic replication of data. Multiple nodes each store a copy of the data, up to a certain configured replication factor.
- Schema-less data for simpler migrations. Schema changes for large tables can take a long time and lock the tables, blocking any writes. A database with only a loosely defined schema (like Cassandra's and HBase's column families) or none at all (as in key/value stores) should make this easier.
- Automatic scalability by adding new nodes. Adding new nodes automatically re-partitions the data for load balancing purposes.
- Multiple nodes that can accept writes. Unlike a standard MySQL master/slave setup, multiple nodes in a NoSQL database can accept updates, thereby supporting much higher query throughput.
CONS
- All data for a single row must fit (on disk) on a single machine in the cluster. Because row keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound.
- A single column value may not be larger than 2GB. (However, large values are read into memory when requested, so in practice a "small number of MB" is more appropriate.)
- The maximum number of columns per row is 2 billion.
- Keys (and column names) must be under 64 KB.
- Cassandra has two levels of indexes: key and column. But in super column families there is a third level of subcolumns; these are not indexed, and any request for a subcolumn deserializes _all_ the subcolumns in that supercolumn. So you want to avoid a data model that requires large numbers of subcolumns. Composite columns do not have this limitation.
- Cassandra's public API is based on Thrift, which offers no streaming abilities -- any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265. As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and make a file correspond to a row, with the chunks as column values (a sketch of this follows the list).
For more details, please refer to: http://www.datastax.com/dev/blog/binary-protocol
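One possible shape for that chunking workaround, sketched in CQL 3 (the table, its columns, and the sample values are assumptions, not an established API):

-- One row per file; each chunk is stored as a separate column value.
CREATE TABLE file_chunks (
    file_id text,
    chunk_index int,
    data blob,
    PRIMARY KEY (file_id, chunk_index)
);

-- The application splits the file into pieces of a comfortable size
-- (for example 64MB) and writes each piece under the same row key:
INSERT INTO file_chunks (file_id, chunk_index, data)
VALUES ('report.pdf', 0, 0x89504e47);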
Cassandra data model:
In Cassandra, you define column families. Column families can (and should) define metadata about the columns, but the actual columns that make up a row are determined by the client application. Each row can have a different set of columns. There are two types of column families:
Static Column Families:
A static column family uses a relatively static set of column names and is similar to a relational database table. For example, a column family storing user data might have columns for the user name, address, email, phone number, and so on. Although the rows generally have the same set of columns, they are not required to have all of the columns defined. Static column families typically have column metadata pre-defined for each column.
Dynamic Column Families:
A dynamic column family takes advantage of Cassandra's ability to use arbitrary application-supplied column names to store data. A dynamic column family allows you to pre-compute result sets and store them in a single row for efficient data retrieval. Instead of defining metadata for individual columns, a dynamic column family defines the type information for column names and values, but the actual column names and values are set by the application when a column is inserted.
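A sketch of the distinction in CQL 3 terms (both tables and all column names below are hypothetical): a static column family maps to a table with fixed columns, while a dynamic column family maps to a wide row whose "column names" come from a clustering key.

-- Static: a fixed, relational-style set of columns per row.
CREATE TABLE users (
    user_id text PRIMARY KEY,
    name text,
    address text,
    email text
);

-- Dynamic: the application supplies the "column names" (here, the
-- follower IDs), pre-computing a result set inside one wide row.
CREATE TABLE followers (
    user_id text,
    follower_id text,
    PRIMARY KEY (user_id, follower_id)
);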
Column families can have the following types of columns:
Standard Columns:
A column is a tuple containing a name, a value, and a timestamp. A column must have a name, and the name can be static or it can be dynamically set when the column is created by your application.
Composite:
Composite columns comprise fully denormalized wide rows by using composite primary keys. You create and query composite columns using CQL 3.
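For instance, a minimal CQL 3 sketch (the table and its columns are hypothetical):

-- The composite primary key (sensor_id, reading_time) produces one
-- wide row per sensor, with one composite column per reading.
CREATE TABLE sensor_readings (
    sensor_id text,
    reading_time timestamp,
    value double,
    PRIMARY KEY (sensor_id, reading_time)
);

SELECT reading_time, value FROM sensor_readings WHERE sensor_id = 's42';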
Expiring:
A column can also have an optional expiration date called a TTL (time to live). Whenever a column is inserted, the client request can specify an optional TTL value, defined in seconds, for the column. TTL columns are marked as deleted after the requested amount of time has expired. Once they are marked with a tombstone, they are automatically removed during the normal compaction and repair processes.
If you want to change the TTL of an expiring column, you have to re-insert the column with a new TTL. In Cassandra, the insertion of a column is actually an insertion or update operation, depending on whether or not a previous version of the column exists.
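In CQL, the TTL is attached per request; a small sketch (the sessions table and sample values are hypothetical):

CREATE TABLE sessions (
    session_id text PRIMARY KEY,
    login_token text
);

-- The login_token column expires 86400 seconds (one day) after insert.
INSERT INTO sessions (session_id, login_token)
VALUES ('s1', 'abc123')
USING TTL 86400;

-- Changing the TTL means re-inserting the column with a new TTL:
INSERT INTO sessions (session_id, login_token)
VALUES ('s1', 'abc123')
USING TTL 172800;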
Counter:
A counter is a special kind of column used to store a number that
incrementally counts the occurrences of a particular event or process.
For example, you might use a counter column to count the number of times
a page is viewed.
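A brief CQL sketch of that page-view example (the table and column names are hypothetical; note that in a counter table, all non-key columns must be counters):

CREATE TABLE page_views (
    page text PRIMARY KEY,
    views counter
);

-- Counters are modified with UPDATE, never INSERT:
UPDATE page_views SET views = views + 1 WHERE page = '/home';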
Super:
Reading any sub-column of a super column causes the entire super column and all of its sub-columns to be read into memory for each read request. This results in severe performance issues. Super columns are not supported in CQL 3.
Keyspaces:
A keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together.
Replication is controlled on a per-keyspace basis, so data that has different replication requirements should reside in different keyspaces.
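For example, in CQL (using the map-style syntax of more recent versions; the keyspace name and replication factor are assumptions):

-- Replication is set per keyspace, not per column family:
CREATE KEYSPACE hotel_app
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};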
Concepts: CAP
The CAP theorem states that you have to pick two of Consistency, Availability, and Partition tolerance: you can't have all three at the same time and get acceptable latency.
Consistency: Every read sees the latest piece of data.
Availability: If one or more nodes in the cluster go down, the system continues to operate.
Partition tolerance: If messages are lost between a couple of the nodes, the system again continues to operate.
Read/Write Operations in Cassandra:
Cassandra is a peer-to-peer, read/write-anywhere architecture, so any user can connect to any node in any data centre and read/write the data they need, with all writes being partitioned and replicated for them automatically throughout the cluster.
Write Operation:
Data is first written into the commit log for durability purposes, then written to a memtable in memory. Once the memtable becomes full, the data gets flushed into an SSTable.
Writes are atomic at the row level: all columns are written or updated, or none are.
RDBMS-style transactions aren't supported.
Write Strategies:
Any: A write must succeed on any available node. It doesn't matter if that particular node is responsible for storing that type of data.
One: A write must succeed on any node responsible for storing that row (either primary or replica).
Quorum: A write must succeed on a quorum of replica nodes, determined by (replication_factor / 2) + 1 (see the sketch after this list).
Local Quorum: A write must succeed on a quorum of replica nodes in the same data centre as the coordinator node.
Each Quorum: A write must succeed on a quorum of replica nodes in all data centres.
All: A write must succeed on all replica nodes for a row key.
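As an illustration, in cqlsh the consistency level is set per session before issuing the write (a sketch; the users table and values are hypothetical):

-- Require a quorum of replicas to acknowledge subsequent writes:
CONSISTENCY QUORUM;
INSERT INTO users (user_id, name) VALUES ('jdoe', 'John Doe');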
Hinted Handoffs:
Cassandra tries to write a row to all replicas for that row. If some replica nodes aren't available, a hint gets stored on another node so that the down nodes can be updated once they are back. If no replicas are available for that row, use of the ANY consistency level instructs the coordinator node to store the hint and the raw data for the down node, and to update it once it's back.
Read Operation:
Read Strategies:
One: Reads from the closest node holding the data.
Quorum: Returns a result from a quorum of servers with the most recent timestamp for the data.
Local Quorum: Returns a result from a quorum of servers with the most recent timestamp for the data, within the same data centre as the coordinator node.
Each Quorum: Returns a result from a quorum of servers with the most recent timestamp in all the data centres.
All: Returns a result from all replica nodes for a row key.
Read Repair:
Ensures frequently read data remains consistent across all nodes.
When a data read operation completes on one node, it is compared with the data on the other replica nodes for that row, and if they are inconsistent, a write operation is performed on the out-of-date nodes to update the row to reflect the most recent changes. Read repair can be configured per column family and is enabled by default.