Key Points about NoSQL:
- NoSQL databases are not built primarily on tables, and generally do not use SQL for data manipulation.
- NoSQL database systems are often highly optimized for retrieval and appending operations and often offer little functionality beyond record storage (e.g. key–value stores).
- They may not provide full ACID (atomicity, consistency, isolation, durability) guarantees.
- NoSQL database management systems are useful when working with huge quantities of data whose nature does not require a relational model.
- The data can be structured, but NoSQL is used when what really matters is the ability to store and retrieve great quantities of data, not the relationships between the elements.
- They don’t follow a fixed schema.
- NoSQL has a distributed, fault-tolerant architecture.
- The database typically scales horizontally and is used for managing large amounts of data, when performance and real-time behaviour are more important than consistency.
- They can be categorized into the following categories: key–value stores, BigTable implementations, document store databases, and graph databases.
Relational Database vs. NoSQL
No Query Language:
SQL is a standard query language used in relational databases. Cassandra has no query language. It does have an API that you access through its RPC serialization mechanism, Thrift.
No Referential Integrity:
Cassandra has no concept of referential integrity, and therefore has no concept of joins. In a relational database, you could specify foreign keys in a table to reference the primary key of a record in another table. But Cassandra does not enforce this. It is still a common design requirement to store IDs related to other entities in your tables, but operations such as cascading deletes are not available.
Secondary Indexes:
Here's why secondary indexes are a feature: say that you want to find the unique ID for a hotel property. In a relational database, you might use a query like this:
SELECT hotelID FROM Hotel WHERE name = 'Clarion Midtown';
This is the query you'd have to use if you knew the name of the hotel you were looking for but not the unique ID. When handed a query like this, a relational database will perform a full table scan, inspecting each row's name column to find the value you're looking for. But this can become very slow once your table grows very large. The relational answer to this is to create an index on the name column, which acts as a copy of the data that the relational database can look up very quickly. Because hotelID already has a unique primary key constraint, it is automatically indexed, and that is the primary index; creating another index on the name column would constitute a secondary index, which Cassandra does not currently support.
To achieve the same thing in Cassandra, you create a second column family that holds the lookup data. You create one column family to store the hotel names, and map them to their IDs. The second column family acts as an explicit secondary index.
Note:
Support for secondary indexes is currently being added to Cassandra 0.7. This allows you to create indexes on column values. So, if you want to see all the users who live in a given city, for example, secondary index support will save you from doing it from scratch.
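In CQL terms, that example would look something like the sketch below (the users table, its columns, and the sample city are hypothetical):

-- Hypothetical users table with a secondary index on a column value.
CREATE TABLE users (
    user_id text PRIMARY KEY,
    name text,
    city text
);

CREATE INDEX users_city_idx ON users (city);

-- The index supports lookups by city without a manual lookup table:
SELECT * FROM users WHERE city = 'Austin';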
Sorting Is a Design Decision
In an RDBMS, you can easily change the order in which records are returned to you by using ORDER BY in your query. The default sort order is not configurable; by default, records are returned in the order in which they are written. If you want to change the order, you just modify your query, and you can sort by any list of columns. In Cassandra, however, sorting is treated differently; it is a design decision. Column family definitions include a CompareWith element, which dictates the order in which your columns will be sorted on reads, but this is not configurable per query.
Where an RDBMS constrains you to sorting based on the data type stored in the column, Cassandra only stores byte arrays, so that approach doesn't make sense. What you can do, however, is sort as if the column were one of several different types (ASCII, long integer, TimeUUID, lexicographic, etc.). You can also use your own pluggable comparator for sorting if you wish.
Otherwise, there is no support for ORDER BY and GROUP BY statements in Cassandra as there is in SQL. There is a query type called a SliceRange, which is similar to ORDER BY in that it allows a reversal. By default, Cassandra sorts the data as soon as you store it in the database, and it remains sorted. This gives you an enormous performance boost.
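In CQL 3 terms, this design-time decision shows up as a table's clustering order; the sketch below assumes a hypothetical hotel_reviews table:

-- Sorting is fixed when the table is defined, not per query: within
-- each hotel_id partition, rows are stored newest-first.
CREATE TABLE hotel_reviews (
    hotel_id text,
    review_time timestamp,
    review text,
    PRIMARY KEY (hotel_id, review_time)
) WITH CLUSTERING ORDER BY (review_time DESC);

-- Reads come back already sorted; reversing the stored order (much
-- like a SliceRange reversal) is the only per-query choice:
SELECT review_time, review FROM hotel_reviews WHERE hotel_id = 'h1001';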
Denormalization
In relational database design, we are often taught the importance of normalization. This is not an advantage when working with Cassandra because it performs best when the data model is denormalized. It is often the case that companies end up denormalizing data in a relational database. There are two common reasons for this. One is performance. Companies simply can't get the performance they need when they have to do so many joins on years' worth of data, so they denormalize along the lines of known queries. This ends up working, but goes against the grain of how relational databases are intended to be designed, and ultimately makes one question whether using a relational database is the best approach in these circumstances.
A second reason that relational databases get denormalized on purpose is a business document structure that requires retention. That is, you have an enclosing table that refers to a lot of external tables whose data could change over time, but you need to preserve the enclosing document as a snapshot in history. The common example here is with invoices. You already have Customer and Product tables, and you'd think that you could just make an invoice that refers to those tables. But this should never be done in practice. Customer or price information could change, and then you would lose the integrity of the Invoice document as it was on the invoice date, which could violate audits, reports, or laws, and cause other problems.
When you distribute data over many machines, doing joins at read time is expensive in the general case (compared to what can be done on a single host), as you might have to join over data that is not stored on the same physical host. That is why Cassandra has always encouraged denormalization instead of joins.
Not using joins, however, has the drawback of making some simple patterns less elegant. Consider the case where you want to allow users to have multiple email addresses. In a relational database, the canonical way to do that would be to create an email_addresses table with a many-to-one relationship to users, which implies a join. So in Cassandra, you would traditionally denormalize that as multiple columns: email1, email2, etc. While this is usually fine from a performance standpoint (because both adding new columns and having columns without values is virtually free in Cassandra), it is tedious to use, not very natural, and has a few drawbacks, like forcing you to do a read before adding a new email address (to know which column name to use).
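A minimal sketch of this denormalized pattern in CQL 3 (the users table, its columns, and the sample values are hypothetical):

-- Emails denormalized into numbered columns instead of a joined table.
CREATE TABLE users (
    user_id text PRIMARY KEY,
    email1 text,
    email2 text
);

-- Adding another address means first reading the row to find a free
-- slot (the read-before-write drawback), then writing into it:
SELECT email1, email2 FROM users WHERE user_id = 'jdoe';
UPDATE users SET email2 = 'jdoe@example.org' WHERE user_id = 'jdoe';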
Cassandra Introduction:
PROS
The main problems that a NoSQL database aims to solve typically revolve around issues of scale. When data no longer fits on a single MySQL server, or when a single machine can no longer handle the query load, some strategy for sharding and replication is required.
The pitch behind most NoSQL databases like Cassandra is that because they were designed from the ground up to be distributed and to handle large data volumes, they can provide some combination of the following benefits that a simple installation of MySQL or Postgres can't easily offer:
- Automatic sharding of data. New data gets automatically assigned to the appropriate node.
- Automatic replication of data. Multiple nodes each store a copy of the data, up to a certain configured replication factor.
- Schema-less data for simpler migrations. Schema changes for large tables can take a long time and lock the tables, blocking any writes. A database with only a loosely defined schema (like Cassandra's and HBase's column families) or none at all (as in key/value stores) should make this easier.
- Automatic scalability by adding new nodes. Adding new nodes automatically re-partitions the data for load balancing purposes.
- Multiple nodes that can accept writes. Unlike a standard MySQL master/slave setup, multiple nodes in a NoSQL database can accept updates, thereby supporting much higher query throughput.
CONS
- All data for a single row must fit (on disk) on a single machine in the cluster. Because row keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound.
- A single column value may not be larger than 2GB. (However, large values are read into memory when requested, so in practice a "small number of MB" is more appropriate.)
- The maximum number of columns per row is 2 billion.
- Keys (and column names) must be under 64 KB.
- Cassandra has two levels of indexes: key and column. But in super column families there is a third level of subcolumns; these are not indexed, and any request for a subcolumn deserializes _all_ the subcolumns in that supercolumn. So you want to avoid a data model that requires large numbers of subcolumns. Composite columns do not have this limitation.
- Cassandra's public API is based on Thrift, which offers no streaming abilities -- any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265. As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and make a file correspond to a row, with the chunks as column values (a sketch of this follows the list).
For more details, please refer to: http://www.datastax.com/dev/blog/binary-protocol
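One possible shape for that chunking workaround, sketched in CQL 3 (the table, its columns, and the sample values are assumptions, not an established API):

-- One row per file; each chunk is stored as a separate column value.
CREATE TABLE file_chunks (
    file_id text,
    chunk_index int,
    data blob,
    PRIMARY KEY (file_id, chunk_index)
);

-- The application splits the file into pieces of a comfortable size
-- (for example 64MB) and writes each piece under the same row key:
INSERT INTO file_chunks (file_id, chunk_index, data)
VALUES ('report.pdf', 0, 0x89504e47);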
Cassandra data model:
In Cassandra, you define column families. Column families can (and should) define metadata about the columns, but the actual columns that make up a row are determined by the client application. Each row can have a different set of columns. There are two types of column families:
Static Column Families:
A static column family uses a relatively static set of column names and is similar to a relational database table. For example, a column family storing user data might have columns for the user name, address, email, phone number, and so on. Although the rows generally have the same set of columns, they are not required to have all of the columns defined. Static column families typically have column metadata pre-defined for each column.
Dynamic Column Families:
A dynamic column family takes advantage of Cassandra's ability to use arbitrary application-supplied column names to store data. A dynamic column family allows you to pre-compute result sets and store them in a single row for efficient data retrieval. Instead of defining metadata for individual columns, a dynamic column family defines the type information for column names and values, but the actual column names and values are set by the application when a column is inserted.
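A sketch of the distinction in CQL 3 terms (both tables and all column names below are hypothetical): a static column family maps to a table with fixed columns, while a dynamic column family maps to a wide row whose "column names" come from a clustering key.

-- Static: a fixed, relational-style set of columns per row.
CREATE TABLE users (
    user_id text PRIMARY KEY,
    name text,
    address text,
    email text
);

-- Dynamic: the application supplies the "column names" (here, the
-- follower IDs), pre-computing a result set inside one wide row.
CREATE TABLE followers (
    user_id text,
    follower_id text,
    PRIMARY KEY (user_id, follower_id)
);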
Column families can have the following types of columns:
Standard Columns:
A column is a tuple containing a name, a value, and a timestamp. A column must have a name, and the name can be static or it can be dynamically set when the column is created by your application.
Composite:
Composite columns comprise fully denormalized wide rows by using composite primary keys. You create and query composite columns using CQL 3.
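For instance, a minimal CQL 3 sketch (the table and its columns are hypothetical):

-- The composite primary key (sensor_id, reading_time) produces one
-- wide row per sensor, with one composite column per reading.
CREATE TABLE sensor_readings (
    sensor_id text,
    reading_time timestamp,
    value double,
    PRIMARY KEY (sensor_id, reading_time)
);

SELECT reading_time, value FROM sensor_readings WHERE sensor_id = 's42';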
Expiring:
A column can also have an optional expiration date called a TTL (time to live). Whenever a column is inserted, the client request can specify an optional TTL value, defined in seconds, for the column. TTL columns are marked as deleted after the requested amount of time has expired. Once they are marked with a tombstone, they are automatically removed during the normal compaction and repair processes.
If you want to change the TTL of an expiring column, you have to re-insert the column with a new TTL. In Cassandra, the insertion of a column is actually an insertion or update operation, depending on whether or not a previous version of the column exists.
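In CQL, the TTL is attached per request; a small sketch (the sessions table and sample values are hypothetical):

CREATE TABLE sessions (
    session_id text PRIMARY KEY,
    login_token text
);

-- The login_token column expires 86400 seconds (one day) after insert.
INSERT INTO sessions (session_id, login_token)
VALUES ('s1', 'abc123')
USING TTL 86400;

-- Changing the TTL means re-inserting the column with a new TTL:
INSERT INTO sessions (session_id, login_token)
VALUES ('s1', 'abc123')
USING TTL 172800;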
Counter:
A counter is a special kind of column used to store a number that
incrementally counts the occurrences of a particular event or process.
For example, you might use a counter column to count the number of times
a page is viewed.
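A brief CQL sketch of that page-view example (the table and column names are hypothetical; note that in a counter table, all non-key columns must be counters):

CREATE TABLE page_views (
    page text PRIMARY KEY,
    views counter
);

-- Counters are modified with UPDATE, never INSERT:
UPDATE page_views SET views = views + 1 WHERE page = '/home';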
Super:
Reading any sub-column of a super column causes the entire super column and all of its sub-columns to be read into memory for each read request. This results in severe performance issues. Super columns are not supported in CQL 3.
Keyspaces:
A keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together.
Replication is controlled on a per-keyspace basis, so data that has different replication requirements should reside in different keyspaces.
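For example, in CQL (using the map-style syntax of more recent versions; the keyspace name and replication factor are assumptions):

-- Replication is set per keyspace, not per column family:
CREATE KEYSPACE hotel_app
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};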
Concepts: CAP
The CAP theorem states that you have to pick two of Consistency, Availability, and Partition tolerance: you can't have all three at the same time and get acceptable latency.
Consistency: Every read sees the latest piece of data.
Availability: If one or more nodes in the cluster go down, the system continues to operate.
Partition tolerance: If messages are lost between a couple of the nodes, the system again continues to operate.
Read/Write Operations in Cassandra:
Cassandra is a peer-to-peer, read/write-anywhere architecture, so any user can connect to any node in any data centre and read/write the data they need, with all writes being partitioned and replicated for them automatically throughout the cluster.
Write Operation:
Data is first written into the commit log for durability purposes, then written to a memtable in memory. Once the memtable becomes full, the data gets flushed into an SSTable.
Writes are atomic at the row level: all columns are written or updated, or none are.
RDBMS-style transactions aren't supported.
Write Strategies:
Any: A write must succeed on any available node. It doesn't matter if that particular node is responsible for storing that type of data.
One: A write must succeed on any node responsible for storing that row (either primary or replica).
Quorum: A write must succeed on a quorum of replica nodes, determined by (replication_factor / 2) + 1 (see the sketch after this list).
Local Quorum: A write must succeed on a quorum of replica nodes in the same data centre as the coordinator node.
Each Quorum: A write must succeed on a quorum of replica nodes in all data centres.
All: A write must succeed on all replica nodes for a row key.
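As an illustration, in cqlsh the consistency level is set per session before issuing the write (a sketch; the users table and values are hypothetical):

-- Require a quorum of replicas to acknowledge subsequent writes:
CONSISTENCY QUORUM;
INSERT INTO users (user_id, name) VALUES ('jdoe', 'John Doe');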
Hinted Handoffs:
Cassandra tries to write a row to all replicas for that row. If some replica nodes aren't available, a hint gets stored on another node so that the down nodes can be updated once they are back. If no replicas are available for that row, use of the ANY consistency level instructs the coordinator node to store the hint and the raw data for the down node, and to update it once it's back.
Read Operation:
Read Strategies:
One: Reads from the closest node holding the data.
Quorum: Returns a result from a quorum of servers with the most recent timestamp for the data.
Local Quorum: Returns a result from a quorum of servers with the most recent timestamp for the data, within the same data centre as the coordinator node.
Each Quorum: Returns a result from a quorum of servers with the most recent timestamp in all the data centres.
All: Returns a result from all replica nodes for a row key.
Read Repair:
Ensures frequently read data remains consistent across all nodes.
When a data read operation completes on one node, it is compared with the data on the other replica nodes for that row, and if they are inconsistent, a write operation is performed on the out-of-date nodes to update the row to reflect the most recent changes. Read repair can be configured per column family and is enabled by default.