Before starting to learn how to use the Hadoop ecosystem, it is important to understand why we need it in addition to traditional relational database systems. So, let’s first understand the limitations of RDBMS.
Understanding the limitations of RDBMS (relational database management systems)
First, let’s look at the landscape of databases in use today, such as SQL Server, Oracle, or MySQL. As companies embark on big data projects nowadays, they run into the limitations of current relational database systems. Such limitations include:
- The new world of big data involves huge datasets – terabytes or petabytes. Building databases at such a scale with existing RDBMS is very difficult, complex, and expensive.
- Existing RDBMS are not necessarily built to deal with data at that scale or speed (when real-time data intake is needed).
- Sophisticated processing, such as machine learning, is difficult to perform inside a traditional RDBMS.
However, the Hadoop ecosystem was not developed to replace those existing relational database systems, but to solve a different set of data problems that the existing RDBMS could not solve!
So, now let’s look at what we can choose as our database.
- Flat files: Before the Hadoop ecosystem was broadly available, it was common to keep information in plain file systems, even in XML files.
- HDFS (Hadoop Distributed File System): HDFS was developed as an alternative to whatever filesystem you are currently using.
- NoSQL (key/value stores such as Redis, document stores such as MongoDB, columnstores, graph databases, etc.)
- RDBMS (MySQL, SQL Server, Oracle)
Even though some portions of the Hadoop ecosystem could be categorized as NoSQL, it is important to understand that Hadoop itself is actually not a database. It is an alternative file system with a processing library. That said, Hadoop implementations commonly include a NoSQL database as well.
Also, relational databases are still around for the problems that Hadoop is not designed to solve. So, again, Hadoop is not a replacement for RDBMS, but an addition to it.
Hadoop and HBase
Hadoop uses an alternative filesystem, HDFS (Hadoop Distributed File System). HBase is a NoSQL database that is commonly used with Hadoop solutions. HBase is a wide columnstore, which means that each row consists of one key and a variable number of values. For instance, if I have a database system for saving customer information:
- In a relational database:
- Have a fixed set of columns for customer id, customer name, customer gender, customer location, etc.
- In HBase or NoSQL:
- Have key-value pairs
- Each instance can have a different number of attributes (columns)
e.g. [[name: Lynda, location: Irvine], [name: Keith]]
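The difference between the two models above can be sketched with plain Python dicts (an illustrative model only; the field names are taken from the example, and a real deployment would use an HBase client library):

```python
# In a relational table, every row has the same columns; a missing
# value must still be stored, typically as a NULL.
relational_rows = [
    {"id": 1, "name": "Lynda", "location": "Irvine"},
    {"id": 2, "name": "Keith", "location": None},  # NULL placeholder required
]

# In a wide columnstore like HBase, each row key maps only to the
# columns that row actually has -- no NULL padding is needed.
hbase_rows = {
    "row1": {"name": "Lynda", "location": "Irvine"},
    "row2": {"name": "Keith"},  # a different (smaller) set of columns is fine
}

# Every relational row carries all columns; the HBase-style rows do not.
assert all(set(r) == {"id", "name", "location"} for r in relational_rows)
assert "location" not in hbase_rows["row2"]
```

This is why each instance (row) in HBase can have a different number of attributes: the schema is per-row, not per-table.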
The CAP theorem and Hadoop
When we choose a database system, we can look at the CAP theorem to understand the characteristics of different database systems. Each of its three characteristics is supported by certain classes of databases.
- Consistency
- The ability to combine two or more data modification operations as a unit.
- Example: ‘money transaction’. Say, we move money from a savings account to a checking account. We want both changes to occur successfully or neither.
- Availability (uptime)
- The ability to make copies of the data so that if one copy goes down in one location, the data will still be available in other locations.
- Partitioning (scalability)
- The ability to split the dataset across the different locations or machines so we can grow the amount of our dataset.
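The money-transfer example above can be sketched with SQLite standing in for a relational database (the table and account names here are made up for illustration):

```python
import sqlite3

# A toy accounts table: all names and amounts are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('savings', 100), ('checking', 0)")
conn.commit()

# Move 30 from savings to checking as one unit: either both UPDATEs
# commit together, or (on any error) neither does.
try:
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - 30 WHERE name = 'savings'")
        conn.execute(
            "UPDATE accounts SET balance = balance + 30 WHERE name = 'checking'")
except sqlite3.Error:
    pass  # the whole transfer is rolled back automatically

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Because the two updates run inside one transaction, no reader can ever observe money that has left savings but not yet arrived in checking.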
Traditional databases can support consistency and availability, but not partitioning. As mentioned earlier, however, it is important to build scalability into a database system because the amount of data keeps growing larger and larger. So, this is where Hadoop comes into play.
Hadoop supports high ‘Partitioning‘ and ‘Availability‘. For partitioning, you can run Hadoop storage on any commodity hardware. When you run a Hadoop system, it makes three copies of the data by default, so when one server goes down due to any problem, you can just replace the hardware with a new machine. The Hadoop file system automatically manages that copy process, so you can scale out a Hadoop cluster nearly infinitely.
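The three-copies-by-default idea can be sketched as follows (a simplified simulation; the node names and random placement policy are assumptions, not HDFS’s actual rack-aware placement logic):

```python
import random

REPLICATION = 3  # HDFS's default replication factor
nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster

def place_block(block_id, nodes, replication=REPLICATION):
    """Pick `replication` distinct nodes to hold copies of one block."""
    return random.sample(nodes, replication)

# Place two blocks of a file across the cluster.
placements = {b: place_block(b, nodes) for b in ["blk_001", "blk_002"]}

# If any one node holding blk_001 dies, at least two live replicas
# remain, so no data is lost and the lost copy can be re-replicated
# onto a healthy node.
dead = placements["blk_001"][0]
survivors = [n for n in placements["blk_001"] if n != dead]
```

The key design point is that replication is handled by the filesystem itself, not by the administrator, which is what makes replacing failed commodity hardware so cheap.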
What kind of data is right for Hadoop?
So, what kind of data is right for a Hadoop system? We can think of two kinds of data: one is LOB (line of business) data and the other is behavioral data.
- Line of business (LOB) data ==> No.
- e.g. transaction data, which needs ‘consistency’
- Not a good fit for Hadoop
- Behavioral data ==> Yes
- e.g. clickstreams, logs, or sensor data, where scale matters more than strict consistency