Today, Yelp held a tech talk at Columbia University about the data warehouse it has adopted.
Yelp uses Amazon Redshift as its data warehouse.
Redshift has several key features:
1. Massively Parallel Processing
2. SQL access
3. Column-based Datastore
Benefits are:
1. Data is structured, accessible and well documented.
2. Architecture allows for easy extensibility and sharing across teams.
3. Allows use of the entire SQL-compatible tool ecosystem.
Details:
Massively Parallel Processing (MPP)
Traditional big data processing typically relies on Hadoop + MapReduce. MapReduce's native control mechanism is Java code (to implement the Map and Reduce logic), whereas MPP products are queried with SQL (Structured Query Language). See the linked reference for details.
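To make the contrast concrete, here is a minimal sketch that computes the same per-city review count twice: once with hand-written map and reduce functions, and once with a single SQL statement. The table name `reviews`, its columns, and the sample rows are made up for illustration. Python's built-in `sqlite3` stands in for an MPP engine only because it accepts SQL; it is not parallel.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (business, city, stars).
REVIEWS = [
    ("cafe_a", "NYC", 5),
    ("cafe_b", "NYC", 3),
    ("bar_c", "SF", 4),
    ("bar_d", "SF", 2),
    ("deli_e", "NYC", 4),
]


def map_phase(rows):
    """Map step: emit one (city, 1) pair per input row."""
    for _business, city, _stars in rows:
        yield city, 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_with_mapreduce(rows):
    """Imperative style: the programmer writes the map and reduce logic."""
    return reduce_phase(map_phase(rows))


def count_with_sql(rows):
    """Declarative style: the same aggregation as one SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reviews (business TEXT, city TEXT, stars INTEGER)")
    conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
    result = dict(conn.execute("SELECT city, COUNT(*) FROM reviews GROUP BY city"))
    conn.close()
    return result
```

Both functions return the same mapping from city to count. The difference is in who decides how the work is carried out: the programmer in the first case, the query engine in the second.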
Below is the structure for implementing MPP.
Data is distributed across the segment databases to achieve data and processing parallelism. This is done by creating a database table with a DISTRIBUTED BY clause, which automatically distributes the data across the segment databases. (Reference: Introduction to MPP)
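The idea can be sketched in a few lines, assuming a hash-based distribution on one key column. This is an illustration of the principle, not how any particular product implements it: the segment count, the `crc32` hash, and the sample `(user_id, amount)` rows are all assumptions, and threads stand in for separate segment hosts. Redshift expresses the same idea with a distribution key declared on the table rather than a DISTRIBUTED BY clause.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # assumed number of segment databases


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_index, num_segments=NUM_SEGMENTS):
    """Place each row on the segment chosen by its distribution key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_index], num_segments)].append(row)
    return segments


def local_sum(segment_rows):
    """Work done independently on one segment: a partial SUM per user_id."""
    partial = defaultdict(int)
    for user_id, amount in segment_rows:
        partial[user_id] += amount
    return dict(partial)


def parallel_sum(segments):
    """Run local_sum on every segment concurrently, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(local_sum, segments))
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


# Hypothetical (user_id, amount) rows, distributed by user_id.
SALES = [(1, 10), (2, 5), (1, 7), (3, 2), (2, 1), (4, 8)]
SEGMENTS = distribute(SALES, key_index=0)
TOTALS = parallel_sum(SEGMENTS)
```

Because every row with the same `user_id` lands on the same segment, each segment can finish its share of a `GROUP BY user_id` without talking to the others. This is why the choice of distribution key matters for query performance.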
Typical query statement in MPP
Column-based Datastore
Enables sparse table definitions
Enables compact storage
Improves scanning/filtering
(Benefits: wiki)
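One reason a column layout can be stored compactly is that the values of a single column sit next to each other, so runs of repeated values compress well. Run-length encoding is one common scheme for this. The sketch below is only an illustration under that assumption, with a made-up `city` column, and is not a description of the encodings Redshift actually applies.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse runs of equal adjacent values into [value, run_length] pairs."""
    return [[value, len(list(run))] for value, run in groupby(column)]


def rle_decode(encoded):
    """Expand [value, run_length] pairs back into the original column."""
    column = []
    for value, count in encoded:
        column.extend([value] * count)
    return column


# Hypothetical "city" column whose equal values are stored contiguously.
CITY_COLUMN = ["NYC"] * 4 + ["SF"] * 3 + ["LA"] * 2
ENCODED = rle_encode(CITY_COLUMN)
```

Here nine stored values shrink to three pairs. In a row layout the same city values would be interleaved with the other fields of each record, so such runs would not appear.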
In practice, row-oriented storage layouts are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column-oriented storage layouts are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes).
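The difference between the two layouts can be shown with a small in-memory sketch. The table, its fields, and the helper names are hypothetical. The point is only which data each access pattern has to touch.

```python
# Hypothetical table stored row by row: one complete record per entry.
ROW_STORE = [
    {"id": 1, "city": "NYC", "stars": 5},
    {"id": 2, "city": "SF", "stars": 3},
    {"id": 3, "city": "NYC", "stars": 4},
]


def to_column_store(rows):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


COLUMN_STORE = to_column_store(ROW_STORE)


def fetch_record(rows, record_id):
    """OLTP-style access: read every field of one record (suits a row layout)."""
    for row in rows:
        if row["id"] == record_id:
            return row
    return None


def average_stars(columns):
    """OLAP-style access: scan one column across all records (suits a column layout)."""
    stars = columns["stars"]
    return sum(stars) / len(stars)
```

`fetch_record` finds all fields of a record together in the row layout, while `average_stars` reads only the `stars` list and never touches `id` or `city`. On a table with many wide columns and terabytes of data, skipping the unused columns is where most of the savings for analytical queries come from.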
Amazon Redshift and Massively Parallel Processing
Original article: http://www.cnblogs.com/ireneyanglan/p/4856666.html