App42 provides many ready-made APIs for developers, and each API solves a different problem in app development. Different problems need different solutions, so the App42 architecture takes a hybrid approach at the database layer and picks the store per service. Some services are good candidates for an RDBMS, others suit NoSQL, and some require in-memory persistence.
App42 performs a lot of analytics on its data and also offers this to app developers as a service in the form of Marketing Automation. An analytics workload needs a different persistence solution at the DB layer. We chose Cassandra for this layer and fell deeply in love with it. HBase and MongoDB were also candidates, but we decided to go ahead with Cassandra, and here are the reasons why.
1. Cassandra scales linearly under massive write loads
App42 analytics generates quite a lot of data whenever an event is fired. Events from a single app may result in thousands of insertions into the database. We process billions of events, so we wanted storage that can withstand very heavy write operations and still scale. That narrowed our options to two: Cassandra and HBase. MongoDB was also a candidate, but because of its database-level write lock and the resulting poor insertion performance, it dropped off the list at the very beginning of our selection process. Cassandra and HBase both handle heavy write operations well, but we opted for Cassandra after looking at the published benchmarks (http://planetcassandra.org/nosql-performance-benchmarks/) and considering how easy the cluster is to manage. For us Cassandra was the perfect choice for heavy write loads, and it scales linearly as new machines are added to the ring.
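To make the "new machines are added to the ring" idea more concrete, here is a minimal sketch of a token ring in plain Python. It is a toy model, not Cassandra's implementation: the Ring class, the node names, and the use of MD5 as the hash are all assumptions made for illustration. What it shows is that when a node joins, the only keys that change owner are the ones the new node takes over, which is part of why capacity can grow by simply adding machines.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is used purely for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring in which each node owns several tokens (virtual nodes)."""

    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self._tokens = []   # sorted ring positions
        self._owners = {}   # ring position -> node name

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            self._owners[t] = node
            bisect.insort(self._tokens, t)

    def owner(self, key: str) -> str:
        # The owner is the first token at or after the key, wrapping around.
        idx = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._owners[self._tokens[idx]]


# Hypothetical workload: 10,000 event keys on a three-node ring.
keys = [f"event-{i}" for i in range(10_000)]
ring = Ring()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)

before = {k: ring.owner(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-d")
```

Keys that did not move keep their original owner, so the existing nodes never shuffle data among themselves when the ring grows.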
2. Cassandra is an excellent choice for real-time analytic workloads
Because it supports heavy write operations, Cassandra is a natural choice for real-time analytics. The rule of thumb for real-time analytics is that the data should already be calculated and persisted in the database before anyone asks for it. If you know which reports you want to show in real time, you can design your schema around them and generate the report data as events arrive. Batch mutations and distributed global counters are the features we really liked while using Cassandra. If you are looking for a similar kind of solution, Cassandra will most likely meet your needs.
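The "calculate at write time" rule can be sketched in a few lines. The example below is an in-memory stand-in, not Cassandra code: RollupStore, the bucket names, and the sample event are hypothetical. In Cassandra each increment would roughly correspond to a counter update (for example an UPDATE that sets a counter column to itself plus one), and the fan-out per event is the kind of work a batch mutation groups together.

```python
from collections import Counter
from datetime import datetime, timezone


class RollupStore:
    """In-memory stand-in for counter rows keyed by (app, event, granularity, bucket)."""

    def __init__(self):
        self.counters = Counter()

    def record(self, app_id: str, event: str, ts: datetime) -> None:
        # One incoming event fans out into one increment per report granularity,
        # so the report data already exists by the time someone asks for it.
        buckets = (
            ("hour", ts.strftime("%Y-%m-%dT%H")),
            ("day", ts.strftime("%Y-%m-%d")),
        )
        for granularity, bucket in buckets:
            self.counters[(app_id, event, granularity, bucket)] += 1

    def read(self, app_id: str, event: str, granularity: str, bucket: str) -> int:
        # Reading a report is a single key lookup; nothing is aggregated here.
        return self.counters[(app_id, event, granularity, bucket)]


store = RollupStore()
for minute_of_day in (615, 645, 665):  # 10:15, 10:45 and 11:05 UTC
    ts = datetime(2014, 5, 1, minute_of_day // 60, minute_of_day % 60, tzinfo=timezone.utc)
    store.record("app-1", "level_complete", ts)

hourly = store.read("app-1", "level_complete", "hour", "2014-05-01T10")
daily = store.read("app-1", "level_complete", "day", "2014-05-01")
print(hourly, daily)
```

The trade-off is more writes per event in exchange for reads that never scan raw data, which suits a store that is strong on writes.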
3. Cassandra can be integrated with Hadoop, Hive and Apache Spark for batch processing
As described above, Cassandra is a good candidate for real-time analytics, but there will be scenarios where you need to run batch processing over the stored data. Cassandra can be integrated easily with Hadoop and Hive to achieve this. On-demand in-memory analytics is also possible through Apache Spark integration.
4. Tunable Consistency and CAP parameters
According to the CAP Theorem (http://en.wikipedia.org/wiki/CAP_theorem), a distributed database can guarantee at most two of Consistency (C), Availability (A) and Partition Tolerance (P) at a time. It is impossible to achieve all three at once. Cassandra lets you configure and tune this trade-off based on your priorities. By default it falls in the AP category.
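One common way to reason about this tuning is the replica overlap rule: if the number of replicas that acknowledge a write plus the number consulted on a read is greater than the replication factor, every read touches at least one replica that has the latest write. The sketch below only encodes that arithmetic. The function names are made up for illustration, and the model ignores failures, repair, and timing, so treat it as a simplification rather than a description of Cassandra's internals.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_overlap_writes(write_replicas: int, read_replicas: int,
                         replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return write_replicas + read_replicas > replication_factor


rf = 3  # assumed replication factor for this example
settings = {
    "write ONE, read ONE": (1, 1),
    "write QUORUM, read QUORUM": (quorum(rf), quorum(rf)),
    "write ALL, read ONE": (rf, 1),
}

results = {
    name: reads_overlap_writes(w, r, rf) for name, (w, r) in settings.items()
}
for name, overlaps in results.items():
    label = "reads overlap the latest write" if overlaps else "reads may be stale"
    print(f"{name}: {label}")
```

With a replication factor of 3, ONE/ONE favors availability and latency, while QUORUM/QUORUM gives overlapping reads and writes and still tolerates one replica being down.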
Cassandra has many other features, but these were the main points of consideration for us, and we chose it on that basis.
We hope this post helps others who are architecting products that require analytics over large amounts of data and who want to stay resilient as they scale.
If you need Big Data analytics under heavy write load, Cassandra can be a perfect choice for you. Your feedback and suggestions on this post are heartily welcome, and you are free to reach out to us at support@shephertz.com with any further queries or feedback.