4 Reasons We Bet On MongoDB

By Jun Xu, Principal System Architect at Perfect Market, with Tracy Tan, Senior Data Warehouse Architect at Perfect Market, and Cliff Tsai, Director of Systems Development at Perfect Market – March, 2014

At Perfect Market, we use MongoDB in our Digital Publishing Suite (DPS). A few years ago, I wrote a blog post outlining our evaluation process, why we chose a NoSQL database and why we decided to use MongoDB in our content processing platform. MongoDB was such a good fit for the content processing platform that we expanded its use to our ad delivery platform (Perfect Ads), content recirculation platform (Perfect Links) and the social platform (Perfect Social).

For a technology startup with limited resources, broadly adopting a new DBMS means betting its own future on the DBMS. We are very fortunate that the DBMS we selected has been maturing along with Perfect Market. MongoDB Inc. (formerly 10gen) keeps introducing new features and improvements to the product while consistently maintaining a high standard and demonstrating their professionalism in customer support.

After 3 years, our use of MongoDB for Perfect Market’s product suite has gone far beyond simple key-value lookup. Backed by MongoDB, our ad delivery and content recirculation platforms are serving and tracking over 1.5 billion page views per month, and our social platform is performing intensive social user data aggregation and content matching in real time. Furthermore, our MongoDB infrastructure has evolved from a single node to multiple multi-node replica sets. All in all, I would say our bet on MongoDB was a sweet one and has paid off handsomely.

There are many NoSQL products out there, so why did we bet on MongoDB? There are four major reasons: great performance, great features, ease of use and great support. Of course, not every day with MongoDB is a sunshine day; some of the tradeoffs we made are shared at the end of this post.

 

Reason #1: Great Performance

High performance is almost always the #1 driving factor for people to adopt a NoSQL solution. Not surprisingly, this was the #1 reason we bet on MongoDB as well.

MongoDB provides excellent read and write performance. I have discussed this at length in my previous blog post, and my view has not changed even after we expanded the use of MongoDB to the applications (DPS) that are much more demanding on performance.

In the DPS production environment, we partition our partner-related data into clusters and use a group of MongoDB replica sets for each cluster. Every cluster currently handles about 6000 writes and 3000 reads per second, which roughly translates to 500 tracking events (page views, clicks etc.) and 500 serving requests per second being processed. This comfortably met the cluster-level throughput goal we set when DPS was built.

Since version 2.2, locking in MongoDB has been made more granular, with improved concurrency behavior such as yielding to ensure data is in memory before a lock is acquired. Every database has its own read/write lock (versus only one global read/write lock for the entire MongoDB instance in prior versions). When a read lock exists, many read operations may share it; when a write lock exists, a single write operation holds it exclusively. A write lock is generally held for a very short period of time while changes are applied to data in memory, but this can still present a performance challenge for databases with heavy write loads. To reduce lock contention, we decided to run multiple MongoDB instances on one machine and create more granular databases in each instance. In short, data is stored in different instances based on its usage, and within every MongoDB instance one database is created for each partner.

The following diagram illustrates, at a high level, how the MongoDB instances and databases for one DPS cluster are organized and used (only two server machines are drawn for illustration purposes):

 

[Diagram: MongoDB instances, per-partner databases, and replica sets in one DPS cluster across two server machines]

 

In the above diagram, each cylinder represents a MongoDB instance and each small rounded rectangle represents a database for a partner. Each rectangular cuboid represents a server machine and each large rounded rectangle represents a replica set. As the diagram shows, as long as the job that populates the serving data for a partner is idle, read requests for serving that partner are never blocked, even when the tracking data instances are constantly receiving heavy writes, because no lock is shared between reads and writes applied to different MongoDB instances. Read requests for serving the partner are also unlikely to be blocked while the jobs that populate the serving data for other partners are running, because lock contention is much lower for operations against different databases.
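To make the layout above concrete, here is a minimal sketch, assuming the MongoDB Java driver (one of the drivers we use) and hypothetical host names, ports, database and collection names, of how an application might route tracking writes and serving reads to separate instances with one database per partner:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class ClusterRouting {
    public static void main(String[] args) throws Exception {
        // Hypothetical layout: one mongod per usage type on the same machine,
        // listening on different ports; one database per partner inside each.
        MongoClient trackingInstance = new MongoClient("dps-node1.example.com", 27017);
        MongoClient servingInstance  = new MongoClient("dps-node1.example.com", 27018);

        String partner = "partner_a";  // hypothetical per-partner database name

        // Heavy event writes go only to the tracking instance ...
        DBCollection trackingEvents = trackingInstance.getDB(partner).getCollection("events");
        trackingEvents.insert(new BasicDBObject("pid", "9566fc3e46e8b9ef8ec6945547cef20a")
                .append("ts", System.currentTimeMillis()));

        // ... while serving reads hit a different mongod, so they never compete
        // for the same lock as the tracking writes.
        DBCollection servingLinks = servingInstance.getDB(partner).getCollection("links");
        System.out.println(servingLinks.findOne(new BasicDBObject("hn", "www.example.com")));

        trackingInstance.close();
        servingInstance.close();
    }
}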

 

Two sample documents shown below illustrate what tracking data and serving data look like in DPS:

{
    "_id": "60d545b6b3c212aae6c62946e2697d1a",
    "hn": "www..com",
    "ov": {
        "to": 1
    },
    "pd": "2014-02-23",
    "ph": "CNN will cancel low-rated 9 p.m. 'Piers Morgan Live'",
    "pi": "http://www..com/img-530acc0b/turbine/la-et-st-cnn-will-cancel-lowrated-9-pm-piers-morgan-live-20140223",
    "pid": "9566fc3e46e8b9ef8ec6945547cef20a",
    "pu": "http://www..com/entertainment/tv/showtracker/la-et-st-cnn-will-cancel-lowrated-9-pm-piers-morgan-live-20140223,0,1748239.story",
    "sc": [
        "entertainment/tv"
    ],
    ...,
    ...,
    "t": {
        "n": "tpl06",
        "v": 1,
        "x": [
            {
                "t": "pro",
                "f": "",
                "p": "a",
                "b": 5,
                "k": 8,
                "ov": {
                    "a": 1
                }
            }
        ]
    },
    "ts": NumberLong("1393354359309")
}

Above is a sample document for one type of the tracking events.

{
    "_id": "04033c2ce01b250e3b4c82e1f81c8101",
    "e": 0.1675,
    "hl": "Rainstorm could be Los Angeles' wettest in 2 years",
    "hn": "www..com",
    "iu": "http://www..com/img-530abdb8/turbine/la-me-ln-rainstorm-los-angeles-wettest-2-years-20140223",
    "k": 156,
    "lec": 0.00212325783972125,
    "l": 73472,
    "me": 0.00234266742268495,
    "pd": "2014-02-23T00:00:00-0800",
    "pid": "6e31798ea2df03460c565258433160c4",
    "v": 3630,
    "sg": "1f0f70bf2b5ad94c7387e64c16dc455a",
    "sop": 3,
    "ts": NumberLong("1393354769486"),
    "ul": "http://www.latimes.com/local/lanow/la-me-ln-rainstorm-los-angeles-wettest-2-years-20140223,0,4826440.story"
}

The sample document above represents a link used by the content recirculation platform. A real serving request is much more complex and may contain many documents like the one above with optimized link placement information, plus a fair amount of serving data needed by the different DPS platforms, such as ad hinting data, editorial control data and personalized content or friend matches for a social media user.

To achieve the throughput goal as well as to satisfy the data consistency/durability requirements, we also carefully set the appropriate write concerns and read preferences for all modules:

  • For modules that take very heavy write loads where minor data loss is acceptable, we set the write concern to errors ignored (i.e., fire-and-forget) and created comprehensive monitoring support to make sure the data flows in as expected.
  • For modules that only take heavy read load where temporary data inconsistency can be accepted, we set the read preference to secondaryPreferred and created the monitoring support to watch the replication lag closely.
  • For backend processes that rely on read-your-writes consistency, we set the write concern to acknowledged and the read preference to primary.
  • For modules that process critical editorial control data sent by partners, we use the journaled write concern.

The rule of thumb here is that you have to balance data consistency/durability against performance for the use cases and requirements you are dealing with. It’s worth noting that to help its users achieve higher levels of data consistency in general, MongoDB changed the default write concern for all client drivers from fire-and-forget to acknowledged in November 2012.
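As an illustration of the configurations above, here is a minimal sketch using the MongoDB Java driver; the host, database and collection names are hypothetical, and the driver constants shown are the equivalents of the write concerns and read preferences described:

import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.ReadPreference;
import com.mongodb.WriteConcern;

public class ConsistencySettings {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("dps-node1.example.com", 27017);  // hypothetical host

        // Heavy-write tracking module: fire-and-forget writes, monitored externally.
        DBCollection tracking = client.getDB("partner_a").getCollection("events");
        tracking.setWriteConcern(WriteConcern.UNACKNOWLEDGED);

        // Heavy-read serving module: secondaries may return slightly stale data.
        DBCollection serving = client.getDB("partner_a").getCollection("links");
        serving.setReadPreference(ReadPreference.secondaryPreferred());

        // Backend process that must read its own writes.
        DBCollection backend = client.getDB("partner_a").getCollection("jobs");
        backend.setWriteConcern(WriteConcern.ACKNOWLEDGED);
        backend.setReadPreference(ReadPreference.primary());

        // Critical editorial control data: wait for the journal commit.
        DBCollection editorial = client.getDB("partner_a").getCollection("editorial");
        editorial.setWriteConcern(WriteConcern.JOURNALED);

        client.close();
    }
}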

 

Reason #2: Great Features

MongoDB has many great features that offered us a competitive edge (it may be worth another post in the future to talk about them in detail), and we especially love the fact that compared with other NoSQL solutions, MongoDB functions more like a general purpose DBMS. The following is a quick rundown of the major building blocks of MongoDB to give you a sense of its DBMS roots:

  • Data organization – MongoDB organizes data logically by databases, collections and documents, which is very similar to the organizational hierarchy of a relational database. A document in MongoDB is a BSON document (“Binary JSON”: http://bsonspec.org), which is JSON-like but supports additional native field types such as dates, integers and binary data. All fields in a document are queryable and updatable (except the “_id” field, which is queryable but not updatable).
  • Data manipulation – MongoDB employs a client-server architecture based on a binary protocol. It provides a general-purpose query language (JSON format, SQL-like) for data manipulation. Data is always persisted (eventually) and memory is automatically managed to gain fast access to hot data.
  • Data administration – MongoDB provides tools and/or support for many administrative tasks, such as data backup and restore, performance monitoring and security (role-based access control is supported in version 2.4 release).
  • Dynamic Schema – Unlike in a relational DBMS, in the NoSQL world the data schema is usually managed by the application. If databases or collections (tables) do not exist when they are accessed, MongoDB creates them automatically without any schema enforced at the time of creation. MongoDB also gives users all the flexibility of indexing – creating indexes on nested fields and array fields, or creating sparse indexes, geospatial indexes and text indexes etc. (MongoDB Indexes); a short indexing sketch follows this list.
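As a small illustration of this indexing flexibility, here is a sketch using the Java driver; the collection and field names are hypothetical, loosely modeled on the sample documents shown earlier:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class IndexingSketch {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);

        // No schema needs to be declared up front: the database and collection
        // are created on first use.
        DBCollection links = client.getDB("partner_a").getCollection("links");

        // Index on a nested field ("t.n" inside an embedded document).
        links.createIndex(new BasicDBObject("t.n", 1));

        // Index on an array field ("sc"); MongoDB indexes each array element.
        links.createIndex(new BasicDBObject("sc", 1));

        // Sparse index: only documents that actually contain "sg" are indexed.
        links.createIndex(new BasicDBObject("sg", 1), new BasicDBObject("sparse", true));

        client.close();
    }
}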

We found that the advantages of adopting a NoSQL solution that implements many general purpose DBMS features were manifold. First, our past experience with RDBMS’s was still useful because of the similarity among all general purpose DBMS’s, so the learning curve for MongoDB was shorter. Second, there was less friction when deploying because there was not much hard upfront commitment required to adopt such a solution.

MongoDB, like other general purpose DBMS’s, is flexible on hardware and efficient at managing memory and storage. A pure in-memory solution would force you to use a machine (or machines) with large enough memory to hold all your application data. A heavy distributed data system would require you to commit a few machines to your application from the very beginning. MongoDB’s flexibility here can deliver fairly good performance even on an average enterprise class server.

That is how we started our MongoDB deployment: a moderate single node that gradually evolved into many nodes with higher-end configurations. Third, the versatility of such a general purpose DBMS made it very adaptive to the various use cases that have arisen since the initial deployment. From an ROI perspective, therefore, betting on MongoDB was well worth the investment.

 

Reason #3: Ease of Use

MongoDB is fairly easy to set up and use. Based on my own experience, using MongoDB’s binary distribution, it only takes a few minutes to download the software, run the server and launch the console against the server instance. It’s that simple! Picking a client driver in your favorite language and getting some prototype code running probably won’t take more than one or two hours either.
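For instance, a first prototype against a freshly unpacked mongod can be as small as the following sketch (the database and collection names are hypothetical):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class FirstPrototype {
    public static void main(String[] args) throws Exception {
        // Connect to a mongod running locally with default settings.
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection pages = client.getDB("prototype").getCollection("pages");

        // Insert a document; no schema or DDL required beforehand.
        pages.insert(new BasicDBObject("hl", "Hello MongoDB").append("v", 1));

        // Query it back with a JSON-style filter.
        DBObject found = pages.findOne(new BasicDBObject("v", 1));
        System.out.println(found);

        client.close();
    }
}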

Why is MongoDB so easy to use? Following are a few reasons that we find most compelling:

  • Simple Server Packaging – To quickly have MongoDB ready and running, you can just download the binary distribution for your platform, unpack it and run. If you prefer to go through a formal installation process (which makes later upgrades easier), you can use the installers too (MongoDB Installation). In our environment, given the packaging system we have, we find it much easier to use the binary distributions.
  • Good Client Driver Support – The packaging for MongoDB client drivers is really simple too. The Java driver is merely one self-contained jar file, and the PHP driver is a standard PEAR package. MongoDB has 12 official drivers, and the 3 drivers we have used (Java, PHP and Perl) are all easy to use, reliable and high-performing.
  • Intuitive Design – Although MongoDB is a NoSQL document DBMS, it bears resemblance to RDBMS’s. The choice of JSON as the document format and JavaScript as the scripting language was wise given their popularity. Its JSON-format query language is SQL-like and intuitive to use. The principle of “make common things easy, rare things possible” seems well rooted in MongoDB’s design.
  • Schema-less – When using MongoDB, the data model is usually controlled by developers without DBA involvement. Complex JSON documents can be dumped to a database as-is. Of course this does not mean that developers can do whatever they want without any processes and visibility. In our development process, we require developers to document data models on our internal wiki and make sure they are synchronized with the code.
  • Third party support – Due to MongoDB’s popularity, many third party software vendors have built tools and solutions tailored for MongoDB. For example, Java shops may find Spring Data MongoDB a natural fit to their existing architecture. We used Spring Data MongoDB in some of our projects and loved how handy the automatic mapping between beans and documents, and from methods to queries, turned out to be (a brief sketch follows this list).
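Here is a brief, hypothetical sketch of the kind of mapping we found handy; the entity and repository below are illustrative examples, not our actual DPS classes:

import java.util.List;

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.core.mapping.Field;
import org.springframework.data.mongodb.repository.MongoRepository;

// Bean mapped automatically to documents in the "links" collection.
@Document(collection = "links")
class Link {
    @Id
    private String id;

    @Field("hl")          // maps the bean property to the short document field name
    private String headline;

    @Field("hn")
    private String hostname;

    // getters and setters omitted for brevity
}

// Query derived from the method name: no implementation needed.
interface LinkRepository extends MongoRepository<Link, String> {
    List<Link> findByHostname(String hostname);
}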

 

Reason #4: Great Support

With all the great features MongoDB provides and based on the maturity of the product, it is possible to deploy MongoDB in production without commercial support if the setup is simple and small-scale. The MongoDB community and team members are active and helpful on forums such as StackOverflow and the mongodb-user discussion group. But when MongoDB deployment gets more complex and gains scale in your IT infrastructure – especially when you start to use MongoDB in mission critical applications – the chance of you needing help from experts increases. When an incident does occur (it could simply be a misunderstanding of a feature), you might prefer speaking to a MongoDB engineer instead of reading and posting messages on community forums.

We have had commercial support from MongoDB for about two years. All of our MongoDB production instances are registered and monitored in the free cloud-based MMS (MongoDB Management Service). Commercial support gives us near-immediate response to any support ticket we create 24x7x365. Commercial support also includes routine checkups and need-based on-site consultation hours.

The responsiveness and professionalism demonstrated by the MongoDB support team is truly amazing. I spoke with our Senior Data Warehouse Architect, Tracy Tan, about her experience with MongoDB commercial support, and she outlined some attributes that made MongoDB support stand out from other software vendors:

  • Fast response – In my experience, the response time for a ticket created under commercial support is usually between a couple of hours to less than a day – much faster than the response time from Oracle or SAP support.
  • Strong technical and communication skills – Compared to other software vendors who offer paid support, the support reps at MongoDB are generally more technically skilled and more resourceful. They prefer to use their JIRA system as the main forum for collecting debugging information (e.g. database logs or systems information) and posting responses/solutions. Unlike support reps from other commercial software vendors, MongoDB reps generally do not require a phone call to start the support process. However, whenever we requested a phone discussion, they were always happy to make the appropriate arrangements and work with us over the phone to answer our questions.
  • Flexible – MongoDB’s commercial support has a subgroup called “Questions,” and they encourage us to ask them any questions we may have – even if the question may be more appropriate for an on-site consultation session. We have utilized this feature to gain important insights into MongoDB operations such as upgrades, backups, and replication setup, which in turn allowed us to perform those operations with ease and confidence.

 

It’s Not All Sunshine and Rainbows

No technology is perfect, and MongoDB is no exception. We had to make some tradeoffs in the course of adopting MongoDB; here are a few of them:

  • Lock Contention. At the time DPS was built, we traded the complexity of running multiple MongoDB instances on one machine and creating many databases for lower lock contention. The decision was based on a benchmark test we performed on MongoDB version 2.2, which showed that, on one machine, using more MongoDB instances and more databases could achieve significantly higher throughput. We ended up with a more complex implementation, and managing so many MongoDB instances and databases was quite a challenge. Had MongoDB offered more granular locking, we would have preferred one MongoDB instance containing fewer databases on each machine. As noted above, concurrency has improved in subsequent versions of MongoDB with the introduction of lock yielding and database-level locks.
  • Data consistency/durability and performance. This is a common tradeoff people make when using MongoDB to achieve high performance. We made it too, by specifying the most aggressive write concern (errors ignored) and read preference (secondaryPreferred) for the most performance-demanding modules. We would rather not do that if MongoDB could give us both strong data consistency/durability and high performance. The cost of the tradeoff was potential data loss and data inconsistency. Although minor data loss or temporary data inconsistency is acceptable for these modules, we want to react quickly if the situation gets worse, which is why we ended up building comprehensive monitoring support to watch the data and the replication lag closely (a small sketch of one way to watch replication lag follows this list).
  • More normalized data and fewer network round trips. In MongoDB, there is no support for joins. If the data is highly normalized, the client application has to issue more queries (more network round trips, which add latency) to fetch data from different collections (tables). We had to de-normalize the data in DPS to reduce network round trips. The cost was that the same data could be scattered across different collections, which not only occupied more disk space but could also easily lead to data inconsistency. Applications also needed to do the busy work of duplicating data in different collections, which became frustrating at times. It is a good idea to carefully design your schema in order to make the right tradeoffs for your application.
  • Giving up multi-document or multi-collection transactions. Using write concerns and read preferences can mitigate some of the data consistency and durability problems without using transactions, but it cannot guarantee atomic updates across multiple documents or multiple collections. As of now, we are still not confident enough to use MongoDB in Perfect Market Vault (a web-based admin tool used by both internal administrators and external partners), because the nature of the data Vault manages makes its requirements on multi-document or multi-collection consistency much more demanding than those of DPS.
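Related to the second tradeoff above, here is a minimal sketch of one way to watch replication lag from the Java driver, by running the replSetGetStatus admin command and comparing member optimes; the host name is hypothetical and the field names assume the replica set status output of the MongoDB 2.x series:

import java.util.Date;
import java.util.List;

import com.mongodb.BasicDBObject;
import com.mongodb.CommandResult;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class ReplicationLagCheck {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("dps-node1.example.com", 27017);

        // replSetGetStatus must be run against the admin database.
        CommandResult status = client.getDB("admin")
                .command(new BasicDBObject("replSetGetStatus", 1));

        @SuppressWarnings("unchecked")
        List<DBObject> members = (List<DBObject>) status.get("members");

        // Find the primary's optime, then report how far each secondary is behind.
        Date primaryOptime = null;
        for (DBObject m : members) {
            if ("PRIMARY".equals(m.get("stateStr"))) {
                primaryOptime = (Date) m.get("optimeDate");
            }
        }
        for (DBObject m : members) {
            if ("SECONDARY".equals(m.get("stateStr")) && primaryOptime != null) {
                long lagSeconds = (primaryOptime.getTime()
                        - ((Date) m.get("optimeDate")).getTime()) / 1000;
                System.out.println(m.get("name") + " lag: " + lagSeconds + "s");
            }
        }
        client.close();
    }
}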

 

Closing Thoughts

As Perfect Market expands our use of MongoDB, we continue to find more reasons to stick with it. Whether you are an early MongoDB adopter like us or new to the NoSQL world, I hope this blog post has given you a good starting point for incorporating or expanding the use of MongoDB in your architecture. I applaud MongoDB for listening closely to the community and using our feedback to help the product evolve. We hope they expand their footprint beyond the document DBMS and play a much more important role in providing complete big-data solutions.

 

 
