• srdan Mar 9, 2012

    Storing stuff not worth storing (MongoDB) by srdan

    Developing an analytics platform means dealing with enormous quantities of rapidly evolving data. In order to manage it we at Burt use a database called MongoDB. On paper MongoDB seems to be the ultimate solution: it’s fast, easy to cluster and shard, doesn’t use schemas and supports indexes and some atomic operations, but… Oh ‘BUT’… three simple letters. Three simple letters that have shattered empires.

    Cast off the shackles of SQL-statement oppression, brother [1]! Deciding beforehand what columns you’re gonna use, and the relations between tables and keys and foreign keys, is ancient-style stuff for schmucks. You want to use something modern. Something like MongoDB... right? I mean, that IS what everyone’s using these days?

    With MongoDB you don’t need to know how to write complex queries and constantly update your schema definitions. If you want to store a new field in the database, you just go ahead and store it. When the database grows and your queries start taking longer and longer, you can just add indexes for the specific fields and have everything run fast again.

    MongoDB and Ruby are in many ways a match made in heaven. Two dynamically typed beauties communicating effortlessly and gracefully in a sea of data is damn near profound [2]. Here’s how to connect to your local db, create a database called stuff, create the collection shiny and add an item to it.

    require 'mongo'

    db = Mongo::Connection.new.db('stuff')
    db['shiny'] << {:thing => 'asus transformer prime 2'}

    That’s a million hot-dogs worth of awesome right there. Mongo will add your item to the collection and create a unique id for it. If you set up your mongo in a cluster it will actually replicate all the data across nodes and load balance everything automatically, based on ids. Getting an array of all items from shiny is as simple as:

    db['shiny'].find().to_a

    But that’s just a part of it. Mongo has indexes on fields. Ascending and descending. That’s pretty cool for a nosql. And you can increment values, push items to arrays without reading them first - you know, the atomic functions. Stuff that you can call from multiple processes at the same time without having to worry about race conditions.

    db['shiny'].ensure_index(:thing)
    db['shiny'].update({:thing=>'asus transformer prime 2'}, {:$push => {:who_wants_one => 'me'}})
    db['shiny'].update({:thing=>'asus transformer prime 2'}, {:$push => {:who_wants_one => 'you'}}) 
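
    To see why sending only the change matters, here is a minimal sketch that needs no database at all. It uses a plain Ruby hash as a stand-in for the stored document (an assumption for illustration, not how Mongo works internally) and compares two clients doing read-modify-write with two clients sending just their change, the way $push does. The helper names are made up for this example.

```ruby
# A plain hash standing in for the stored document (illustration only).
def fresh_doc
  { :thing => 'asus transformer prime 2', :who_wants_one => [] }
end

# Deep copy, simulating a client fetching its own copy of the document.
def read(doc)
  Marshal.load(Marshal.dump(doc))
end

# Read-modify-write: both clients read before either one writes back,
# so the second write silently replaces the first.
def read_modify_write
  stored = fresh_doc
  copy_a = read(stored)
  copy_b = read(stored)
  copy_a[:who_wants_one] << 'me'
  copy_b[:who_wants_one] << 'you'
  stored = copy_a
  stored = copy_b
  stored[:who_wants_one]
end

# $push-style: each client sends only the change, and the "server"
# applies the changes one after the other to the stored document.
def atomic_push
  stored = fresh_doc
  ['me', 'you'].each { |who| stored[:who_wants_one] << who }
  stored[:who_wants_one]
end

p read_modify_write  # the 'me' entry is lost
p atomic_push        # both entries survive
```

    The interleaving is hard-coded rather than produced by real processes, so the lost update shows up on every run instead of once in a blue moon.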

    And MongoDB is fast. Really really fast. Google around for some benchmarks, you’ll be amazed. There’s really no reason not to use Mongo. It’s scalable, easy to use, set up and manage. No reason. No.. Well.

    There’s just one thing.

    You see, MongoDB is the lovable stoner in the sitcom of life. He’s a good guy and you always feel for him, no matter how much he screws up. You gotta remember though: The stoner never gets to drive the Camaro.

    Once your data starts piling up you’ll notice peculiarities. Big chunks of data will be missing. You see, mongo doesn’t really let you know that he had problems writing down your stuff. He’ll just go ahead and nod and smile until you ask him to repeat what you just told him. Then he’ll look all confused and go:
    - Sorry man. I… I sort of ran out of disk like 20 minutes ago. Soo.. Yeah. No data.
    - What? What the hell, man!? Why didn’t you tell me you ran out of disk?
    - I dunno. You seemed so happy and there was so much data. I.. I thought I’d get to it eventually. But then I died and now it’s all gone…
    - Dude, WTF! That was important shit!
    - Well, if it’s that important, why didn’t you ask me if I saved it? I guess we both learned a lesson here.

    So you find yourself asking ‘Are you getting this?’ at the end of every command [3] to make sure he’s getting everything. I mean, even though your data may not be vital, it’s always good to be in charge of what data is discarded rather than having unspecified chunks of it removed at random.
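
    A rough sketch of what asking only every Nth time could look like. It assumes the 1.x Ruby driver, where insert accepts a :safe option that makes the driver wait for the server's answer. SampledSafeWriter and FakeCollection are hypothetical names made up for this example; neither is part of the mongo gem, and the fake is only there so the sketch runs without a server.

```ruby
# Hypothetical wrapper (not part of the mongo gem): passes :safe => true
# on every Nth insert so the driver waits for the server's answer.
class SampledSafeWriter
  def initialize(collection, every = 100)
    @collection = collection
    @every = every
    @count = 0
  end

  def insert(doc)
    @count += 1
    safe = (@count % @every).zero?
    @collection.insert(doc, :safe => safe)
  end
end

# Stand-in for db['shiny'] that just records the :safe flag it was given.
class FakeCollection
  attr_reader :safe_flags

  def initialize
    @safe_flags = []
  end

  def insert(doc, opts = {})
    @safe_flags << opts[:safe]
    doc
  end
end

fake = FakeCollection.new
writer = SampledSafeWriter.new(fake, 100)
250.times { |i| writer.insert(:n => i) }
p fake.safe_flags.count(true)  # number of inserts that asked for confirmation
```

    With the real driver you would hand it db['shiny'] instead of the fake. Keep in mind that a safe write only vouches for itself: the 99 writes before it are still taken on trust, so this tells you that something is wrong, not which documents went missing.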

    You’ll also notice that the automatic cluster balancing doesn’t really work for high loads, which is sort of the point.

    Granted, this episode was from running mongos 1.8.3 on Amazon, a sort of manic-depressive server setup. 2.0.1 seems to be considerably better here. We have, however, moved away from mongo for the most intensive use cases, so while 2.0.1 seems much better, it doesn’t carry the load that our 1.8.3 setup used to.

    It’s also worth considering that to set up a properly sharded and clustered environment you will need a ridiculous number of servers. Some of them are tightly coupled and if any of these VIP servers fails you could be in a lot of trouble. I mean, check out the topology overview from http://www.MongoDB.org/display/DOCS/Sharding+Introduction.

    The mongo sharding/replica set topology

    I’m not going to delve into all things mongo. Suffice it to say MongoDB was our preferred backend and we are now moving away from it. Theo (@iconara) held an excellent talk on the subject. Have a look at it here.

    [1] Or sister. You know. It’s the 21st century.. Come work here btw!
    [2] One quick thing. If you’re using Java and something like Hibernate, you’re doing it wrong. You don’t buy a supercar if you live in the mountains. That’s just stupid. And I love the mountains! Figuratively. But in the figurative mountains you use a figurative god damn pick-up truck! If you want to go dynamic on your database, you use Ruby. Or Python. Or Node. You do not use Java.
    [3] Actually, I believe we do it every 100th command or something like that. As I said, it’s important that we know when bits are missing, not that every bit is always there.
