The most popular Linux distribution for the desktop

Ubuntu Linux Journal

Subscribe to Ubuntu Linux Journal: eMailAlertsEmail Alerts newslettersWeekly Newsletters
Get Ubuntu Linux Journal: homepageHomepage mobileMobile rssRSS facebookFacebook twitterTwitter linkedinLinkedIn


Ubuntu Linux Authors: Elizabeth White, Don MacVittie, SmartBear Blog, Liz McMillan, Jnan Dash

Related Topics: Ubuntu Linux Journal, Java Developer Magazine, Open Source Journal, CIO/CTO Update

Blog Feed Post

MongoDB analytics series: Slamdata – Run SQL and build reports directly on MongoDB

This a guest post by John A. De Goes . John is the CTO & co-founder of SlamData. When not working on tricky compilation issues for SlamData, you can find John speaking at conferences, blogging, spending time with his family, and being active in the foothills of the Rocky Mountains. Contact John at john@slamdata.com

MongoDB has been hugely successful in the developer community, partially because it allows developers to store data structures directly in a fast, scalable, modern database.

There's no need to map those data structures to rigid, predefined, and flat tables that have to be reassembled at runtime through lots of intermediate tables. (Described that way, the relational model sounds kind of old fashioned, doesn't it?)

Unfortunately, the world's analytics and reporting software can't make sense of post-relational data. If it isn't flat, if it isn't all uniform, you can't do anything with it inside legacy analytics and reporting solutions!

That's a real problem, because developers create MongoDB apps that generate tons of data. Eventually, businesses want to slice and dice this data, build reports, and gather insight.

You can do all that today, but you need to write gobs of code, transform and replicate MongoDB data into another database or data warehouse, or otherwise go through intense, prolonged pain and suffering of the type I like to avoid.

That's exactly why I founded SlamData. SlamData's an open source company developing a next-generation analytics and reporting tool for NoSQL data.

We just launched the beta version of SlamData, which supports MongoDB. In this post, I'll talk about how you can use this beta to query MongoDB using ordinary SQL, what some of the limitations are, how SlamData compares to the competition, and where the project is going.

Getting & Installing SlamData

You can build SlamData from source code on Github. But if you don't like mucking around with source code, I recommend downloading the GUI installers from the SlamData website.

There are pre-built installers for Linux, OS X, and Windows, and they come bundled with everything you need (except a running instance of MongoDB!).

During the install process, the wizard will ask you for the MongoDB Connection URI (which can include a username and password), the specific database you want to use (don't worry, you can add more later), and where in the SlamData file system you want to mount that database (which you can leave as '/', the root path).

After installation, you can launch SlamData with a double-click. Note that there are actually two components to SlamData: the front-end, and a lightweight server that connects to MongoDB. By default both will be installed, and both will launch when you double-click SlamData.

SlamData Front-End

The SlamData front-end, shown in the screenshot above, is a visual interface to all the power of SlamData (there's a REPL, too, for those who prefer a command-line interface).

The file system tree shows you all the mounted databases, as well as the collections inside these databases. It's a visual reminder of all the data you can query, although in later versions you'll be able to do a lot more with the file system.

The editor interface allows you to create notebooks, which are collections of text (Markdown), queries (SQL), and visualizations (charts). A notebook is like a report, except unlike most reports, it can be interactive and can refresh from live data.

Writing Queries

Time to write some queries!

In these examples, I'm going to assume you have the Zip Code Data Set loaded into your MongoDB database, which you can download from the MongoDB site.

This data set has basic information on zip codes, such as where they are located (city, state, geo) and their population.

Let's pull back the first 5 zip entries:

SELECT * FROM zips LIMIT 5

{ "_id" : "35004" , "city" : "ACMAR" , "loc" : [ -86.51557 , 33.584132] , "pop" : 6055 , "state" : "AL"}

{ "_id" : "35005" , "city" : "ADAMSVILLE" , "loc" : [ -86.959727 , 33.588437] , "pop" : 10616 , "state" : "AL"}

{ "_id" : "35006" , "city" : "ADGER" , "loc" : [ -87.167455 , 33.434277] , "pop" : 3205 , "state" : "AL"}

{ "_id" : "35007" , "city" : "KEYSTONE" , "loc" : [ -86.812861 , 33.236868] , "pop" : 14218 , "state" : "AL"}

{ "_id" : "35010" , "city" : "NEW SITE" , "loc" : [ -85.951086 , 32.941445] , "pop" : 19942 , "state" : "AL"}

What are the cities that have the top 5 most populous zip codes? Piece of cake:

SELECT city, pop FROM zips ORDER BY pop DESC LIMIT 5

{ "_id" : "60623" , "city" : "CHICAGO" , "pop" : 112047}

{ "_id" : "11226" , "city" : "BROOKLYN" , "pop" : 111396}

{ "_id" : "10021" , "city" : "NEW YORK" , "pop" : 106564}

{ "_id" : "10025" , "city" : "NEW YORK" , "pop" : 100027}

{ "_id" : "90201" , "city" : "BELL GARDENS" , "pop" : 99568}

SlamData's dialect of SQL (called SlamSQL) has full support for nested documents and arrays, and you access them using ordinary Javascript notation (e.g. SELECT foo.bar.baz).

Sometimes, you'll want to flatten out an array so you can dig into the contents of that array. In the zips data set, there's a loc array field that contains a geolocation. To flatten this array, just use the star operator ('*') as the array index:

SELECT loc[*] FROM zips LIMIT 1

{ "_id" : { "$oid" : "53ecf39574d2394acbd4b6f1"} , "loc" : -86.51557}

Finally, let's use SQL's GROUP BY operator to sum up the population of all the zip codes for each city, and then extract out the top 5 most populous cities:

SELECT city, SUM(pop) AS pop FROM zips GROUP BY city ORDER BY pop DESC LIMIT 5

Pretty cool, huh?

Most MongoDB functions are currently supported in the beta, although they go by their standard SQL names (or by their PostgreSQL names, as the case may be). These include the following:

  • CONCAT, LOWER, UPPER, SUBSTRING
  • DATE_PART
  • COUNT, SUM, MIN, MAX, AVG
  • +, -, *, /, %
  • ARRAY_LENGTH

Beta Limitations

The beta is called "beta" for a reason: it doesn't implement all of SlamSQL and may have a few bugs in the parts of SlamSQL it does implement.

In particular, you should pay attention to the following limitations:

  • Although GROUP BY is supported, HAVING clauses have not yet been implemented.
  • All the functions not natively supported by the Aggregation Pipeline (e.g. string length) are not yet supported.
  • Selecting individual array elements is not yet supported (e.g. SELECT loc[0] FROM zips) due to a known issue in MongoDB.
  • DISTINCT, and other set-level operations like UNION, UNION ALL, are not yet supported.
  • JOINs are not yet supported in the officially released version (though we have a branch on Github that supports them, for those who like to be on the bleeding edge!).
  • There are various edge cases not yet implemented.

How SlamData Stacks Up

Currently, there aren't a lot of tools on the market for doing analytics on MongoDB.

In the open source world:

  • There's the MongoDB drivers (available in Python, Java, etc.), but these libraries are designed for developers, not analysts or data scientists, and you need to write code to use them.
  • PostgreSQL has an open source foreign data wrapper for MongoDB contributed by CitrusData. Right now, it can only push simple filters into MongoDB. All other operations require streaming the data back from MongoDB to PostgreSQL.
  • MoSQL is an open source tool that simplifies ETL from MongoDB into PostgreSQL. You need to rigidly define schemas, flatten your data, and avoid arrays, as well as maintain separate infrastructure for analytics and reporting.

In the commercial sector:

  • AquaFold's Data Studio has a MongoDB client. It's not clear exactly what it pushes down to the driver, although it appears to be some expressions, filters, and maybe grouping. For some operations, however, data must be streamed back to the client, making it impractical for many applications.
  • JSONStudio has an interface that allows you to "visually" build pipeline queries for the MongoDB aggregation framework, but it's subject to all the same limitations as the aggregation framework, and doesn't benefit from the standardization and declarativity of SQL.
  • Simba Technologies has a MongoDB ODBC Driver. It can push down some filters and aggregations to MongoDB, but everything else has to be streamed back to the driver for further processing.
  • Progress Software has a family of data connectivity products called DataDirect, with a MongoDB connector. It's not clear how much these drivers push down to MongoDB, but they're probably very similar to the driver from Simba, and they do require that you build schemas.
  • UnityJDBC has a JDBC driver for MongoDB, but it relies on streaming to do most of the work (except for raw and very simple filtering).

SlamData stacks up very well against both open source and commercial offerings:

  1. SlamData is 100% open source, with nothing to buy or license.
  2. SlamData pushes 100% of every query down into MongoDB, without streaming anything back to the client. Ever.
  3. SlamData requires no schema definition, no schema mapping, and no ETL of any kind. It just works with the data as it already exists in MongoDB.
  4. SlamData gives nested documents and arrays first-class status. It's easy to dig into any structure using a clean, simple, and obvious extension of SQL (e.g. foo.bar.baz[2].fuz).
  5. SlamData has a snazzy but lightweight front-end that lets you build interactive reports, complete with charts. It's easy to hand off the front-end to "that person" in your team or organization, freeing you up to do what you're paid to do instead of writing code to generate reports.

Roadmap

In early October, we plan to release the production version of SlamData.

In this release, we'll have addressed many of the limitations of the beta, improved the way we can compile SQL to MongoDB, and made the front-end a lot easier to use and more powerful.

As an open source project, our success comes down to adoption. If you want to see SlamData continue to evolve into the best open source tool for NoSQL analytics and reporting, then download SlamData, fork or star the repositories, and comment on any bugs you find or features you'd like to see added.

Read the original blog entry...

More Stories By Dharshan Rangegowda

Dharshan is the founder of MongoDirector.com. Previous to MongoDirector Dharshan worked in the Virtualization and Data management groups in Microsoft.