A few weeks ago I blogged about the HTTP JSON api in Drizzle. (See also a small demo app using it.) In this post I want to elaborate a little on the design decisions taken. (One reason to do this is to provide a foundation for future work, especially in the form of a GSoC project.)
Looking around: MongoDB, CouchDB, Metabase
Designing a HTTP JSON API is of course inspired by the fact that there already are some NoSQL databases that do that. So it makes sense to look more closely into what they do exactly, learn from them, and possible even try to be compatible. When designing the JSON key value interface to Drizzle json_server plugin, I looked at the following:
MongoDB
The MongoDB manual, especially the chapter on Querying.
Mongo Wire Protocol (no, it's not HTTP)
Comparing MongoDB and CouchDB
I also found these posts from Anders Karlsson useful, perhaps since we share the same background with MySQL:
NoSQL for us RDBMS folks - MongoDB 101
MongoDB for MySQL folks - part 2
MongoDB for MySQL folks part 3 - More on queries and indexes
CouchDB
CouchDB Guide, in particular Getting Started page.
MQL
Metaweb Query Language is used by the Freebase online database. It is another JSON based query language but there isn't really a database product you would use it with, it is only used by Freebase. I came to know about MQL mainly through Ronald Bouman's MQL to SQL project, which is a proxy that converts HTTP JSON queries to SQL against a MySQL database. I used Roland's documentation as my primary reference to MQL - see in particular the Samle MQL Queries page. (Roland has also provided many helpful insights in tweets and blog comments.)
Choosing a protocol
I was surprised to learn MongoDB doesn't have a HTTP API rather uses its own binary protocol for which you need specific MongoDB client libraries. Very similar to the SQL world. Supposedly, this gives MongoDB (and SQL databases) more performance.
Even so, this choice is easy. We want to choose HTTP. If nothing else, because that's what Stewart already implemented in 0.1 version. And also, because then we can access Drizzle over HTTP without any client libraries.
Choosing the query "language"
As we settled for HTTP, it then feels natural to focus more on how at least CouchDB does things, since there is an opportunity to become compatible with it.
CouchDB implements a very pure REST design and this leads to a strong key-value emphasis. Basically you end up accessing records via URLs like:
GET https://servername:5984/databasename/documentid
As you can see, there is no obvious way to express something like filtering on another field, let alone a secondary index. For anything that is more complex than a key-value fetch, you will have to write a JavaScript function that will scan through your documents key-by-key. Sounds both complex and inefficient.
Note that even if the Drizzle HTTP JSON api currently only supports key-value operations, I wanted to select a design that will nicely support also more than that.
Btw, note at this point that CouchDB stores the keys under a json key "_id" and they have values that look like "6e1295ed6c29495e54cc05947f18c8af". A complete CouchDB document looks like:
{"_id":"6e1295ed6c29495e54cc05947f18c8af","_rev":"1-2902191555","title":"There is Nothing Left to Lose","artist":"Foo Fighters"}
Ok, giving up on CouchDB, I looked at MQL. With MQL, you specify a query document in JSON:
{
"type": "/sakila/film",
"film_id": 1,
"title": null
}
This query document asks to get all records where film_id equals 1. The title field is null, this means that it acts as a placeholder: we want the database to add the value of the film title, then return the document to us. The table may contain other fields, but we will only get "title" in return, because that's what we asked for.
The field name "type" is special: It basically denotes the schema and table name where to query from. Even if this is schemaless databases we are talking about, it seems we are supposed to assume that records with the same type are going to be similar.
Note that except for "type", there is no "_id" or other special fieldnames. In this case I know the sakila film database and know that film_id is a primary key, but it could have any name.
MQL also supports more complex operations. For instance, this query object maps to an IN (1, 2, 3) query in SQL:
[{
"type": "/sakila/film",
"film_id|=": [1,2,3],
"title": null
}]
Ok, I kind of like the basic approach of using a query document. If there is a value, then this field is used as a parameter. (WHERE, in SQL) If there is a field name but it has a null value, then it means the database should return that value. This much I like.
I don't like the type field being part of the record (or document, if you want to use schemaless terminology) itself. Records should be inside the table, the table(name) should not be inside the record. For instance, this means when returning results, you'd have to insert the table name into every single record you return. Or store the table name into the table itself, which is probably even worse.
I also don't like the more advanced operators like "film_id|=" being a way to express "film_id IN".
So what about MongoDB?
It is a little bit of both. It also has query documents, so you'll do this to find people called "Smith":
db.users.find({'last_name': 'Smith'})
If you only want to know the first name, and not get the full record back, you'll specify it like this:
db.users.find({last_name: 'Smith'}, {'first_name': 1});
Hmm... Okay. I liked the MQL way better.
MongoDB also stores the key in a field called "_id", but it is of a Mongo-proprietary ObjectId() type. Looks like this:
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }
A more complex Mongo query looks like this:
db.updates.find( { user : { $in : [1, 2, 5] } } ).sort( { ts : -1 } ).limit(10)
Notice the "$in" keyname, it is again an attempt to express the SQL IN() operator by overloading the meaning of a field name. I find that MongoDB's use of dollarsign+name works slightly better for me than the magic operators appended by MQL, but it still looks hacky.
You'll also notice there's a call to separate .sort() and .limit() methods, the meaning of which seems obvious. But this is now in the specialy MongoDB query language. So it is not something we can use as such for Drizzle which is purely HTTP and JSON.
The Drizzle implementation
In summary I ended up choosing a mix of the above. I like the MQL approach where a query object is basically a template: If I provided a field with a value, then please query for records where these fields match the given values. If I provided an empty field, it means please give me the value of this field back. This is elegant and intuitive.
The Drizzle implementation actually returns all records currently. But returning only the requested records could be implemented with the chosen design.
The Drizzle implementation uses a field/key called "_id" for the primary key. It takes integer values, so it is not strictly identical to either Couch or Mongo, but at least it looks familiar.
I obviously did not take the special "type" field from MQL. Instead you give the schema name and table name on the query string part of the URL. (In the future, the DBA could configure default values and allow users to omit them.)
Drizzle currently does not support any kind of filtering or secondary indexes in the JSON interface. I haven't completely decided how to express those. I didn't quite like any of the above examples. Doing a for loop in JavaScript is out of the question I suppose :-) I don't like inserting | or $ or any other special character into places where I expect to see only field names. This seems hackish.
But of the above two the MongoDB approach seems more usable, so unless someone has a better idea, it will probably come to that.
It should be noted that even if I chose the "query object" approach, in the end I also implemented a way where you can just append the key value to the URI too and omit the query object. It was easy to implement and makes life easy if you want to fetch some records with curl or wget. Also it might be useful is you want to shard your Drizzle JSON database, because you could use a HTTP proxy to direct queries to the correct shard by looking at the key value given in the URI.
- Log in to post comments
- 155539 views
Why didn't you evaluate using
Why didn't you evaluate using GET query strings? This is the HTTP native way to filter.
You might also want to check out http://www.unqlspec.org/display/UnQL/Home .. though it seems lingering and I haven't followed it closely myself. I am not sure how far the integration with CouchDB's HTTP API is yet, but there might be something there.
MQL / type attribute
Hi Henrik,
"I don't like the type field being part of the record (or document, if you want to use schemaless terminology) itself. Records should be inside the table, the table(name) should not be inside the record. For instance, this means when returning results, you'd have to insert the table name into every single record you return. Or store the table name into the table itself, which is probably even worse."
There is no requirement to store the schema / type in the table rows, and my MQL implementation doesn't assume they are. In a way a "magical" attribute like MQL type isn't so different from the _id attribute used by couchdb. You don't know if that is literally stored inside the documents either. Physically they could just as well be the index entry, and added to the document on output.
True about the _id being
True about the _id being magical too, and I had a few issues when trying to figure out how to map that to a primary key column in Drizzle. But in the design I chose for Drizzle HTTP JSON server, the _id field is just another existing column, the only magical things about it are the data type (integer) and the name itself.
With the type column, I see the main issue is that for a given table, there is basically a constant string stored in every record in that table. Ok, you say that it wouldn't really be stored in the table, it can be removed and inserted into the json document on the fly. Ok, but I don't like that either. It's just not nice.
All this is different from the _id because the _id logically is a part of the record - every record has their own _id.
Finally, I didn't like the chosen name "type". There could be an actualy column name "type" which collides. At least it should be called "_type".
Another more REST-y approach
Another more REST-y approach would be use the Path of the URL to identify which schema / table you're working on if you're fine with limiting yourself to single table requests, something like:
http://drizzle//?query={}
The reason why MQL doesn't do this is that it is capable of returning joined, nested resultsets and even attributes from multiple types within the same object container, so it wouldn't make much sense to think of the result as coming from one particular path.
Note: The URL above is
Note: The URL above is actually http://drizzle/schema/table?query={}
Yes, I also had another
Yes, I also had another person point out the a URL like http://servername/schema/table/key would have been more REST-like. I did consider this, but in the end I ended up with the scheme of:
http://servername/api?schema=schemaname&table=tablename
Other than that this was easy to implement, my main excuse for doing it this way is that now specifying schema and/or table can be made optional, but if they are part of the URI path, then of course you cannot leave out components from the middle.
But you are right that in my approach I could have done what you suggest, because the approach is to always access only a single table. We could then support something like views (which could be a javascript function or something) that can provide results that are aggregates from many objects.
A more REST-y approach is
A more REST-y approach is what PostgreSQL is doing http://wiki.postgresql.org/wiki/HTTP_API
I suppose you can always find
I suppose you can always find out something that is even RESTier.
I realize it's contrary to the current fashion, but IMO REST isn't such a great fit for a RDBMS query interface. The possibility of endless bickering about how HTTP should map to which parts of the RDBMS native query language is one of the reasons.
sure .. but the question is
sure .. but the question is why bother with an HTTP interface if you are not going to leverage its native features? the exciting thing about an HTTP interface isn't just that its not binary. there is also to be gained with decentralizing cache layers for example. but for that you need to try and fit in as good as possible.
Lukas, fair point. That said,
Lukas, fair point.
That said, (this may be ignorance on my behalf - if so, please correct me) as long as you're accessing RPC services with GET request (like one typically would with a query protocol), I don't see why this would exclude or complicate this kind of caching. I'm aware that old-skool webservices, in particular SOAP tend to use POST and the message to communicate everything, but I feel that's just a matter of bad implementation, not an invalidation of RPC as a principle.
I don't have a gripe with REST as a whole, it just occurs to me as something that is not particularly suitable for structured and semi-structured data. Maybe in Hendriks case this is actually a good fit, since (if I understand correctly) this JSON interface is mainly about searching through and returning blob column values, and there is no intention to let the query control, say, which fields will be returned.
From what I can tell, the minute you want to support say, a partial update (please change fields so and so, leave others be) or partial selection (please give me fields so and so, don't send the others) RESTS stops being a good fit. I suppose you can resort to things like custom HTTP methods, or invent magical resource names or path elements to "fix" this, but I think the medicine then starts to become worse than the disease.
There is an intention to
There is an intention to allow selecting only some keys of the whole document/record, but it is not implemented yet. This is one reason why I chose to use a query object / query document like the one MQL uses (and a bit like MongoDB uses). It would perhaps be possible to do it also with URI query strings, but I don't see any particular benefit in doing so. Otoh the MQL design of supplying a query object seems quite natural.
Btw Lukas, while the Pg
Btw Lukas, while the Pg design is interesting, I'm not trying to do the exact same thing here. Pg is trying to design an api to access objects of a relational database. However, I'm trying to use a relational database as a foundation for a schemaless json document store. In other words, the api I'm describing will store json as blobs, period. So the approaches are pretty much the opposite (and both are useful exercises).
If you already have some database, with columns of various data types, then this api cannot access those tables. (If nothing else, it will error out when contents of columns are not valid json.)