Monday, September 8, 2008

Searching the Grid: BigTable and Geo Searches

Google's BigTable datastore looks like a relational database.  It smells like a relational database.  It even tastes just a little like a relational database.  But it's no relational database.  Google's not shy about this, either.  In their words: "It's not a relational database!"  Well, OK, I'm paraphrasing, but it really isn't.

Once upon a time, I wrote a multidimensional game of chance where, based on the statuses of adjacent sectors of a playing field, different options of attack and defense were possible.  The framework of the game (who is playing, whose turn is next, what the display looks like) was all written in Perl, but once a player's turn started, all of the logic--ALL of it--was in MySQL.  It was beautifully convoluted, and terribly processor- and RAM-intensive, but man, did it keep my Perl code clean.

BigTable's query language (GQL) may look like SQL, but it's severely limited, and every day there's someone new on the board asking, "This is so easy in MySQL, why can't I do X?"  The reason is that BT/GQL comes from a whole different design philosophy.  Not better, not worse, but certainly different, and it takes a bit of pondering to get used to some of the ideas.

Before I get to my point, I'll go back a bit in history.  I don't know how they came up with this plan, but if I had designed BigTable, this is what would have inspired me.

Most of you are likely familiar with RAID for storage and data security.  The idea is that 10 cheap disks, set up correctly, can have better performance and reliability than one expensive, fast disk that's got the same storage capacity as the 10-disk array is configured to support.  The 10 disks can have layers of redundancy spread across them, they can have the data distributed in such a way that effective latency is reduced, and the bus connecting them to the computer can be kept saturated for the greatest efficiency.  And the best thing is that they're cheap.  One disk fails, you pop a new one in, the data rebuilds, and you're safe until the next one breaks.  With some RAID configurations, you might be able to lose several disks and still be able to save your data.

Google (and many other companies) took that same philosophy and applied it to their hardware as well.  While they certainly may have some specialized systems in a lab somewhere, the hard work is done by lots and lots of cheap, easily replaced Linux systems in a big cluster.  A node goes out, just pop a new one in.  They go through a lot of hardware, I'm sure, but everything is fast, fast, fast, and--at least for search--they essentially never have downtime.

So here's the exciting part.  BigTable extends that philosophy to data.  Instead of making the datastore smart and letting you do all sorts of processor-intensive queries, they'd rather have you build in all sorts of redundant data when you save your objects so that they can index your information for easy searching.

A good example is geographic searches.  A lot of people are searching for the solution to this question: 

  "How do I search for all the records in my database within Q degrees of my point, (long1, lat1)?"

In MySQL, it'd be easy: when you save, save the lat and long, and when you search, do a simple bounding-box query on both coordinates.
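
Something like this, assuming a sales table with lat and lng columns and a stock MySQLdb cursor (the names are just for illustration):

    # everything in a ~0.3-degree box around (lat1, long1) -- up to about 20 miles on a side
    cursor.execute("SELECT * FROM sales"
                   " WHERE lat BETWEEN %s AND %s"
                   " AND lng BETWEEN %s AND %s",
                   (lat1 - 0.15, lat1 + 0.15, long1 - 0.15, long1 + 0.15))
    rows = cursor.fetchall()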

It's not totally pretty, but depending on where on the earth you are, you'll get something up to about a 20 mile by 20 mile square (less area and less square as you get closer to the poles, since the earth is round).

But this query isn't possible in GQL, as only one dimension of inequality is allowed.  Basically, this is because, when they're done indexing your data, they want every possible request to be representable as a simple flat-file list, so that based on the bounds, they can just pick out the right section of that list.  Multiple inequalities mean multiple dimensions, and that suddenly gets expensive in compute cycles.

Someone came up with a nice solution called geohashing, which is pretty cool.  As you move north and east, the geohash that comes out of your lat/long combination gets higher in value; as you move south and west, it gets lower.  All you have to do is do a geohash of your object when you store it, and then when you do your query, it looks something like this:

   upperBound = geohash(mypoint.lat + 0.15, mypoint.long + 0.15)
   lowerBound = geohash(mypoint.lat - 0.15, mypoint.long - 0.15) 
   surroundings = db.GqlQuery("SELECT * FROM Points WHERE geohash > :1 AND geohash < :2", lowerBound, upperBound)
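
The geohash function itself isn't magic, by the way.  Here's a minimal sketch of the standard encoding (my own illustration, not any particular library): interleave longitude and latitude bits by repeatedly halving the coordinate ranges, then base32-encode the bits.  Pinning the precision to a fixed number of characters also helps with the accuracy complaint coming up next.

    GEOHASH_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

    def geohash(lat, lon, precision=12):
        # standard geohash: interleave lon/lat bits by halving ranges, then base32-encode
        ranges = {'lon': [-180.0, 180.0], 'lat': [-90.0, 90.0]}
        vals = {'lon': lon, 'lat': lat}
        bits = []
        which = 'lon'                      # the encoding starts with a longitude bit
        while len(bits) < precision * 5:
            lo, hi = ranges[which]
            mid = (lo + hi) / 2
            if vals[which] >= mid:
                bits.append(1)
                ranges[which][0] = mid
            else:
                bits.append(0)
                ranges[which][1] = mid
            which = 'lat' if which == 'lon' else 'lon'
        out = ''
        for i in range(0, len(bits), 5):
            n = 0
            for b in bits[i:i + 5]:
                n = (n << 1) | b
            out += GEOHASH_BASE32[n]
        return out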

Very cute, very useful, but it has problems.  First, there are accuracy issues that might force you to add extraneous digits in cases where the latitude and longitude values you're searching on have few digits after the decimal.  A hash of 39,-122 might look like "ABCDE", while a hash of 39.123,-121.98765 would look like "ABCDEFGHIJK".  The difference in precision alone might throw off a comparison.  As such, you have to make sure that all your numbers have the same degree of accuracy, so tricks like turning 39 into 39.0000000001 and 39.123 into 39.1230000001 are needed to make it work.

And again, all of this defeats the purpose of BigTable: it's a bit compute-intensive, and it uses inequalities where you don't actually need them, meaning that if you've got another dimension you want to filter on with an inequality, you can't without a redesign.

So we come back to the redundant data option and my solution to the problem, which is to define the world as a distorted grid, save each point with a list of its neighboring grid squares, and then search on that.

The way I do it is this:  In addition to the lat and long for each garage sale I save to the datastore (and some other metadata), I save a list I call "surroundings", and that list holds nine ASCII strings derived from low-precision long/lat points.

To wit: if I have a point at long -123.11231212312 and lat 33.4456789, I shift the decimal point over one place, drop everything after it, and save the result as an ASCII coordinate string.  That point becomes -1231,334.  But that's just the grid square that the point is in (about 7 miles wide at the equator), and searching just for points within that grid won't give you great results if the originating point is on the west edge of that grid square.  There might be a garage sale 0.1 miles to the west that will never be found.
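
In code, that works out to something like this (a sketch; the int() truncation matches the -1231,334 example, and mLon and mLat are the same values used in the list below):

    mLon = int(longitude * 10)             # -123.11231212312  ->  -1231
    mLat = int(lattitude * 10)             #    33.4456789     ->    334
    cell = str(mLon) + ',' + str(mLat)     # "-1231,334", one 0.1-degree grid square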

To resolve this, I don't just save the grid square that contains the original point, but also all the grid squares around it.  Once I've gotten the modified lat and long, I just take all the combinations of adding and subtracting one from each of them, and push them all into a list.  It looks something like this:

    grid = [str(mLon-1)+','+str(mLat+1),
            str(mLon)+','+str(mLat+1),
            str(mLon+1)+','+str(mLat+1),
            str(mLon-1)+','+str(mLat),
            str(mLon)+','+str(mLat),
            str(mLon+1)+','+str(mLat),
            str(mLon-1)+','+str(mLat-1),
            str(mLon)+','+str(mLat-1),
            str(mLon+1)+','+str(mLat-1)]
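
If you prefer, the same nine strings can be built in one shot; this is equivalent to the list above:

    grid = [str(mLon + dx) + ',' + str(mLat + dy)
            for dy in (1, 0, -1)
            for dx in (-1, 0, 1)]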

With this done, I save my point, including the metadata, the actual lat/long, and the 9 grid squares.  It's pretty simple:

    sale = SaleLocation(longitude = float(longitude),
                        lattitude = float(lattitude),
                        address = address,
                        surroundings = grid).put()
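
The model behind this is nothing fancy; something along these lines, with surroundings as a StringListProperty (that's the part that makes the equality search below work):

    from google.appengine.ext import db

    class SaleLocation(db.Model):
        longitude    = db.FloatProperty()
        lattitude    = db.FloatProperty()
        address      = db.StringProperty()
        surroundings = db.StringListProperty()   # the nine grid-square strings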

When I do a search on a new long/lat, I figure out the grid square that contains that point, and then I can simply search like so:

   points = db.GqlQuery("SELECT * FROM SaleLocation WHERE surroundings=:1", myPoint)

Because of the way list properties are matched (an equality filter matches if any element of the list equals the value), if your search point's grid square is one of the nine squares stored with a point, that stored point will be returned.

I chose 0.1 degrees because it's mathematically convenient, and nine grid squares because it's easy, but this could easily be modified or extended to any shape or size.  Once you've got your results back, you can cut with a finer blade still, only displaying points, say, that fall within a specific circular radius, or some other permutation.  But this resolves the problem.
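
For example, a rough circular cut using the haversine formula might look like this (within_radius is just an illustrative helper, not part of the app above):

    import math

    def within_radius(points, lat, lon, miles):
        # keep only points within `miles` of (lat, lon), by great-circle distance
        R = 3959.0                                 # earth radius in miles
        keep = []
        for p in points:
            dlat = math.radians(p.lattitude - lat)
            dlon = math.radians(p.longitude - lon)
            a = (math.sin(dlat / 2) ** 2 +
                 math.cos(math.radians(lat)) * math.cos(math.radians(p.lattitude)) *
                 math.sin(dlon / 2) ** 2)
            if 2 * R * math.asin(math.sqrt(a)) <= miles:
                keep.append(p)
        return keep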

This solution can be applied to other range problems as well, to avoid using inequalities.  If you have items in a database that have creation and expiration dates, stop thinking about bounds, and instead create a list of all of the dates they are valid, to whatever degree of accuracy is required.  That way, instead of searching for items whose creation date is before today and whose expiration date is after it, you just do an equality search for items whose list of valid dates contains today.
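
A sketch of that idea, with a made-up Event model (the names are hypothetical, but the trick is the same as the grid list above):

    import datetime
    from google.appengine.ext import db

    class Event(db.Model):
        title       = db.StringProperty()
        valid_dates = db.StringListProperty()    # one 'YYYY-MM-DD' string per live day

    def save_event(title, created, expires):
        # created and expires are datetime.date objects; store every day in between
        days = (expires - created).days
        dates = [(created + datetime.timedelta(d)).isoformat()
                 for d in range(days + 1)]
        Event(title=title, valid_dates=dates).put()

    # "what's live today?" is then a single equality filter, no inequalities needed
    today = datetime.date.today().isoformat()
    live = db.GqlQuery("SELECT * FROM Event WHERE valid_dates = :1", today)
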
It's sort of a brute force option, but I get the feeling that this is how the Google search engine works.  Lots and lots of cheap data, divided up across lots of infrastructure; all the work is done in creating the entity so that whatever way you're likely to search for it, it'll come back super fast, with little expense on the back end.

It makes sense, too.  In many applications, the information is posted once with the intention of being read many (tens, hundreds, thousands of) times.  Spending a little extra time on the back end to figure out how it'll be read fastest, and a few extra bits to make it easy to find, pays off big time on the front end.
