Thoughts on spatial data warehousing

I’ve been thinking quite a bit lately about how to store spatial data. It’s something I’ve covered here and my attitude towards this topic has evolved over the years.

The organisations I’ve worked in have mountains of spatial data accumulated over the years. The data is stored in shapefiles, geodatabases, normal databases, spreadsheets, documents, reports, photos… Why is it this way? It doesn’t have to be this way. It shouldn’t be this way!

While researching a topic for next year’s project, I’ve homed in on the methods for implementing an enterprise geoportal within an existing spatial data infrastructure. However, I feel like my focus is shifting to the data that the geoportal is trying to expose to a larger audience.

The concepts of a spatial data warehouse (SDW) and a spatially enabled operational data store (S-ODS) have been intriguing me. A regular GIS task involves comparing spatial data across a time period, analysing trends and presenting the results in a map or report. Why aren’t we storing this historical data in an SDW that’s optimised for reporting?

Non-spatial data can come from a variety of sources as well – spreadsheets, other databases etc. Another common GIS task is to spatially enable these datasets. Why are we not storing the outputs in a spatially enabled operational data store in an open format like GML?

I think it’s because planning and implementing an SDW/S-ODS takes time (and money). With a normal enterprise data warehouse (EDW), the organisation does not need much convincing to see the benefit of implementing one. “Spatial” is still seen as an “add-on”, or a “nice-to-have”.

The issue with names

I recently underwent a name change, and though I have yet to make it official (who wants to waste a Saturday at Home Affairs?), I have been thinking about the implications of my name change.

Now that my surname contains a hyphen and is 18 characters long (with my full name now 26 characters), I’ve been wondering how I should abbreviate it. Some hasty Googling shows that there is no standard for this. My entire life I just assumed that the first part of the surname takes precedence so the initials remain the same. In other words, Cindy Lee Williams (CLW) becomes Cindy Lee Williams-Jayakumar (CLW).

I toyed around with the idea of dropping my middle name (itself having been an issue with people assuming I’m Cindy-Lee and not Cindy Lee) to become Cindy Williams-Jayakumar (CW), but the thought of having only two initials terrified me.

I came across this blog post which calls out the assumptions programmers make when building systems which need to accept names (I’m guessing that’s about 95% of all systems). Now that my name has become slightly more complicated, I’m going to be more aware of my own assumptions when writing code, and not just when it comes to validating names.

I’ve also decided to be a bit more difficult and use CWJ as my initials. I had CLW for 27 years, it was time for a change.
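
To make the ambiguity concrete, here is a minimal Python sketch of the two conventions discussed above. The function name `initials` and its `split_hyphens` flag are made up for illustration, and the sketch itself assumes that names are separated by spaces and that every part starts with a letter, which is exactly the kind of assumption that post warns about.

```python
def initials(full_name, split_hyphens=False):
    """Return upper-case initials for a space-separated name.

    Hypothetical helper: with split_hyphens=False only the first part of a
    hyphenated name contributes a letter; with split_hyphens=True every
    hyphen-separated part contributes one.
    """
    parts = full_name.split()
    if split_hyphens:
        # Break "Williams-Jayakumar" into ["Williams", "Jayakumar"].
        parts = [piece for part in parts for piece in part.split("-") if piece]
    return "".join(part[0].upper() for part in parts)


print(initials("Cindy Lee Williams-Jayakumar"))
print(initials("Cindy Williams-Jayakumar", split_hyphens=True))
```

Neither result is the "right" one; the point is that a system has to pick a rule, and the person whose name it is may have picked a different one.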

Database access via Python

In my ongoing quest to do absolutely everything through Python, I’ve been looking a lot lately at manipulating databases. I’ve been using arcpy to access GIS databases for years, and last year I finally got around to using pyodbc (and pypyodbc) for accessing SQL Server databases.

Now that I’m in an Oracle environment, Oracle has provided the cx_Oracle library to directly connect to databases. I have yet to test that though. What I’m interested in at the moment is creating and accessing databases for personal use.

I considered MongoDB for a while, but I don’t think I want to go NoSQL yet. This is why I have been experimenting with SQLite (through the sqlite3 library), as it is included in the Python install, and has the delightful SpatiaLite extension. Its slogan goes against one of my mottos (Spatial is Special) while supporting my other motto (Everything is Spatial).
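
As a rough idea of what a personal database looks like with nothing but the standard library, here is a minimal sqlite3 sketch. The `places` table, its columns and the two helper functions are hypothetical, and the coordinates are approximate sample values. It does not load SpatiaLite (that needs the extension installed separately), so the "spatial" query is only a plain bounding-box filter on longitude/latitude columns.

```python
import sqlite3


def build_places_db():
    """Create an in-memory database with a small, hypothetical places table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE places ("
        "name TEXT PRIMARY KEY, lon REAL NOT NULL, lat REAL NOT NULL)"
    )
    # Approximate coordinates, for illustration only.
    conn.executemany(
        "INSERT INTO places (name, lon, lat) VALUES (?, ?, ?)",
        [
            ("Cape Town", 18.42, -33.92),
            ("Johannesburg", 28.05, -26.20),
            ("Durban", 31.02, -29.86),
        ],
    )
    conn.commit()
    return conn


def places_in_bbox(conn, min_lon, min_lat, max_lon, max_lat):
    """Return the names of places inside a lon/lat bounding box, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM places "
        "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ? "
        "ORDER BY name",
        (min_lon, max_lon, min_lat, max_lat),
    )
    return [name for (name,) in rows]


conn = build_places_db()
print(places_in_bbox(conn, 25.0, -31.0, 32.0, -25.0))
```

Swapping `":memory:"` for a file path gives a single-file database that can be copied around like any other document, which is most of the appeal for personal use.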

A rant about utilisation

Last week I posted a script that easily extracted a series of repeating tables from Word to Excel using my favourite programming language of the last 4 years, Python. I’d like to expand on the last paragraph I wrote:

It took about 15 minutes to write (had to play around with accessing the table elements correctly) and less than a minute to extract the data. That’s the amount of time it would have taken to copy 5 of the tables manually. At that pace it would have taken about 4 days to complete the process.

I was quite irritated when I wrote this script, and part of the reason is the same thing that has me railing against utilisation as a metric for billing. The person who requested this task probably reckoned it would take about a day for my former colleague to get the data into Excel. The actual time, based on my estimate above, would have been 4 days. In reality, it turned out to only be an hour’s work in total (my time and my colleague’s time). How do you bill that?

I would say split the difference and bill it as 2 days’ work – only an extra day on the expectation, while still 2 days short of the actual time it would have taken. This way one would be 2 days “ahead”, with time to do research, or catch up on other projects where the budget is low.

The catch with doing things that way is that you would need to keep track of when to submit work. If you give the work after an hour, but then book 2 days to the project, the next week the person who requested the work is either going to come question you, reject your hours, let it pass because you’ve done favours for them before, or not even pick up that the hours were booked because they aren’t doing the project management part of their job correctly. Guess which option happens most often?

What really happened of course is that my colleague only billed for that 1 hour, because the person requesting the work checked in after 2 hours to hear “if it’s done yet”. I’m no expert on what running a business or being a project manager should look like, but I think I have a good idea of what it shouldn’t look like.

What is the alternative to using the billable hour and utilisation as a measure? I don’t know, I didn’t study management and/or finance. This is just one example I have from a time when I was in a purely technical role, in a company where output was based on utilisation. I’m now in more of a hybrid role, where output is based on “did you do it before the deadline?”. I’ll be able to judge more clearly as time passes.

Professional registration – on my way

After compiling my application a year ago, and it finally arriving at Plato in September, I just received word that I can write the GIS Technologist exam later this year.

I submitted my application at the Practitioner level, but because I don’t have an Honours degree, I did not meet the academic requirement, although I far exceeded the work experience requirement. At least I’m getting that degree sorted out now, so I’ll be able to upgrade my registration in two years’ time.

Once I’ve written this exam, passed it (?) and gotten my certificate, I will officially be a GIS technologist and I’ll be officially allowed to “do GIS”. I guess the last 5 years of work experience and my GIS/Computer Science/Information Systems degree were unofficial then?

Is 2016 the year of location?

I came across an interesting post on The Next Web the other day. It asks if 2016 is the year of location. The premise of this blog is all about that – the fact that everything we do is tied to a location. Everything is spatial.

I have believed for quite a while now that as GIS grows and becomes more integrated into every domain, the role of the GIS professional will shift drastically. The “pure” GIS professional will be enterprise/programming focussed, supporting the domain GIS users who aren’t necessarily trained GIS users, but merely use GIS as part of their toolset.