Teaser »

FERDY CHRISTANT - JAN 20, 2012 (13:20:52)

JungleDragon specie engine, the basic UI »

FERDY CHRISTANT - JAN 14, 2012 (14:41:13)

In the last few updates concerning JungleDragon, I mentioned that I'm working on the specie engine, the part that integrates specie information from Wikipedia with JungleDragon photos. None of this is live yet, so you can't see it. There wasn't any development UI to demonstrate either; it was just me complaining about how tedious it is to get structured data out of Wikipedia.

That is still true, and my struggles in that area continue, but I do want to share some first UI work on the specie engine. The scenario is simple: you have uploaded a photo and are asked to identify the specie in the photo. For that there is an "Add specie" button, which brings up this dialog:

Since multiple species can appear in a single photo, you can add more than one specie, but you add them one by one. As the dialog states, you can search both by common name (e.g. "Polar bear") and by Latin name (e.g. "Ursus maritimus").

As you type a specie name, the list helps you with suggestions. These suggestions cover species already known to JungleDragon, meaning they have been used before. I do not have a database with all species. Instead, when you add a specie not yet known to JungleDragon, it becomes a known specie from that point on.

What makes a valid specie? Here are the current rules:

  • There must be an English Wikipedia page for your query, or a redirect to such a page
  • That page in particular must be a specie page, meaning:
    • It has the "taxobox" on the right
    • It has to be a specie or a subspecie, meaning it has either the "binomial" or "trinomial" name property. For example, "Bear" is not a specie, but "Brown bear" is.
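The rules above can be sketched as a small check on a page's raw wikitext. This is only a minimal sketch, not the actual JungleDragon code: the function name `is_specie_page` and the two sample snippets are hypothetical, and it assumes the page source has already been fetched and any redirect resolved.

```python
import re

# A "binomial = ..." or "trinomial = ..." taxobox property with a non-empty value.
NAME_PROPERTY = re.compile(
    r"^[ \t]*\|[ \t]*(binomial|trinomial)[ \t]*=[ \t]*\S",
    re.IGNORECASE | re.MULTILINE,
)


def is_specie_page(wikitext: str) -> bool:
    """Return True if the wikitext looks like a specie or subspecie page."""
    has_taxobox = "{{taxobox" in wikitext.lower()
    has_name = NAME_PROPERTY.search(wikitext) is not None
    return has_taxobox and has_name


# Simplified, hypothetical page sources for illustration only.
BROWN_BEAR = """{{Taxobox
| name = Brown bear
| genus = Ursus
| binomial = ''Ursus arctos''
}}"""

BEAR = """{{Taxobox
| name = Bear
| familia = Ursidae
}}"""

print(is_specie_page(BROWN_BEAR))
print(is_specie_page(BEAR))
```

The "Bear" sample has a taxobox but no binomial or trinomial name, so it is rejected as not specific enough, while the "Brown bear" sample passes both checks.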

Ok, given that you entered a valid specie name, one of two things will happen:

  • If the specie is known to JungleDragon already, it is instantly associated with a photo.
  • If the specie is not known to JungleDragon, yet it is a valid specie, I will parse it from Wikipedia in real-time, which takes a few seconds. A loading indicator will make this clear. From that point on, it is a known specie to JungleDragon.
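The two outcomes can be pictured as a look-up with a slow fallback. The sketch below is an assumption about the general shape only: `known_species`, `add_specie` and the `parse_from_wikipedia` callback are hypothetical names, and an in-memory dictionary stands in for the real database.

```python
from typing import Callable, Dict

# Stand-in for the table of species already known to the site.
known_species: Dict[str, dict] = {}


def add_specie(query: str, parse_from_wikipedia: Callable[[str], dict]) -> dict:
    """Return the specie for a query, parsing it only the first time it is seen."""
    key = query.strip().lower()
    if key in known_species:
        # Already known: can be associated with the photo instantly.
        return known_species[key]
    # Not known yet: the slow path that takes a few seconds in practice.
    specie = parse_from_wikipedia(query)
    known_species[key] = specie
    return specie
```

The second request for the same name never reaches the parser, which is why only the first person to add a specie sees the loading indicator.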

So, that's how the "Add specie" dialog works. It's how you identify a specie on a photo. Once I know that relationship, I can visualize rich specie information next to such a photo. Here's a very early preview:

Check out the sidebar on the right. This photo of an Impala has been associated with the specie Impala, and as a result, it shows the common name, binomial name, description, and range map. 

Be aware that this is just a simple start. I have a lot more data about the specie and I can also visualize it any way I like. Take note of the concept though. This is what JungleDragon v2 is all about: instantly learning about what is in the photo. And of course, later on you can click through on the specie name, which will show a full page with everything there is to know about it.

Wiki parsing engine updates

I need to reserve some room in this post once again for self-pity. To complain about parsing Wikipedia. The overall complaint is that each time I extend my test set of specie queries, I find new problems, new ways in which Wikipedia pages are structured that my engine cannot handle yet. It's one step forward, two steps back. Here are two recent situations:

  • I've been relying on the taxobox on a specie page to parse the species' taxonomy. Finally I had my engine robust enough to deal with the unlimited ways in which that taxobox can be structured: levels in the taxonomy can be there or not, the number of levels varies, the spelling of levels varies, and the value of a level can be plain text or contain any Wiki markup. Then I discovered yesterday that some specie pages do not use a taxobox at all; they use an "automatic" taxobox, a complex variation based on specie keys.
  • Another problem that came out of testing is this. Say you have a photo of an African Elephant. In the "Add specie" dialog the instructions are clear: "Elephant" will not work as it is not a specie, and thereby not specific enough. So you try "African Elephant". You'd expect this to be specific enough, unless you're a zoologist. See, in this case, even "African Elephant" is not enough. It's not a specie; instead "African Bush Elephant" is the correct specie, according to science and according to Wikipedia. But probably not according to you. For these situations, I have therefore implemented a routing mechanism. It's a manual table that I maintain in which I map a source, in this case a commonly failing yet well-intended query, to a valid target specie. So if you type "African Elephant", I will map it to "African Bush Elephant" for you. Over time I'm hoping this table will give you a better chance at the result you expect.
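The routing mechanism boils down to a look-up that runs before the Wikipedia query. Here is a minimal sketch of that idea; the names `ROUTES` and `route_query` are hypothetical, and the case-insensitive matching is my own assumption rather than a documented detail.

```python
# Manually maintained table: well-intended but failing query -> valid target specie.
ROUTES = {
    "african elephant": "African Bush Elephant",
}


def route_query(query: str) -> str:
    """Map a commonly failing query to its target, or return it unchanged."""
    # Collapse repeated whitespace and ignore case before looking up the route.
    normalized = " ".join(query.split()).lower()
    return ROUTES.get(normalized, query)


print(route_query("African Elephant"))
print(route_query("Impala"))
```

Queries without a route pass through untouched, so the table only has to list the known trouble cases.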

Without a doubt, there will be dozens more problems coming my way. But I will persist through them, because I dearly believe in the concept. No matter what it takes, this will get done, and it will be done right.

Google Analytics in Real-time »

FERDY CHRISTANT - JAN 10, 2012 (19:19:53)

I love Google Analytics. It is astonishing to have such an incredibly advanced statistics tool at one's disposal for free. I also believe most owners of small to midsize websites do not get the most out of it, and I'm certainly guilty of that. That's why I first want to briefly revisit two earlier posts concerning features you may not expect in Google Analytics:

Measuring your site's speed as seen by your visitors

Change a single line of code in your GA tracking JavaScript and GA will then sample your site's loading speed. This is a big deal. You can easily see how your site performs across different bandwidths, locations, browsers, any dimension you like. There are enterprise solutions charging you tons for such a service. In GA it's free, and all you need to do is add a single line of code.

In-page Analytics

My absolute favorite. We know we have tons of metrics available in GA, but it's hard to bring them down to meaningful conclusions. In-page analytics changes that. You see your page as you designed it, with the click-through data, amongst other metrics, visually attached to it.

This is the first-ever easy way to test your design. Recently I wanted to redesign parts of the navigation of a site, yet I was worried that users would be confused, as some options would be relabeled or removed altogether. Until I learned that the option in question was hardly ever used at all. It is quite powerful to make design decisions based on data, rather than gut feel, personal preference or emotion. This also seriously strengthens your position amongst stakeholders, in case you have to defend a design decision.

Real-time analytics

Ok, so those were the earlier posts. They are game changers, please check them out in detail.

The new one, which actually has been available for several weeks now, is Google Analytics realtime. Where all of GA's former metrics had data with a delay of about one day, realtime analytics shows you those metrics as they occur. Perhaps you tried to launch a viral campaign in social media; this way you can check the actual effect as it happens. You can also learn about traffic patterns related to timezones.

The opening screenshot shows a glimpse of the realtime view, with the currently active visitors, their locations, target pages and more. Truth be told, though, this is only really interesting if you have critical mass, meaning a large website with many visitors.

JungleDragon specie engine update »

FERDY CHRISTANT - JAN 7, 2012 (09:56:13)

As you may know, I'm currently working on the integration of specie data from Wikipedia into JungleDragon, as part of JungleDragon v2. It is hard, complex, invisible work. But I'm making great progress. There's no UI to demonstrate yet, but allow me to update you on the progress of the back-end.

Finding a specie

I can now do this:

$specie = $this->ZoologyManager->FindSpecie("Impala");

Which will result in:

  • A check if there is a Wikipedia page for this entry
  • If there is an entry, yet it is a Wikipedia redirect page, I will resolve the redirect
  • Assuming the page exists, a check is made to see if it is a specie page:
    • It has the taxonomy box to the right
    • The page is at the taxonomy level species or subspecies. In practice this means the property "binomial" or "trinomial" name must be present. Anything less specific is no match.
  • All checks passed, parse the specie page. This is the most complex and fragile part of the process:
    • Parse the taxonomy information. This is implemented in a flexible, forgiving way, since multiple taxonomy systems exist and are used, the number of levels varies between pages and even the naming of properties varies between pages.
    • Parse the rest of the page, which can contain any Wiki markup. I try to find selected blocks of information by their headers. Next I parse, strip and transform this content into a form I find suitable for JungleDragon storage and display. This is a best-effort approach, but it's working reasonably well currently.
  • Normalize the parsed elements of both the taxonomy information and the free format text into a specie object that JungleDragon understands.

This is a long way of saying that with a single specie search string, I get back a structured object of that specie, containing very rich information. That's quite a powerful API.
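To give an idea of what "forgiving" taxonomy parsing can look like, here is a minimal sketch. It is not the real engine: the `Specie` class, the `LEVEL_ALIASES` table, the helper names and the simplified Impala sample are all hypothetical, and real pages contain far more markup variations than the two handled here.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical alias table: pages may name the same level differently.
LEVEL_ALIASES = {
    "regnum": "kingdom", "kingdom": "kingdom",
    "classis": "class", "class": "class",
    "ordo": "order", "order": "order",
    "familia": "family", "family": "family",
    "genus": "genus",
}


@dataclass
class Specie:
    binomial: str = ""
    taxonomy: Dict[str, str] = field(default_factory=dict)


def strip_markup(value: str) -> str:
    """Reduce wiki links and bold/italic quotes to plain text."""
    # [[Target|Label]] -> Label, [[Target]] -> Target
    value = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", value)
    return value.replace("'''", "").replace("''", "").strip()


def parse_taxobox(wikitext: str) -> Specie:
    """Collect whatever known levels are present; skip anything else."""
    specie = Specie()
    for line in wikitext.splitlines():
        match = re.match(r"\s*\|\s*(\w+)\s*=\s*(.*)", line)
        if not match:
            continue
        key, value = match.group(1).lower(), strip_markup(match.group(2))
        if not value:
            continue  # a level may be listed but left empty
        if key == "binomial":
            specie.binomial = value
        elif key in LEVEL_ALIASES:
            specie.taxonomy[LEVEL_ALIASES[key]] = value
    return specie


# Simplified, hypothetical taxobox source for illustration only.
IMPALA = """{{Taxobox
| name = Impala
| regnum = [[Animal]]ia
| ordo = [[Even-toed ungulate|Artiodactyla]]
| familia = [[Bovidae]]
| genus = '''''Aepyceros'''''
| binomial = ''Aepyceros melampus''
}}"""

specie = parse_taxobox(IMPALA)
print(specie.binomial, specie.taxonomy)
```

Missing levels simply do not show up in the `taxonomy` dictionary, which is one way to cope with the number of levels differing per page.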

Saving a specie

Assuming we found a valid specie, next I can do this:

$this->ZoologyManager->SaveSpecie($specie);

Which will result in:

  • A new specie record being created in my database, if it did not exist already. This record contains both the taxonomy information of that specie as well as the normalized text blocks.
  • In a separate table I will update specie synonyms. For example, there can be multiple search strings matching a single specie: "Polar Bear" and "Ursus maritimus" are the same specie, and I maintain that relation to avoid duplicate specie records.
  • If the specie record contains any specie classifications not yet known to JungleDragon (division, kingdom, class, order, family, genus), they are saved in a separate table. This table is loosely designed, given that the number of levels per specie is not consistent in Wikipedia, and that there is no guarantee that keys match exactly by string.
  • A very cool one: If the Wikipedia specie page has a range map (a graphic showing the distribution of the specie on a map), I will download it, associate it with a specie, and then transfer it to Amazon S3, just like all JungleDragon photos.
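The synonym idea from the list above can be sketched with two small tables. This is an illustration under assumptions, not the real schema: the table and column names and the `save_specie` signature are hypothetical, and SQLite (from Python's standard library) stands in for MySQL to keep the example self-contained.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE specie (
            id INTEGER PRIMARY KEY,
            binomial TEXT NOT NULL UNIQUE
        );
        CREATE TABLE specie_synonym (
            search_string TEXT PRIMARY KEY,
            specie_id INTEGER NOT NULL REFERENCES specie(id)
        );
    """)


def save_specie(conn: sqlite3.Connection, binomial: str, search_string: str) -> int:
    """Create the specie if needed and remember the search string that found it."""
    # The UNIQUE constraint plus OR IGNORE prevents duplicate specie records.
    conn.execute("INSERT OR IGNORE INTO specie (binomial) VALUES (?)", (binomial,))
    specie_id = conn.execute(
        "SELECT id FROM specie WHERE binomial = ?", (binomial,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO specie_synonym (search_string, specie_id) VALUES (?, ?)",
        (search_string.lower(), specie_id),
    )
    return specie_id


conn = sqlite3.connect(":memory:")
create_schema(conn)
first = save_specie(conn, "Ursus maritimus", "Polar Bear")
second = save_specie(conn, "Ursus maritimus", "Ursus maritimus")
print(first == second)
```

Both search strings end up pointing at the same specie record, so a later search by either name resolves without creating a duplicate.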

Summary

So, progress has been great. The API is powerful and the building blocks are starting to come together. Without a doubt, the actual logic of this back-end will be tweaked many times due to the inconsistency of Wikipedia, but I'm making big steps forward. It's tedious work, but once this is integrated into the UI, you will see why it's worth it.

Error 139 »

FERDY CHRISTANT - JAN 4, 2012 (20:39:56)

I just wanted to throw this oddity out here, should you encounter this yourself. I'm talking about the rather cryptic MySQL error 139.

First, my situation. I'm using MySQL to power JungleDragon. As you know, one can choose the storage engine per table, but usually I go for InnoDB, which allows for foreign keys and some other trickery. In this case, I was creating a new table called "specie". It's a bit of an odd table. It has a few regular fields, but it also needs to store very large text blocks that roughly vary in size between 3 and 5K. In characters, that is. Since I'm using UTF-8, where a character can take more than one byte, the storage size in bytes can be larger, possibly double, so between 6 and 10K. Each text block is a separate column and there are 32 of them. As column data type I use MediumText, which allows up to 16MB of data to be stored.

You could question this table design, but for now, let's put that aside. Given that I've selected data types that allow for plenty of room, I did not expect problems when inserting records into this table. Yet I've hit error 139.

Upon further investigation, I found that this error occurs when the maximum row size limit is exceeded. This limit is about 8K per row. That sounds a bit silly given that we can choose column data types that far exceed that number. As you may know, MySQL doesn't store TEXT or BLOB data inside the rows; it stores them separately.

Well, not really. With InnoDB, MySQL stores the first 768 bytes of each TEXT or BLOB column inside the row, even when the actual text far exceeds that size. So in my case, 32 TEXT columns x 768 bytes = 24576 bytes, which is far beyond the 8000-byte limit. The solution, other than partitioning the table? Switching the storage engine of the table to MyISAM, which has a row size limit of 64KB.

So, the solution is simple, if you can live with the limitations of MyISAM (I can in this case). The reason I am writing this is to point out that several sources on the web are wrong in stating that the 8000-byte limit is a MySQL limit. In reality, it's a storage engine limit.
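The row-size arithmetic is easy to check for any number of columns. A minimal sketch using only the two figures quoted in this post (768 bytes stored in the row per TEXT/BLOB column, and a limit of roughly 8000 bytes per row); the function names are hypothetical, and regular columns and row overhead are ignored.

```python
INLINE_PREFIX_BYTES = 768     # part of each TEXT/BLOB column kept inside the row
ROW_SIZE_LIMIT_BYTES = 8000   # approximate InnoDB row size limit from the post


def inline_bytes(text_columns: int) -> int:
    """Worst-case bytes stored inside the row for the given number of TEXT columns."""
    return text_columns * INLINE_PREFIX_BYTES


def max_full_text_columns() -> int:
    """How many fully used TEXT columns fit before the row limit is hit."""
    return ROW_SIZE_LIMIT_BYTES // INLINE_PREFIX_BYTES


print(inline_bytes(32))                          # 24576
print(inline_bytes(32) > ROW_SIZE_LIMIT_BYTES)   # True
print(max_full_text_columns())                   # 10
```

So under these assumptions the trouble starts from the eleventh TEXT column that actually holds 768 bytes or more, long before reaching 32.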
