JungleDragon species engine update

FERDY CHRISTANT - JAN 7, 2012 (09:56:13 AM)

As you may know, I'm currently working on integrating species data from Wikipedia into JungleDragon, as part of JungleDragon v2. It is hard, complex, invisible work, but I'm making great progress. There's no UI to demonstrate yet, but allow me to update you on the progress of the back-end.

Finding a species

I can now do this:

$specie = $this->ZoologyManager->FindSpecie("Impala");

Which will result in:

  • A check whether a Wikipedia page exists for this entry
  • If the entry exists but is a Wikipedia redirect page, the redirect is resolved
  • Assuming the page exists, a check is made to see if it is a species page:
    • It has the taxonomy box to the right
    • The page is at the taxonomy level species or subspecies. In practice this means the property "binomial" or "trinomial" must be present. Anything less specific is no match.
  • If all checks pass, the species page is parsed. This is the most complex and fragile part of the process:
    • Parse the taxonomy information. This is implemented in a flexible, forgiving way, since multiple taxonomy systems exist and are in use, the number of levels varies between pages, and even the naming of properties varies between pages.
    • Parse the rest of the page, which can contain any Wiki markup. I try to find selected blocks of information by their headers, then parse, strip and transform this content into a form suitable for JungleDragon storage and display. This is a best-effort approach, but it's working reasonably well currently.
  • Normalize the parsed elements of both the taxonomy information and the free-format text into a species object that JungleDragon understands.
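
To make the checks above a bit more concrete, here is a minimal sketch of the redirect and "is this a species page" tests. It is not the actual JungleDragon code. It assumes the raw wikitext of the page has already been fetched, it only recognizes the classic {{Taxobox}} template with one "| key = value" pair per line, and the function names redirectTarget, taxoboxFields and isSpeciesPage are made up for illustration.

```php
<?php
// Hypothetical helpers: these names do not come from JungleDragon.

/** Returns the redirect target of a wikitext page, or null if it is not a redirect. */
function redirectTarget(string $wikitext): ?string
{
    // Redirect pages start with "#REDIRECT [[Target]]" (case-insensitive).
    if (preg_match('/^\s*#REDIRECT\s*\[\[([^\]|#]+)/i', $wikitext, $m)) {
        return trim($m[1]);
    }
    return null;
}

/** Extracts "| key = value" pairs from the first {{Taxobox}} template. */
function taxoboxFields(string $wikitext): array
{
    // Simplification: locate the template start, then read one pair per line.
    if (!preg_match('/\{\{\s*Taxobox\b/i', $wikitext, $m, PREG_OFFSET_CAPTURE)) {
        return [];
    }
    $fields = [];
    foreach (preg_split('/\R/', substr($wikitext, $m[0][1])) as $line) {
        if (preg_match('/^\s*\}\}/', $line)) {
            break; // a line starting with "}}" closes the template
        }
        if (preg_match('/^\s*\|\s*([A-Za-z0-9_ ]+?)\s*=\s*(.*)$/', $line, $kv)) {
            // Lower-case the key and drop the '' italic markup from the value.
            $fields[strtolower($kv[1])] = trim(str_replace("''", '', $kv[2]));
        }
    }
    return $fields;
}

/** A page counts as a species page only if a binomial or trinomial name is present. */
function isSpeciesPage(string $wikitext): bool
{
    $fields = taxoboxFields($wikitext);
    return !empty($fields['binomial']) || !empty($fields['trinomial']);
}

$page = "{{Taxobox\n| name = Impala\n| genus = ''Aepyceros''\n"
      . "| binomial = ''Aepyceros melampus''\n}}\nThe impala is an antelope.";
var_dump(isSpeciesPage($page)); // bool(true)
```

A genus-level page, which has a taxonomy box but no binomial or trinomial property, would be rejected by the same test.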

This is a long way of saying that from a single search string, I get back a structured species object containing very rich information. That's quite a powerful API.
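
As an illustration of the "find blocks by their headers, then strip the markup" step, here is a rough sketch. It is an assumption about how this could be done, not the real parser: it only splits on level-2 headers (== Header ==), only strips links, bold/italic quotes and ref tags, and the function names splitSections and stripMarkup are hypothetical.

```php
<?php
// Hypothetical helpers: a simplified stand-in for the free-text parsing step.

/** Splits wikitext into blocks keyed by their lower-cased level-2 header. */
function splitSections(string $wikitext): array
{
    // PREG_SPLIT_DELIM_CAPTURE keeps each captured header title in the result:
    // [lead, title1, body1, title2, body2, ...]
    $parts = preg_split(
        '/^==\s*([^=].*?)\s*==\s*$/m',
        $wikitext,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    $sections = [];
    for ($i = 1; $i + 1 < count($parts); $i += 2) {
        $sections[strtolower($parts[$i])] = trim($parts[$i + 1]);
    }
    return $sections;
}

/** Strips a few common kinds of wiki markup, leaving plain text. */
function stripMarkup(string $text): string
{
    // Remove <ref .../> and <ref>...</ref> footnotes.
    $text = preg_replace('/<ref[^>]*\/>|<ref[^>]*>.*?<\/ref>/s', '', $text);
    // [[target|label]] becomes label, [[target]] becomes target.
    $text = preg_replace('/\[\[(?:[^\]|]*\|)?([^\]]+)\]\]/', '$1', $text);
    // Drop ''italic'' and '''bold''' quote markup.
    $text = preg_replace("/'{2,}/", '', $text);
    return trim($text);
}

$wikitext = "Lead text.\n"
          . "== Description ==\n"
          . "The '''impala''' is a medium-sized [[antelope]].<ref>Source</ref>\n"
          . "== Habitat ==\n"
          . "Found in [[savanna|savannas]] of Africa.\n";

$sections = splitSections($wikitext);
echo stripMarkup($sections['description']), "\n";
// The impala is a medium-sized antelope.
```

Real pages contain templates, tables and nested markup that this sketch ignores, which is exactly why a best-effort approach is needed.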

Saving a species

Assuming we found a valid species, next I can do this:

$this->ZoologyManager->SaveSpecie($specie);

Which will result in:

  • A new species record being created in my database, if it did not exist already. This record contains both the taxonomy information of that species and the normalized text blocks.
  • In a separate table I update species synonyms, since multiple search strings can match a single species. "Polar Bear" and "Ursus Maritimus" are the same species, and I maintain that relation to avoid duplicate species records.
  • If the species record contains any classifications not yet known to JungleDragon (division, kingdom, class, order, family, genus), they are saved in a separate table. This table is loosely designed, given that the number of levels per species is not consistent in Wikipedia and there is no guarantee that keys match exactly by string.
  • A very cool one: if the Wikipedia species page has a range map (a graphic showing the distribution of the species on a map), I download it, associate it with the species, and then transfer it to Amazon S3, just like all JungleDragon photos.
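
The synonym idea from the list above can be sketched as follows. This is an in-memory stand-in for the two database tables, written only to show how several search strings can point at one record. The class name SpeciesStore, its methods and the normalization rule (lower-casing and collapsing whitespace) are all assumptions, not the real schema or code.

```php
<?php
// Hypothetical in-memory stand-in for the species and synonym tables.
final class SpeciesStore
{
    /** @var array<int, string> species id => scientific name */
    private array $species = [];

    /** @var array<string, int> normalized search string => species id */
    private array $synonyms = [];

    /** Lower-cases and collapses whitespace so near-identical strings match. */
    private static function normalize(string $term): string
    {
        return strtolower(trim(preg_replace('/\s+/', ' ', $term)));
    }

    /** Returns the species id for a search string, or null if unknown. */
    public function find(string $term): ?int
    {
        return $this->synonyms[self::normalize($term)] ?? null;
    }

    /** Saves a species only once and links the search string to it as a synonym. */
    public function save(string $searchTerm, string $scientificName): int
    {
        $id = $this->find($scientificName);
        if ($id === null) {
            // First time we see this species: create the record.
            $id = count($this->species) + 1;
            $this->species[$id] = $scientificName;
            $this->synonyms[self::normalize($scientificName)] = $id;
        }
        // Every search string that led here becomes a synonym of the same record.
        $this->synonyms[self::normalize($searchTerm)] = $id;
        return $id;
    }

    /** Number of distinct species records. */
    public function count(): int
    {
        return count($this->species);
    }
}

$store = new SpeciesStore();
$first = $store->save('Polar Bear', 'Ursus maritimus');
$second = $store->save('Ursus Maritimus', 'Ursus maritimus');
var_dump($first === $second, $store->count()); // bool(true) int(1)
```

Because both search strings resolve to the same id, saving the second one adds a synonym but no duplicate record.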

Summary

So, progress has been great. The API is powerful and the building blocks are starting to come together. Without a doubt, the actual logic of this back-end will be tweaked many times due to the inconsistency of Wikipedia, but I'm making big steps forward. It's tedious work, but once this is integrated into the UI, you will see why it's worth it.
