I know there is a treasure
of unknown size somewhere
wish I could measure
wish I knew where
will the treasure just drop
off Santa's bag onto my lap
or do I have to go out of my way
and to carry it off into my lair
treasures are fun to have
and even great to possess
but the greatest fun to be had
is by truly finding out where.

Geoparsing has been attempted on and off in recent years. The now-defunct Yahoo Placenames was used to extract and disambiguate place names from text. Other notable defunct projects include BioGeomancer from Berkeley and the NGA GEOnet Names Server.
There are a few other current projects trying to do the same thing (GEOLocate from Tulane University, and CLAVIN, the Cartographic Location And Vicinity INdexer).
I don't know of any geoparser that goes beyond place names and also parses street addresses and street intersections. Maybe it does exist. Maybe not. Either way, I will build it.
Geo::Parser::Text interfaces with geocode.xyz, a geoparser written in Perl, to extract and disambiguate location information from text (locations expressed as combinations of street names, city names and locality names).
The Motivation?
I'd like to get a more detailed map of every location ever mentioned in literature. So I'm currently geoparsing Project Gutenberg (and a few other sites kind enough to offer free books for download).
Books are rich with actual location information (unless they are fantasy books mentioning fantasy locations, such as Santa's lair in the North Pole). So are tweets, microblogs, chats, and more.
If I search for "Which books mention Rue Saint-Jacques in Paris?" I'll get a mixed bag of answers, some good, some irrelevant. I will have the complete list soon; so far, among the books I have parsed, this location is mentioned in The Count of Monte Cristo and Madame Bovary.
Nothing so far on the North Pole. Stay tuned.
Finding Where in What
People keep reporting on what they do and sometimes we need to know where. Knowing for certain, now that is a problem nobody quite knows how to solve.
Geo::Parser::Text is just an interface. The NLP module behind geocode.xyz does the heavy lifting.
#!/usr/bin/perl
use strict;
use Geo::Parser::Text;
use Data::Dumper;
my $g = Geo::Parser::Text->new('http://geocode.xyz');
my $str = qq("
Why, my dear boy, when a man has been proscribed by the mountaineers,
has escaped from Paris in a hay-cart, been hunted over the plains of Bordeaux by Robespierre's bloodhounds, he becomes accustomed to most things.
But go on, what about the club in the Rue Saint-Jacques?"
"Why, they induced General Quesnel to go there, and General Quesnel, who quitted his own house at nine o'clock in the evening, was found the next day in the Seine."
);
# Strict mode (the default) returns only the top matches, using context-aware parsing (which may be slower).
my $ref = $g->geocode(scantext => $str);
print Dumper $ref;
And the response comes out as:
$VAR1 = {
'match' => {
'longt' => '2.343',
'confidence' => '0.7',
'MentionIndices' => '258,89,84',
'latt' => '48.8463',
'location' => 'RUE SAINT-JACQUES, PARIS, FR'
}
};
Only one location is returned in this example (strict mode is the default). It could, however, match many more in nostrict mode. It will perhaps also match the river Seine, and a lot of other "locations" which are not locations, in this context at least:
my $ref = $g->geocode(scantext => $str, nostrict => 1);
The ambiguity-rich response might be:
perl geo.pl
$VAR1 = {
'match' => [
{
'confidence' => '1',
'MentionIndices' => '142,84',
'latt' => '44.84122467294317',
'longt' => '-0.5819759504279641',
'location' => 'Bordeaux, FR'
},
{
'confidence' => '0.9',
'longt' => '2.321780662038148',
'location' => 'Paris, FR',
'latt' => '48.8585406663244',
'MentionIndices' => '89,84'
},
{
'MentionIndices' => '89,131',
'location' => 'Paris, PL',
'longt' => '18.623269999999998',
'latt' => '53.01594',
'confidence' => '0.2'
},
{
'longt' => '13.6377402411685',
'latt' => '50.50466656250016',
'location' => 'most, CZ',
'MentionIndices' => '206',
'confidence' => '0.15'
},
{
'confidence' => '0.1',
'MentionIndices' => '206,9',
'latt' => '51.767255',
'longt' => '12.27308',
'location' => 'most, DE'
},
{
'confidence' => '0.1',
'MentionIndices' => '206',
'longt' => '15.14741',
'location' => 'most, SI',
'latt' => '45.96118'
},
{
'confidence' => '0.05',
'MentionIndices' => '104,340',
'longt' => '11.93452',
'latt' => '46.04092',
'location' => 'cart, IT'
},
{
'MentionIndices' => '122',
'latt' => '52.69396936842102',
'longt' => '-1.2728069473684216',
'location' => 'over, UK',
'confidence' => '0.05'
},
{
'confidence' => '0.05',
'MentionIndices' => '34',
'longt' => '7.42547',
'latt' => '46.94972',
'location' => 'been, CH'
},
{
'confidence' => '0.05',
'MentionIndices' => '131',
'location' => 'plains, UK',
'longt' => '-3.9319304347826085',
'latt' => '55.88144239130435'
},
{
'MentionIndices' => '363,76',
'longt' => '-8.851603846153846',
'latt' => '42.67637230769231',
'location' => 'nine, ES',
'confidence' => '0.05'
},
{
'location' => 'nine, PT',
'longt' => '-8.54254',
'latt' => '41.47052',
'MentionIndices' => '363',
'confidence' => '0.05'
},
{
'MentionIndices' => '84',
'longt' => '-2.32211',
'location' => 'from, UK',
'latt' => '51.22834',
'confidence' => '0.05'
},
{
'MentionIndices' => '246,340',
'longt' => '11.50476',
'latt' => '44.58116',
'location' => 'club, IT',
'confidence' => '0.05'
}
]
};
But then again, that's probably not what you want. (Did you notice there is a town named "club" in Italy? Or a city named "From" in the UK? Is that where you are from?)
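If you do run in nostrict mode, one way to tame the noise is to keep only the higher-confidence matches. The following is a minimal sketch, not part of Geo::Parser::Text: the helper names `matches_of` and `confident_matches` and the 0.5 cut-off are assumptions made for illustration. It relies only on the response shapes shown above, where `match` is a single hash in strict mode and an array of hashes in nostrict mode.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the matches of a scantext response as a list
# of hash references, whether 'match' holds one hash (strict mode) or an
# array of hashes (nostrict mode).
sub matches_of {
    my ($ref) = @_;
    my $m = $ref->{match};
    return () unless defined $m;
    return ref($m) eq 'ARRAY' ? @$m : ($m);
}

# Hypothetical helper: keep matches at or above a confidence threshold,
# highest confidence first. The 0.5 default is an arbitrary cut-off.
sub confident_matches {
    my ($ref, $min) = @_;
    $min = 0.5 unless defined $min;
    return sort { $b->{confidence} <=> $a->{confidence} }
           grep { $_->{confidence} >= $min } matches_of($ref);
}

# A trimmed copy of the nostrict response shown above.
my $ref = {
    match => [
        { confidence => '0.2',  location => 'Paris, PL'    },
        { confidence => '0.9',  location => 'Paris, FR'    },
        { confidence => '1',    location => 'Bordeaux, FR' },
        { confidence => '0.05', location => 'club, IT'     },
    ],
};

print "$_->{location} ($_->{confidence})\n" for confident_matches($ref);
```

With this sample, only Bordeaux and Paris (FR) survive the cut. Which threshold is sensible depends on the text, so treat the number as something to tune rather than a recommendation.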
But we want no ambiguity.
And if we want only one result (geocoding is just a special case of geoparsing), we do:
my $ref = $g->geocode(locate=>$str);
So a geocoding answer comes back with a bit more info on the location:
$VAR1 = {
'latt' => '48.84630',
'longt' => '2.34300',
'standard' => {
'prov' => 'FR',
'city' => 'Paris',
'addresst' => 'RUE SAINT JACQUES',
'confidence' => '0.10',
'postal' => '75005'
}
};
It gives you a breakdown of the elements (street, city, country) and sometimes provides the postal code.
This is probably what you want to do when you need to parse locations out of short text or Twitter feeds. Or if you want to parse locations out of Santa's letters so that the Post Office will deliver them.
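To turn such a response into something you could print on an envelope, a small formatter is enough. This is a sketch under the assumption that the response has the fields shown above (`latt`, `longt`, and a `standard` hash with `addresst`, `city`, `postal` and `prov`); the function name `format_location` is made up for this example and is not part of the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a one-line address from a locate response,
# skipping any element the geocoder did not return.
sub format_location {
    my ($ref) = @_;
    my $std   = $ref->{standard} || {};
    my @parts = grep { defined($_) && length($_) }
                @{$std}{qw(addresst city postal prov)};
    return sprintf '%s [%s, %s]',
        join(', ', @parts), $ref->{latt}, $ref->{longt};
}

# The geocoding response shown above.
my $ref = {
    latt     => '48.84630',
    longt    => '2.34300',
    standard => {
        prov       => 'FR',
        city       => 'Paris',
        addresst   => 'RUE SAINT JACQUES',
        confidence => '0.10',
        postal     => '75005',
    },
};

print format_location($ref), "\n";
```

Since the postal code is only sometimes provided, the formatter drops missing elements instead of printing empty fields.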
As exciting as geoparsing and geocoding are, nothing is more useful than batch geocoding. Just supply a file of locations (one per line) and get those locations plotted on a map.
Santa gets a lot of letters, such as this one (from http://cuyahogafallshistory.com/2014/12/1930s-dear-santa-claus/):
Dear Santa,
I am a little boy, six years old and I go to school. Please bring me a motor boat, a gun, a go and stop sign, a tractor, football, pair of roller skates, scooter, and house slippers and oranges and nuts. I am a real good little boy. Your little friend,
Junior Hall – 1084 Stroman Ave Akron, O
Suppose Santa keeps a text file of all these letters. All he needs to do is upload the file to a batch geocoding service to see where he needs to go to drop off his wares.
Junior Hall – 1084 Stroman Ave Akron
Bobbie Mitchell – RFD1 Cuyahoga Falls
3057 W. Bailey Road
... etc
Upload them and you get the map.
Santa's got his work cut out now.
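If you would rather stay in Perl than upload a file, the same idea can be approximated by looping over the lines and geocoding each one. The sketch below is not the batch service itself, and `batch_locate` is a made-up name. It takes the geocoder as a code reference so that it can be tried without a network connection; with the real client, that code reference could be `sub { $g->geocode(locate => $_[0]) }`, using the call shown earlier. The stub and its zero coordinates are placeholders, not real results.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: geocode free-form lines, one location per line.
# $locate is a code reference that takes a string and returns a response
# hash with 'latt' and 'longt' keys, like the locate response shown above.
sub batch_locate {
    my ($locate, @lines) = @_;
    my @points;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;        # trim surrounding whitespace
        next unless length $line;       # skip blank lines
        my $ref = eval { $locate->($line) };   # a failed call skips the line
        next unless ref($ref) eq 'HASH'
                 && defined $ref->{latt}
                 && defined $ref->{longt};
        push @points,
            { input => $line, lat => $ref->{latt}, lon => $ref->{longt} };
    }
    return @points;
}

# Stand-in geocoder for demonstration only: it "finds" any line that
# contains a digit and returns placeholder coordinates.
my $stub = sub {
    my ($text) = @_;
    return $text =~ /\d/ ? { latt => '0', longt => '0' } : undef;
};

my @letters = ('1084 Stroman Ave Akron', '', 'somewhere nice', '3057 W. Bailey Road');

print join("\t", @{$_}{qw(input lat lon)}), "\n"
    for batch_locate($stub, @letters);
```

Lines that cannot be geocoded are skipped rather than reported, which keeps the sketch short; a real run would probably want to log them so Santa knows which letters need a second look.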
It may be slow sometimes. I've squeezed geocode.xyz into a small t1.micro AWS server (which is free) with 1 GB of RAM and 1 vCPU (yes, Perl runs reasonably fast on that). There is no rate limiting or throttling either, so that may be a factor too (depending on how people are abusing the API). If that does not cut it for you, get your own server on AWS with the provided server image. (I suspect a P2 GPU instance will perform quite well.)
Also, geocode.xyz is limited to about 50 European countries. geocoder.ca covers North America.
The day of a geoparser with worldwide coverage is not far off. There is a pretty good chance that will happen before the New Year. It could be one of Santa's gifts, along with a faster server to run it on.
And that's where it's at!
* Geo::Coder::OpenCage
* Text::NLP
* Geocode.xyz
* Geocoder.ca

