Warning: this documentation, while probably correct, may be a bit out of date.
This gem lets you write bots that fetch and format data for easy import into OpenCorporates, the largest openly licensed database of companies in the world. It also aims to be a curated set of tools for retrieving, formatting and importing data on a regular basis.
By including OpencBot you get access to a number of methods for setting up, writing to and reading from a local SQLite database in which the data can be stored. A bot is expected to expose two class or module methods: `update_data` and `export_data`. If the exported data is in the correct format, it can be seamlessly imported into OpenCorporates.
## How to install/create a bot
```sh
mkdir your_bot_name
cd your_bot_name
curl -s https://raw.github.com/openc/openc_bot/master/create_bot.sh | bash
```

## Required methods
A standard bot module should look like this:
```ruby
module MyBot
  extend OpencBot
  extend self

  def update_data
    # fetch or scrape data and store (possibly in local SQLite database)
  end

  def export_data(options={})
    # return data (possibly from the SQLite database)
  end
end
```

If you follow the conventions and use these methods (and you must do so in order for this to validate), there are several tasks available to you to run and test the data:
```sh
bundle exec openc_bot rake bot:create # creates the bot in the first place
bundle exec openc_bot rake bot:run    # runs the #update_data method
bundle exec openc_bot rake bot:export # runs the #export_data method and outputs data to stdout as JSON
bundle exec openc_bot rake bot:test   # validates that the exported data conforms to the basic data structure expected
```
For a simple licence bot, the test is quite thorough and includes checking against a JSON schema.
#### NB: the data, db and tmp directories should not be committed to git

```
root
 |_ config.yml # A YAML file with configuration for the bot
 |_ data/      # Put persistent data in here
 |_ db/        # This is where the sqlite database will be stored
 |_ lib/       # For the code itself
 |_ spec/      # For the specs
 |_ tmp/       # Temporary store
```
Data in data/ and db/ will be persisted through deployments, but tmp/ will not be persisted.
We expect licence data as a hash with the following keys:
- `:sample_date`
  - required (if `:end_date` is not provided)
- `:start_date`
  - optional
- `:start_date_type`
  - required if `:start_date` is present
  - one of `"="`, `"<"` or `">"`
- `:end_date`
  - optional (if `:sample_date` is provided)
- `:end_date_type`
  - required if `:end_date` is present
  - one of `"="`, `"<"` or `">"`
- `:company`
  - required
  - a hash with the following keys:
    - `:name`
      - required
      - a string of the name of the company
    - `:jurisdiction`
      - required
      - a string of the jurisdiction, e.g. `"us_ca"`
- `:source_url`
  - required
  - a string of the URL of the data
- `:data`
  - required
  - an array containing a single hash, with the following keys:
    - `:data_type`
      - required
      - must be `:licence`
    - `:properties`
      - required
      - a hash with the following keys:
        - `:licence_number`
          - optional
        - `:jurisdiction_classification`
          - required
          - an array of strings that describe the licence or the licensed company, using the vocabulary of the data source; examples might be "foreign bank branch", "co-operative credit", "motor vehicle finance", "trust company"
        - `:oc_classification`
          - not required yet
          - an array of strings that describe the licence or the licensed company, taken from a vocabulary list provided by OpenCorporates
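Putting this together, a minimal licence record might look like the following sketch (the company, dates, URL and classification values here are invented for illustration):

```ruby
{
  :sample_date => "2013-09-10",
  :company => {
    :name         => "Acme Widgets Inc.",
    :jurisdiction => "us_ca"
  },
  :source_url => "http://www.example.com/licences/12345",
  :data => [
    {
      :data_type  => :licence,
      :properties => {
        :licence_number              => "12345", # optional
        :jurisdiction_classification => ["trust company"]
      }
    }
  ]
}
```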
Imagine you are interested in mining licenses in Liliput and Brobdingnag, and you want to provide this data to OpenCorporates. You find a website that lists mining licenses for these jurisdictions, so you write a bot that can submit each license.
You find that Liliputian licenses have a defined start date and a defined end date, which means you can explicitly say "this license is valid between 1 June 2012 and 31 Aug 2013" for a particular license.
In this case, you would submit the data with a `start_date` of `2012-06-01` and an `end_date` of `2013-08-31`, and a `start_date_type` of `=` and an `end_date_type` of `=`. You would also submit a `sample_date` for that document, which is the date on which the license was known to be current (often today's date, but sometimes the reporting date given in the source).
However, you find that the Brobdingnagian source only tells you about currently issued licenses. As a bot writer, all you can say of a particular license is "I saw this license when we ran the bot on 15 January 2012". In this case, you would leave `start_date` and `end_date` blank, and submit a `sample_date` of `2012-01-15` instead. If you subsequently see the license on 15 February, you'd submit exactly the same data with a new `sample_date`.
This means OpenCorporates can infer, based on the running schedule of the bot and the `sample_date`s of its data, the dates between which a license was valid (in this case, between 15 January and 15 February).
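As a sketch, the date-related fields for the two cases above would look like this (values invented for illustration, with `:company`, `:source_url` and `:data` filled in as described earlier):

```ruby
# Liliput: explicit dates taken from the source document
liliput_licence = {
  :start_date      => "2012-06-01",
  :start_date_type => "=",
  :end_date        => "2013-08-31",
  :end_date_type   => "=",
  :sample_date     => "2013-01-20" # the date the license was known to be current
  # ... plus :company, :source_url and :data
}

# Brobdingnag: no explicit dates, just the date we saw the license
brobdingnag_licence = {
  :sample_date => "2012-01-15"
  # ... plus :company, :source_url and :data
}
```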
To summarise, there are three kinds of dates that OpenCorporates deals with:

- The date on which an observation was true: the `sample_date`. This is the date of a bot run, or a reporting date given in the source document. Every observation should have a sample date.
- A `start_date` and/or `end_date` defined explicitly in the source document.
- A `start_date` or `end_date` that has not been provided by the source, but which OpenCorporates can infer from one or more sample dates.
All data sources are different, so the following are just pointers as opposed to rules.
Think about breaking the scraping process down into three stages; this is sometimes referred to as "Extract, Transform, Load". "Extract" means saving the pages/data to the `data` folder. "Transform" means loading these files from the `data` folder and parsing them into the right format (probably a hash). The final step, "Load", simply means saving them to the database using the `save_data` method.
```ruby
module MyBot
  extend self

  SOURCE_FILE_URL = 'http://www.dfi.utah.gov/Download/INSTADDR.TXT'

  # ... other methods

  def extract
    src = MyBot.scrape(SOURCE_FILE_URL)
    outfile = File.expand_path("../data/utah_institutions_file.txt", File.dirname(__FILE__))
    File.open(outfile, "w") { |f| f.write src }
  end

  # ... other methods
end
```

Saving to disk first provides some benefits in terms of being able to re-run a scraper on failure. It also makes it easier to trace if there have been problems with the data.
```ruby
require 'csv'

module MyBot
  extend self

  # ...

  def transform
    src_path = File.expand_path("../data/utah_institutions_file.txt", File.dirname(__FILE__))
    src = IO.read(src_path)
    CSV.parse(src, :headers => true, :header_converters => :symbol).collect do |row|
      new_row = row.to_hash.merge(:retrieved_at => Time.now)
      new_row[:company_name] = row[:name].gsub(/[[:space:]]+/, ' ') # for example
      new_row # this is now a data hash ready for import
    end
  end

  # ...
end
```

This is a very simple example. You will probably need to break the process out into more methods, but consider keeping them "wrapped" in a `transform` method so that you and other scraper authors can see what is going on in future.
```ruby
module MyBot
  extend self

  # ...

  def load
    data = transform # from our previous example
    save_data([:uniq_id_field], data)
  end
end
```

This is fairly straightforward using one of the included helper methods (see "Helper methods").
Why sqlite? It's important to be able to view and query any data you gather in order to check its accuracy and quality. We use sqlite as an interim storage method because it has very few external dependencies and works well in this single-user environment. See the "Working with sqlite" section for more details on how to query/check data with sqlite.
Specs are required 😺. If nothing else, specs help to explain what you were trying to do with a particular method. We encourage all bot authors to take the time to write specs, as it pays off in the long run.
Given that scraping involves retrieving remote resources, be careful when writing specs that they don't request files from the web every time they run, otherwise you might find yourself getting blocked! You can achieve this by stubbing out the `scrape` method and returning content from the `spec/dummy_responses` folder (see `spec_helper.rb` for details).
```ruby
describe MyBot do
  it "should parse a HTML file" do
    response = MyBot.scrape("http://my-interesting-data-source.com")
    MyBot.parse(response).should_not be_empty
  end

  # NO - this will hit the network every time you run the test.
  # Not only is this slow, but you're putting unnecessary load
  # on the data source.
end
```

```ruby
describe MyBot do
  before :each do
    MyBot.stub(:scrape).and_return(dummy_response('path/to/saved/html/response/in/spec/dummy_responses/sample.html'))
  end

  it "should parse a HTML file" do
    response = MyBot.scrape("http://my-interesting-data-source.com")
    MyBot.parse(response).should_not be_empty
  end

  # YES - this only uses a file on your local disk.
  # This makes for a very fast test - the only downside is
  # that you have to keep your sample responses up to date
  # if they change (html layout for example) on the data source.
  # `stub` replaces the method that you pass in, making sure the
  # original doesn't get called and the dummy is called instead.
  # `and_return` is a way of simulating what that method would have responded with.
  # `dummy_response` is defined in the sample spec_helper.rb file.
end
```

For general advice on using RSpec, there is a good slide deck here: http://kerryb.github.io/iprug-rspec-presentation/ For general RSpec style guidelines, http://betterspecs.org/ is worth a read.
It's not unusual for scrapes to take several days, and power cuts and accidental keypresses do happen. Your script should contain code that allows it to resume where it left off if interrupted; for example, storing the latest value of a counter, or logging some identifier for each record which allows it to be re-fetched. The `save_var` and `get_var` methods documented below are useful for this; or you can log identifiers to a sqlite table, as in the sketch below.
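A minimal sketch of the identifier-logging approach, reusing the `save_data` and `select` helpers documented under "Helper methods" (the `processed_ids` table, `record_ids` collection and `process` method are invented for illustration, and we assume `select` returns rows as hashes keyed by column name):

```ruby
# on restart, find out which records we've already handled
# (inline rescue covers the very first run, before the table exists)
done = MyBot.select('id FROM processed_ids').collect { |row| row['id'] } rescue []

record_ids.each do |record_id|
  next if done.include?(record_id)
  process(record_id) # your scraping/parsing work for one record
  # log the id so an interrupted run can skip it next time
  MyBot.save_data([:id], [{:id => record_id}], 'processed_ids')
end
```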
It's always a good idea to look at the data you've collected to see if you're happy with it. There are several good sqlite clients available, or alternatively you can use the command line: `sqlite3 path/to/my/database.db`. Check for any obvious issues before submitting your bot. See the "Working with sqlite" section for examples of how you can analyse and aggregate your data to check for issues.
You often learn a lot about a domain whilst working on a scraper, and it's important that this knowledge is saved with the bot. Follow the instructions in the generated README file and you should be off to a good start. It's important for others to be able to review and understand what you've written in case they need to work on it in future.
- **Simple** - retrieving a static file in a standard format (e.g. CSV)
- **Incremental** - scraping a site by incrementing some id parameter, e.g. a query string param `?id=47`
- **Iterative** - working over a range of possible inputs, e.g. searching for all the letters from a..z
Each of these has its own challenges, and some data sources require a combination of all three. With the incremental and iterative approaches, it's a good idea to keep track of where you are up to in case you need to stop/restart the bot (see the `get_var` and `save_var` example in "Helper methods").
The bot includes a copy of the sqlite3 gem, but you might need to install the sqlite3 program itself using your package manager (brew, apt-get, yum, etc.).
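For example (exact package names may vary by platform):

```sh
# Debian/Ubuntu
sudo apt-get install sqlite3

# macOS with Homebrew
brew install sqlite3
```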
In the root folder of your bot, you should be able to run

```sh
sqlite3 db/mybotname.db
```

where `mybotname` is the name of your bot. This will open up an sql prompt. A few useful commands:
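- `.help` - show all the available dot commands
- `.tables` - list the tables in the database
- `.schema ocdata` - show the `CREATE` statement for a table (`ocdata` is the default table used by `save_data`)
- `.quit` - exit the prompt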
These will let you know what to do in most cases. Commands beginning with a dot (`.`) have a special meaning in the sqlite prompt.
`.mode line` gives a nice format for reviewing records at the shell:

```
sqlite> .mode line
sqlite> SELECT * FROM us_nj_banks ORDER BY RANDOM() LIMIT 2;
        licensee_name = RICHARDSON IMPORTS INC
              ref_num = 8901006
business_name_address = RICHARDSON IMPORTS INC1230 ROUTE 73MOUNT LAUREL, NJ 08054
         license_type = MOTOR VEHICLE INSTALLMENT SELLER (CORPORATION)
               status = ACTIVELY LICENSED

        licensee_name = ISAAC,RHONDA
              ref_num = 0805480
business_name_address = EMPIRE TODAY LLC1200 TAYLORS LANESUITE 2BCINNAMINSON, NJ 08077
         license_type = HOME REPAIR SALESMAN
               status = ACTIVELY LICENSED
```
`.mode html` is a potentially useful way of reviewing larger numbers of records without a dedicated sqlite program. You have to add in the opening and closing table tags yourself though.

```
sqlite> .mode html
sqlite> .output result.html
sqlite> SELECT * FROM us_nj_banks ORDER BY RANDOM() LIMIT 100;
-- This would output 100x <tr> tags into a file called result.html in the project root
```

There are also lots of useful GUI clients for sqlite3 if you'd rather not work at the command line.
By extending OpencBot, you'll have access to the following methods which may be helpful in obtaining, saving and transforming data. More detailed usage is found in the generated code and README for new bots.
`save_data(uniq_keys, values_array, table_name='ocdata')` - the primary method of saving to the sqlite db.

```ruby
data = [
  {:name => "Acme Corporation Ltd.", :type => "Investment Bank"},
  {:name => "Acme Holdings Ltd.",    :type => "Bank Holding Company"}
]
MyBot.save_data([:name], data, 'ocdata')
```

This method saves the data in an sqlite database named after the bot's class or module. If no table name is given, `ocdata` will be used; the table will be created if it doesn't already exist. The first parameter is an array of unique-key field names, and the data should be an array of hashes whose keys become the field names. If the table has not been created, or field names are given that are not in the table, they will be created. Note that `save_data` currently saves all values as strings.
`insert_or_update(uniq_keys, values_array, table_name='ocdata')` - update or insert data based on existing keys in the sqlite db. Similar to `save_data`, but attempts to update an existing row based on the unique keys before inserting, as in the example below.
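For example, reusing the `data` array from the `save_data` example above:

```ruby
# a sketch: rows whose :name matches an existing row are updated in place,
# anything else is inserted as a new row
MyBot.insert_or_update([:name], data)
```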
`save_var(name, value)` - save a value to the database

`get_var(name, default=nil)` - retrieve a value, with a fallback if it doesn't exist

```ruby
current_id = MyBot.get_var('current_id', 1) # get the last good id, otherwise return 1
long_scraping_process.each do |page|
  save_to_disk(page)
  MyBot.save_var('current_id', current_id)
  current_id += 1
end
```

These allow bot authors to store small bits of information between runs. Unfortunately, long-running bots tend to get stopped unexpectedly in development (power cuts, connectivity failures, etc.), so these methods are useful for picking up where you left off.
`select(sqlquery)` - convenience method for selecting records from the sqlite db.

```ruby
MyBot.select('* from ocdata') # return everything
```

`save_run_report(reporthash)` - to be called at the end of each run.
```ruby
def update_data
  super_complicated_scrape_task
  save_run_report(:status => 'success')
end
```

Used by OpenCorporates to monitor the status of the bot. Please include relevant information such as failures and error messages in the report hash. `Time.now` is added to the output automatically.
`scrape(url, params=nil, agent=nil)` - fetches content from a webserver.

```ruby
MyBot.scrape("https://google.com") # returns a string of the page source
```

Retrieves a resource from the web and returns it as a string. Uses the HTTPClient gem internally, which handles SSL and gzipped content.
- Fork it
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Add some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create new Pull Request