This R data package provides access to texts of Peace Books (PB), reference reports prepared by the Historical Section of the British Foreign Office between 1918 and 1919 for use by the British Delegation at the Paris Peace Conference in 1919. The reports were published in 1920 in English. The individual volumes contain reference information regarding geography, culture, population, and economy of almost all countries and territories in the world.
The package enables text analysis of the individual volumes. The list of references was removed from the corpus together with appendices, tables, tables of content, and other paratext.
The package contains five datasets covering full text, bibliographic references, sentiment tokens, and administrative metadata.
| dataset title | purpose | lines | columns |
|---|---|---|---|
| peace_books_pages | Contains full-text. The page structure of the individual published volumes remains preserved. | 370112 | 6 |
| peace_books_volumes | Contains full-text. The texts are available as continuous text. The structure was dissolved, headings, section titles were removed. | 160 | 5 |
| peace_books_authorities | Contains bibliographic data of works used for compilation of the Peace books. | 4830 | 10 |
| peace_books_ner_sentiment | Table that only contains tokens that were assigned a sentiment value (positive / negative) by the Bing sentiment dictionary. |
107013 | 7 |
| peace_books_ids | Administrative metadata for each volume that allows for cross-reference between different versions of the corpus. | 168 | 8 |
remotes::install_github("standap/peaceBooksR")
After installing peaceBooksR, just run the name of the dataset. The data frame is ready to be used as is. The data are already in the tidy format.
# Number of lines in each volume
peace_books_pages |>
count(volume_title, sort = TRUE)
# distribution of volumes by continent
peace_books_volumes |>
count(continent)
# Distribution of languages of resources in the authorities datasets
peace_books_authorities |>
count(lang_label, sort = TRUE) |>
head()
# Number of tokens with positive/negative sentiment
peace_books_ner_sentiment |>
count(sentiment)
# Number of volumes in the first bound volumes of the `Peace books`.
peace_books_ids |>
count(htid, title_vol) |>
head(5)
The introductory section, some 36 lines, that is at the beginning of each digitized volume were removed. This section contains administrative information about each volume, including its title, authors, and publisher, as well as URL to the source file, information regarding copyright and OCR procedures. Subsequently, running headers and footers, empty lines, page numbers, and other paratext elements were removed, as well. Word strings split by hyphens at the end of the lines were joined. Finally, m- and n-dashes, as well as quotation marks were removed. Most of these removals were executed by a sed script and regular expressions, but many typos and extraneous characters were removed manually during visual inspections of the texts.
The page-based organization of the volumes with line breaks after each line of text was preserved in the dataset peace_books_pages. Some methods require the text as an uninterrupted flow; the dataset peace_books_volumes contains 160 rows with the full text of each volume in one row.
Although the lists of references (authorities) were removed from the corpus, they were transformed into a separate dataset peace_books_authorities that is part of this data package, as well.
The corpus was created from twenty five bound volumes in the Hathi Trust Digital Library.
The German Opinion section of volume 24 is not included in the corpus. The volume's metadata is included in the table below for completeness.
Table: List of bound volumes and their titles that constitute the peaceBooks corpus.
| htid | title | volume |
|---|---|---|
| mdp.39015039471258 | Austria-Hungary (I) | 1 |
| mdp.39015064326716 | Austria-Hungary (II) | 2 |
| mdp.39015010785791 | Balkan States (I) | 3 |
| mdp.39015064326732 | Balkan States (II) | 4 |
| mdp.39015064326740 | Benelux | 5 |
| wu.89004522017 | France, Italy, Spain, etc. | 6 |
| mdp.39015011939983 | Germany | 7 |
| mdp.39015064326864 | Poland and Finland | 8 |
| iau.31858044687006 | The Russian Empire | 9 |
| mdp.39015010785874 | Mohammedanism: Turkey in Asia (I) | 10 |
| iau.31858044687022 | Turkey In Asia (II) | 11 |
| iau.31858044687030 | China, Japan, Siam | 12 |
| iau.31858044687048 | Persian Gulf: French and Portuguese Possessions | 13 |
| iau.31858044687055 | Dutch and British Possessions | 14 |
| iau.31858044687063 | Partition of Africa: British Possessions (I) | 15 |
| iau.31858044687071 | British Possessions (II): The Congo | 16 |
| iau.31858044687089 | French African Possessions | 17 |
| iau.31858044687097 | German African Possessions (Late) | 18 |
| iau.31858044687105 | Portuguese Possessions | 19 |
| iau.31858044871808 | Spanish And Italian Possessions: Independent States | 20 |
| iau.31858044687113 | North, Central And South America: Atlantic Islands | 21 |
| iau.31858044687121 | Pacific Islands | 22 |
| iau.31858044687139 | International Affairs | 23 |
| iau.31858044687147 | Congresses: German Opinion | 24 |
| iau.31858044687154 | Indemnities, Plebiscites, Indemnities In Previous Wars | 25 |
To cite this package, use the citation information provided in CITATION.cff or click "Cite this repository" in the sidebar.
