---
title: FAQ
layout: default
---
<!DOCTYPE html>
<html lang="en">
<head>
    <title>{{ page.title }}</title>

    <!-- Recommended meta tags -->
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width,initial-scale=1.0">

    <!-- PyScript CSS
    <link rel="stylesheet" href="https://pyscript.net/releases/2024.1.1/core.css">
     -->
    <!-- This script tag bootstraps PyScript -->
    <script type="module" src="https://pyscript.net/releases/2024.1.1/core.js"></script>
</head>
<body>
    <h1>FAQ</h1>
    <p>This page answers some common questions about the site, and about analyzers more generally.</p>
    <h2>What analysis option should I choose?</h2>
    <div id="complexity_explainer"></div>
    <p>As neat as the full analysis of a word, sentence or story is, people probably want something further to happen besides just getting the raw analysis. Here we describe (some of) what has already been done. If you have an idea for what else should be done, please reach out!</p>
    <hr/>
    <h3>Glossary</h3>
    <p>Most people probably want to quickly see which words are used in a story and what they mean without fussing over grammar (maybe you are about to teach the story to your class, and you want to start practicing vocabulary?). That's why we made the glossary option. It will give a list of live links to dictionary entries for all of the words that the analyzer was able to analyze. Each row of the table also includes part of speech information, a very rough English equivalent for the word, and how often the word appears in the story. Definitely take advantage of the download buttons, because then you can use a spreadsheet program to re-order the rows so that all of the vocabulary of a particular type is grouped together, or so that you can see what all of the most frequent words are. There will also be a list of words that were not analyzed and that will require more work on your part to figure out. This seems like a chore, and it can be a real slog, but it does have the benefit of drawing attention to things that could use a bit more attention.</p>
    <hr/>
    <h3>Crib Sheet</h3>
    <p>Where the glossary doesn't waste time getting you straight to the dictionary, the crib sheet covers you with a thorough explanation of each word exactly as it is written in the story (maybe you already taught the vocabulary, but you want students to be able to look up specific words if they get confused by the prefixes and suffixes?). The crib sheet shows each analyzed word, the dictionary entry it corresponds to, codes for the grammar in the prefixes and suffixes on the word, a rough English equivalent, and how often that specific word appears. If different forms of the same word are used, they will all be listed.</p>
    <hr/>
    <h3>Frequency</h3>
    <p>It can be really useful to know which words appear the most in a text. Without an analyzer, this is hard to determine in a language like Nishnaabemwin, where there can be lots of different prefixes or suffixes added on to the base, or root, word. When you select this option, the site looks at the analysis of the word instead of the word itself. All of the words that are analyzed as having the same root are grouped together and counted.</p>
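    <p>As a rough illustration of the grouping step, here is a minimal Python sketch. It assumes each word has already been analyzed into a (word, root) pair, with <code>None</code> as the root of an unanalyzed word. The function name <code>root_frequencies</code> and the placeholder words and roots are hypothetical, and are not taken from the site's own code.</p>

```python
from collections import Counter


def root_frequencies(analyzed_words):
    """Count words by the root in their analysis, not by their spelling.

    analyzed_words: list of (word_as_written, root) pairs.
    A root of None marks a word the analyzer failed on; it is skipped.
    """
    counts = Counter(root for _, root in analyzed_words if root is not None)
    # most_common() orders roots from most to least frequent.
    return counts.most_common()


# Placeholder data: two different written forms share ROOT_A.
example = [
    ("form_a1", "ROOT_A"),
    ("form_b1", "ROOT_B"),
    ("form_a2", "ROOT_A"),
    ("unknown", None),
]
frequencies = root_frequencies(example)
```

    <p>In this sketch the two written forms of ROOT_A are counted together, which is the point of counting analyses instead of spellings.</p>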
    <hr/>
    <h3>Sentence Sort (by complexity)</h3>
    <p>Sentences may be hard or simple. The various "sentence sort" options were created to help students/teachers zero in on the sentences at the right level for them.</p>
    <p>"Sentence sort (by complexity)" works by counting the amount of grammatical information. That is, it counts the number of narrow morphosyntactic features in a sentence. The idea here is that longer sentences/sentences with more morphological information are more complex. This score seems to provide a fairly good distribution of scores that matches our own subjective judgements. Off the cuff, a score of 20 seems to be a pretty middle of the road score. Note you might say that the complexity sort actually punts on the question of finding sentences "at the right level" for someone, because it only measures the quantity of grammatical information, not the type of grammatical information.</p>
    <p>At some point it would be good to scale the complexity score according to hard/easy or by grade level, but we do not currently have data about how difficult various texts are for learners or readers.</p>
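    <p>The counting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes a narrow analysis is a string of items joined by "+", that exactly one of those items is the root, and that every other item counts as one feature. The function names and the placeholder analyses are hypothetical, and the site's actual scoring may differ.</p>

```python
def count_features(narrow_analysis):
    """Count the grammatical tags in one word's narrow analysis.

    Assumption: items are joined by "+", and one item is the root.
    """
    return len(narrow_analysis.split("+")) - 1


def sentence_complexity(narrow_analyses):
    """Sum the feature counts of every analyzed word in a sentence.

    Unanalyzed words are given as None and contribute nothing.
    """
    return sum(count_features(a) for a in narrow_analyses if a is not None)


def sort_by_complexity(sentences):
    """Return sentence labels ordered from lowest to highest score.

    sentences: dict mapping a sentence label to its list of narrow analyses.
    """
    return sorted(sentences, key=lambda s: sentence_complexity(sentences[s]))


# Placeholder analyses, not real output from any analyzer.
example_sentences = {
    "s1": ["ROOT_A+VAI+3", None],
    "s2": ["ROOT_B+NA"],
    "s3": ["1+ROOT_C+VTA+Ind+3+Pl"],
}
ordered = sort_by_complexity(example_sentences)
```

    <p>Because the score is a plain count, a sentence with one heavily inflected verb can outrank a longer sentence made of short, lightly marked words.</p>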
    <hr/>
    <h3>Sentence Sort (by verb type)</h3>
    <p> The verb type sorting option lets students or teachers access sentences with the same types of grammatical structures. The different verb types of Nishnaabemwin (VTA, VTI, VAIO, VAI, VII, see the grammatical code explanation <a href="./explanation.html">here</a> if these abbreviations are not familiar) differ a lot in what information they convey. The VII verbs describe inanimate objects and only have a handful of different suffixes that can be added to them, mostly describing whether there is one or many inanimate things being described. VAI verbs describe actions done by animate entities, and these verbs have machinery that encodes (among other things) whether I/you/someone else did the action, and how many of them there were. VAIO and VTI verbs are much the same as VAI verbs, but they also indicate some information about the thing that the action was done to. Finally VTA verbs describe things that are done to animate entities, and there is an enormous amount of machinery used to track whether I/you/someone else did the action, and whether it was done to me/you/someone else (among other things).</p>
    <p>So that students or teachers can see sentences containing these different verb types in one place (and then compare the instances of the same verb types together to see how they differ), the verb type sorting option was made. The sentences within each block of verbs are sorted by their complexity score. Note that a sentence will be listed in multiple sections if it has verbs of different types in it. Also, to reduce clutter/give students something productive to struggle with, there is no indication of where the verb is in the sentence. This could obviously be changed if requested, but also see below.</p>
    <p>Interestingly, stories seem to use roughly the same mix of verb types. At least, I did a very quick and dirty investigation by scoring verbs as follows (where higher numbers are more complex): VTA=4, VAIO=3, VTI=3, VAI=2, VII=1 (see the grammatical code explanation <a href="./explanation.html">here</a> for more information on these codes). In a very small sample of texts, the average on this measure did not seem to differ much between texts.  </p>
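    <p>For concreteness, the quick and dirty measure could be computed along these lines. The weights are the ones given above; the function name <code>mean_verb_type_score</code> and the sample list of verb types are hypothetical, and this is a sketch, not the script that was actually used.</p>

```python
# Weights from the text above: higher numbers mean more complex verb types.
VERB_TYPE_WEIGHTS = {"VTA": 4, "VAIO": 3, "VTI": 3, "VAI": 2, "VII": 1}


def mean_verb_type_score(verb_types):
    """Average the weights of the verb types found in a text.

    verb_types: list of verb type codes, one per verb in the text.
    Codes without a weight are ignored. Returns None if no verb is scored.
    """
    scores = [VERB_TYPE_WEIGHTS[v] for v in verb_types if v in VERB_TYPE_WEIGHTS]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Made-up mix of verb types for illustration only.
sample_score = mean_verb_type_score(["VTA", "VAI", "VII", "VII"])
```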
    <hr/>
    <h3>Verb Collation</h3>
    <p>The verb collation option allows you to isolate the verbs from a story for comparison. Within each block of verbs of the same type, the verbs are sorted by the broad analysis. This means that verbs with the same subject (the doer) will be grouped together; within that, verbs with the same object (the do-ee) are grouped together; and within that, verbs that fit into the same context are grouped together (subordinate clauses/relative clauses/questions, aka 'conjunct order' verbs; commands, aka 'imperative order' verbs; and main clause verbs, aka 'independent order' verbs), and so on down the list of categories that verbs can show.</p>
    <p>A small tangent, since we mentioned the conjunct order/independent order split. One possible metric to score texts on is the proportion of verbs in conjunct vs independent order. If there are more conjunct order verbs, the score is positive. The score is negative if 50% or more of the verbs are in independent order. In my view, this score does not say much about how hard a text is, because both independent order and conjunct order are tricky in their own ways. The conjunct order has a lot of irregularities, but conjunct order verbs put all of the characteristics of the doer/do-ee in one place. The independent order is very regular, and as the order that appears in main clauses, it will be something you use a lot. The thing that is hard about it is that information about the doer/do-ee is spread across multiple affixes, plus there are tricky questions about which vowels will appear. However, it may be useful to see how far a text leans in a particular direction, so you have an idea of what kind of verbs you are going to be getting.</p>
    <hr/>
    <h3>Base Analysis</h3>
    <p>The base analysis option shows all of the analytical information for the entire story. The words and sentences are kept in their original order.</p>
    <hr/>
    <h3>Triage</h3>
    <p>The "triage" options are intended for easily identifying words that the analyzer failed on. This is mostly used for bug testing.</p>
    <p>Speaking of bugs, while the analyzers are quite reliable, they are not perfect (at some point in the near future, I intend to post performance data for the analyzers, especially because the analyzers will be developed further). There could be thoughtless mistakes, and in some places I had to make educated guesses about how the language works, and I could have been wrong. The site also only presents one analysis for a word, though there may be several possible analyses.</p>
    <hr/>
    <hr/>
    <h2>Why does the site say a word is unanalyzed?</h2>
    <p>The biggest reason is that it is tricky to spell in any language that does not have a widely agreed-upon spelling system, and it is especially tricky in the eastern Nishnaabemwin dialects, where major vowel losses happened fairly recently. It is not practical to compensate for this given the resources available for this site. Instead, we have to use strict analyzers that fail to analyze almost any word that diverges even slightly from "correct" spelling rules.</p>
    <hr/>
    <h2>Which analyzer should I pick?</h2>
    <div id="process_explainer"></div>
    <p>When you ask the site to perform an analysis, the site takes the text you entered and sends each word to the first analyzer you specified in step 0 on the <a href="./index.html" target="_blank" rel="noopener noreferrer">analysis page</a>. Each later analyzer is then applied only to the words that the previous analyzer failed on. Be sure to give the highest priority to the analyzer that best matches the text overall.</p>
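    <p>The fallback process can be pictured with a small Python sketch. Everything here is a stand-in: the function name <code>cascade</code>, the analyzer names, and the toy lookup tables are hypothetical, and the real analyzers are far more than lookup tables. The sketch only assumes that an analyzer returns <code>None</code> when it fails on a word.</p>

```python
def cascade(words, analyzers):
    """Try each analyzer in priority order on the words still unanalyzed.

    words: list of words from the text.
    analyzers: list of (name, function) pairs, highest priority first.
        Each function returns an analysis, or None on failure.
    Returns (results, unanalyzed): results maps each analyzed word to
    (analyzer name, analysis); unanalyzed lists words every analyzer failed on.
    """
    results = {}
    remaining = list(words)
    for name, analyze in analyzers:
        still_failed = []
        for word in remaining:
            analysis = analyze(word)
            if analysis is None:
                still_failed.append(word)
            else:
                results[word] = (name, analysis)
        # Only the failures are handed on to the next analyzer.
        remaining = still_failed
    return results, remaining


# Toy analyzers built from lookup tables (placeholders, not real data).
first_choice = {"aaa": "A1"}.get
second_choice = {"aaa": "never used", "bbb": "B1"}.get

results, unanalyzed = cascade(
    ["aaa", "bbb", "ccc"],
    [("first", first_choice), ("second", second_choice)],
)
```

    <p>Note that "aaa" never reaches the second analyzer, because the first one already succeeded on it. This is why the order of the analyzers matters.</p>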
    <p>So which analyzer to choose? The analyzers mainly differ in spelling practice, but now we also have enough analyzers that we can support different dialects! </p>
    <!--p>The relaxed analyzers are large, slow, and less accurate than the strict analyzers. They are so large that they actually are not hosted on this site, and have to be downloaded in the background.</p-->
    <p>There are currently four analyzers to choose from:</p>
    <ul>
        <li>Nishnaabemod
            <ul>
                <li>Dialect zone: Eastern</li>
                <li>Spelling: Chuck Fiero's double vowel system (see the <i>Eastern Ojibwa-Chippewa-Ottawa Dictionary</i> by Richard Rhodes)</li>
                <li>Grammar/Vocabulary: The vocabulary of this analyzer is based on the <a href="https://dictionary.nishnaabemwin.atlas-ling.ca/#/help" target="_blank" rel="noopener noreferrer"><i>Nishnaabemwin Online Dictionary</i></a>, and the grammar is based on Professor Rand Valentine's <i>Nishnaabemwin Reference Grammar</i>.</li>
                <li>Other notes: These Eastern dialects have dropped many vowels and so can be called Nishnaabemwin instead of Anishinaabemowin. There has been some evolution in the designations of the spelling styles that each analyzer is tuned for. For a while I used the last name of a person associated with the system, but that was a bit cumbersome to remember and didn't quite sit right (see below). The current practice just shows how an example word <i>Nishnaabemod/Nishnaabemat/Anishinaabemod</i> 'if he/she speaks Nishnaabemwin' is spelled (hat tip to Professor Mary Ann Corbiere for the final, concise name idea). Hopefully this is transparent and informative.</li>
            </ul>
        </li>
        <li>Nishnaabemat
            <ul>
                <li>Dialect zone: Eastern</li>
                <li>Spelling: A popular modification of Chuck Fiero's double vowel system. This style is used in the <a href="https://dictionary.nishnaabemwin.atlas-ling.ca/#/help" target="_blank" rel="noopener noreferrer"><i>Nishnaabemwin Online Dictionary</i></a>.</li>
                <li>Grammar/Vocabulary: The vocabulary of this analyzer is based on the <a href="https://dictionary.nishnaabemwin.atlas-ling.ca/#/help" target="_blank" rel="noopener noreferrer"><i>Nishnaabemwin Online Dictionary</i></a>, and the grammar is based on Rand Valentine's <i>Nishnaabemwin Reference Grammar</i>.</li>
                <li>Other notes: This analyzer is the <i>Nishnaabemod</i> analyzer, with another layer to convert the <i>Nishnaabemod</i> spelling into <i>Nishnaabemat</i> spelling. Moving from <i>Nishnaabemod</i> spelling to <i>Nishnaabemat</i> spelling is seamless, but going the other way can produce hiccups. For a while I called this spelling system the 'Corbiere system' in honor of Professor Maanyaan/Mary Ann Corbiere, who has promoted it. However, when I asked Dr. Corbiere about this, she pointed out that the differences are fairly minor, so branding it with a totally different name seemed inappropriate.</li>
            </ul>
        </li>
        <li>Anishinaabemod
            <ul>
                <li>Dialect zone: Eastern</li>
                <li>Spelling: Chuck Fiero's double vowel system, but vowels have not been dropped.</li>
                <li>Grammar/Vocabulary: The vocabulary of this analyzer is based on the <a href="https://dictionary.nishnaabemwin.atlas-ling.ca/#/help" target="_blank" rel="noopener noreferrer"><i>Nishnaabemwin Online Dictionary</i></a>, and the grammar is based on Rand Valentine's <i>Nishnaabemwin Reference Grammar</i>.</li>
                <li>Other notes: This analyzer is the <i>Nishnaabemod</i> analyzer, but with the vowel dropping rule turned off and a couple of other minor adjustments. See <i>The Dog's Children</i> by Angeline Williams for examples of stories from what we are calling the Eastern or Nishnaabemwin region, but with dropped vowels retained, so that instead of, for instance, <i>kidod</i> you will find <i>ikidod</i> 'if he/she says'.</li>
            </ul>
        </li>
        <li>Anishinaabemod (Southwestern)
            <ul>
                <li>Dialect zone: Southwestern (Border Lakes/Minnesota/Wisconsin)</li>
                <li>Spelling: Chuck Fiero's double vowel system, also without dropping vowels.</li>
                <li>Grammar/Vocabulary: The vocabulary of this analyzer is based on the <a href="https://ojibwe.lib.umn.edu" target="_blank" rel="noopener noreferrer"><i>Ojibwe People's Dictionary</i></a>, and the grammar is based on paradigms collected by Professor Chris Hammerly.</li>
                <li>Divergence alert: This analyzer is not related to the various Nishnaabemwin analyzers above. It was written by a <a href="https://github.com/ELF-Lab/OjibweMorph" target="_blank" rel="noopener noreferrer">different team</a> (though for uninteresting reasons this site is using <a href="https://github.com/giellalt/lang-ciw" target="_blank" rel="noopener noreferrer">this version</a>). The Southwestern Anishinaabemowin analyzer uses a different set of grammatical abbreviations, so the "narrow" analysis field will look fairly different. See below for more discussion of the differences.</li>
                <li>Other notes: At the moment, we do not have access to terse one word translations for Southwestern Anishinaabemowin words.</li>
            </ul>
        </li>
    </ul>
    <hr/>
    <h2>What even is an analyzer?</h2>
    <div id="analyzer_explanation"></div>
    <p>An analyzer is a network that represents the possible words of the language. It might be helpful to think of it like a map of a road system, where there are towns and the roads that connect them, except this is a map of possible words, where each road is labeled with a letter, and a path through the map represents a single word. The network was built by hand to specify what the prefixes and suffixes of the language are, how they combine with the root words, and any modification that happens to a prefix/suffix/root combination.</p>
    <p>The site reads a word with the analyzer by starting at the beginning of the network, and spending each letter as "fuel" to get to the next town if the letter matches the letter labeling the road. Some towns in the network are designated as "pass through only towns", and others are "destination towns" where you are allowed to end a trip. If the site can get all the way to a destination town by using all of the letters in the word, it says that the word is a possible word in the language. Otherwise, it reports that the trip was a failure.</p>
    <p>There's one more aspect of the analyzer: as you travel along the roads, you can pick up souvenirs. In this case the souvenirs are "tags" that show what prefixes, root word, and suffixes you have encountered on the journey. When a successful trip is completed, the analyzer presents the tags in the order that they were collected in. This appears in the "narrow analysis" field when you select the "full analysis (interlinear)" option. Everything else on the site is based on this narrow analysis.</p>
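    <p>To make the road map picture concrete, here is a toy network in Python. It is a deliberately tiny sketch with made-up towns, letters, and tags; the names <code>ROADS</code>, <code>DESTINATIONS</code>, and <code>analyze</code> are hypothetical. It also simplifies by allowing only one road per letter out of each town, so it can only ever find one analysis per word.</p>

```python
# Each road is keyed by (town you are in, letter you spend) and leads to
# (town you arrive in, souvenir tag picked up on the way, or None).
ROADS = {
    (0, "a"): (1, None),
    (1, "b"): (2, "ROOT_AB"),
    (2, "s"): (3, "Pl"),
}

# Towns where a trip is allowed to end. Towns 0 and 1 are pass through only.
DESTINATIONS = {2, 3}


def analyze(word, roads=ROADS, destinations=DESTINATIONS, start=0):
    """Walk the network letter by letter, collecting tags along the way.

    Returns the tags joined by "+" if every letter is used up and the
    trip ends in a destination town; otherwise returns None.
    """
    town = start
    tags = []
    for letter in word:
        step = roads.get((town, letter))
        if step is None:
            return None  # no road labeled with this letter: trip fails
        town, tag = step
        if tag is not None:
            tags.append(tag)
    if town not in destinations:
        return None  # ran out of letters in a pass through only town
    return "+".join(tags)
```

    <p>With this toy network, "ab" and "abs" are possible words, while "a" fails because the trip stops in a pass through only town, and "abx" fails because there is no road labeled "x".</p>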
    <hr/>
    <h2>What is a conjugator?</h2>
    <p>A conjugator is an analyzer run backwards, so that instead of giving the analytical information for a word, it gives the word that corresponds to a given set of analytical information. I am working hard to make sure that the conjugator has an unintimidating interface for as many analytical options as possible, but if there is some special combination of narrow analysis codes that you want to try out, you can use the direct-to-narrow-analysis entry option at the bottom of the page.</p>
    <hr/>
    <h2>How are the Nishnaabemwin and Southwestern Anishinaabemowin analyzers different?</h2>
    <div id="analysis format"></div>
    <p>There are dialect differences that would be exhausting to talk about here (I also usually like to emphasize how much the dialects have in common). Otherwise, the main difference lies in some technical and fairly esoteric design decisions. The largest differences between the Nishnaabemwin analyzers and the Southwestern Anishinaabemowin analyzer are that the Nishnaabemwin analyzers are "terse" and "concrete", while the Southwestern Anishinaabemowin analyzer is what we might call "verbose" and "abstract". The Nishnaabemwin analyzers are terse because, for instance, they only mark plural and add nothing for the singular, leaving the default unstated. Saying that the Southwestern Anishinaabemowin analyzer is verbose means that either value of "singular" or "plural" is always stated, instead of leaving default values like "singular" unstated.</p>
    <p>The terseness of the Nishnaabemwin analyzers derives partially from the Nishnaabemwin analyzers being concrete, meaning that they tend to "follow the language". Nishnaabemwin only marks plural and adds nothing for the singular (like most languages), so the Nishnaabemwin analyzers do the same. The language also pervasively uses combinations of affixes to convey information (and some affixes in VTAs are basically instructions for how other combinations of affixes should be interpreted!). Since the Nishnaabemwin analyzers are concrete, they generally represent grammatical information as it comes up in the word, even if what might be expressed in one place in other languages is spread over multiple affixes. The Southwestern Anishinaabemowin analyzer is abstract, meaning that it combines the information from potentially multiple affixes. This means that the two analyzers drift even further apart in how they represent the language.</p>
    <p>To be clear, I do not think the verbose/abstract approach to analysis is "wrong". Ultimately, the same information is being conveyed by the two systems, so a lot of this comes down to preference for the casual user. The verbose/abstract approach used in the Southwestern Anishinaabemowin analyzer is direct (everything is fully written out and each tag states everything relevant about itself). It is also approachable, since the great majority of people who have studied languages have not studied Algonquian languages, so it feels more familiar to abstract away from the Algonquian specific combination system. The terse/concrete approach used in the Nishnaabemwin analyzer is granular and faithful to the actual language. Which you prefer is entirely up to you. That said, at some point, serious students of the language should probably have a granular, faithful understanding of how the affixes work. The Nishnaabemwin analyzer obviously supports this directly. How and when students should develop this understanding are important questions.</p>
    <hr/>
    <h2>Does the difference between the Nishnaabemwin and Southwestern Anishinaabemowin analyzers matter?</h2>
    <p>Higher level analyses that depend on the narrow analyses might not perform identically if you use the Southwestern Anishinaabemowin analyzer or the Nishnaabemwin analyzers. For instance, the complexity scoring system (see above) counts the amount of morphological information in the analysis. With the Nishnaabemwin analyzers, the complexity score will reflect how many affixes are in the word, while with the Southwestern Anishinaabemowin analyzer, it will behave more like a traditional word-counting complexity measure as is used in English reading grade level scores (though more "informationally heavy" word categories, like VTAs, will contribute more to the complexity score of a sentence than "informationally light" words like adverbs or unpossessed nouns).</p>
    <hr/>
    <h2>Why is the grammatical analysis presented in a "narrow analysis" format and a "broad analysis" format?</h2>
    <p>The biggest reason is that there are (at least) two ways to approach grammar. You might care about how a word is built from individual pieces, or you might be focused on "who did what to who". For nouns and "Series A"/"independent order" verbs, a prefix like <i>g(doo)-</i> tells you that you/you and me/you guys are involved, and then a suffix <i>-naa(n)/-mi(n)</i> will tell you if it is "you and me" or a suffix like <i>-waa</i> tells you that it is "you guys". For "Series B"/"conjunct order" verbs, it is all boiled into one suffix: <i>-(y)an</i> for "you", <i>-(y)ang</i> for "you and me", and <i>-(y)eg</i> for "you guys". If you want to track the differences between "Series A"/"independent order" and "Series B"/"conjunct order", the narrow analysis is for you. If you just care about "you" vs "you and me" vs "you guys", then the broad analysis is the format for you. Basically, the broad analysis is a bit abstract (combining information from multiple affixes into a single place), while the narrow analysis is very granular. For most purposes the broad analysis will probably be what people want.</p>
    <p>There's another benefit of making a broad analysis. Since the Nishnaabemwin analyzer is terse and granular and the Southwestern Anishinaabemowin analyzer is verbose and abstract, we can use the broad analysis as a way to consistently summarize grammatical information (abstractly, but still terse). The "broad" analyses should be consistent between the analyzers. Please contact the author of the site if you find something wrong.</p>
    <hr/>
    <p>Last updated: 4/3/2026</p>
</body>
</html>