Max Silberztein
Foundation
I constructed the first package of Finite State tools for Natural Language Processing, as well as the French DELAC-DELACF dictionaries for compound words, for my PhD research from 1986 to 1989 at the LADL (University of Paris 7-CNRS), under the supervision of Prof. Maurice Gross. The thesis was later published as: Max Silberztein, 1993. Dictionnaires électroniques et analyse automatique de textes : le système INTEX. Masson Ed.: Paris
In 1992 I started to work on INTEX, which is based on my PhD research. It is a linguistic development environment that includes large-coverage dictionaries and grammars, and parses texts of several million words in real time. I developed it with the UTBM and the MSHE (Maison des sciences de l'homme et de l'environnement Claude-Nicolas Ledoux). I stopped the development of INTEX in 2002.
Since 2002 I have been working on NooJ, which is based on what I developed for my PhD and for INTEX.
The Technology
NooJ runs on MS-Windows, Mac OS X, LINUX and BSD Unix.
NooJ processes texts and corpora (i.e. sets of text files) at the Orthographical, Lexical, Morphological, Syntactic and Semantic levels. All linguistic information (at any level) is represented by annotations that are stored in the Text Annotation Structure (TAS).
Annotations are typically added to the TAS in cascade, without destroying the original text. Annotations can describe units inside word forms (for contracted words, e.g. "cannot", and for agglutinative languages), simple forms (e.g. "table"), multiword units (e.g. "round table") as well as discontinuous expressions (e.g. "turn ... off").
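As an illustration only, the idea of annotations that point into the text without altering it can be sketched in Python. This is not NooJ's actual data format: the class names `Annotation` and `TextAnnotationStructure`, the helper `span_of`, the labels ("V", "ADV", "N") and the sample sentence are all assumptions made for this sketch. A discontinuous expression is simply an annotation with more than one span.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it points into the text instead of modifying it."""
    spans: tuple  # one or more (start, end) character offsets
    label: str    # hypothetical category label, e.g. "V" or "N"
    lemma: str


@dataclass
class TextAnnotationStructure:
    """Hypothetical container: the original text plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, spans, label, lemma):
        self.annotations.append(Annotation(tuple(spans), label, lemma))

    def covered_text(self, annotation):
        """Return the pieces of the original text covered by an annotation."""
        return [self.text[start:end] for start, end in annotation.spans]


def span_of(text, piece):
    """Offsets of the first occurrence of `piece` in `text`."""
    start = text.index(piece)
    return (start, start + len(piece))


text = "I cannot turn the light off at the round table"
tas = TextAnnotationStructure(text)

# Units inside a word form: "cannot" = "can" + "not"
tas.add([span_of(text, "can")], "V", "can")
tas.add([span_of(text, "not")], "ADV", "not")
# Simple form and multiword unit over overlapping stretches of text
tas.add([span_of(text, "table")], "N", "table")
tas.add([span_of(text, "round table")], "N", "round table")
# Discontinuous expression: two spans, one annotation
tas.add([span_of(text, "turn"), span_of(text, "off")], "V", "turn off")

for annotation in tas.annotations:
    print(annotation.lemma, annotation.label, tas.covered_text(annotation))
```

Because each annotation only stores offsets, several analyses of the same stretch of text ("table" and "round table") can coexist, and the text itself is never rewritten.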
NooJ offers the four types of grammars/machines of the Chomsky hierarchy:
NooJ contains several tools to process Finite-State machines and Regular grammars.
NooJ processes Context-Free Grammars and Push Down Automata. Note that in most cases, NooJ can "flatten out" sets of recursively embedded graphs, which removes the recursion and turns Context-Free Grammars into Regular grammars.
NooJ processes Context-Sensitive Grammars in two steps: the first step is performed by a Push Down Automaton (or even a Finite-State Machine when the grammar is flattened out), the second step is performed by computing the values of the grammar's variables and testing its constraints (in O(n)).
NooJ can perform Z. Harris's transformations in cascade, giving NooJ the power of a Turing Machine. The morphological and the syntactic engines are integrated: this makes it possible to perform morphological operations on words while performing a syntactic transformation.
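The two-step treatment of Context-Sensitive Grammars described above can be illustrated with a textbook example, the language aⁿbⁿcⁿ, which is not context-free. The following Python sketch is not NooJ code: the pattern `SHAPE`, the variable names `A`, `B`, `C` and the function `recognize` are assumptions made for the illustration. A finite-state pattern first recognizes the overall shape and stores each block in a variable; a constraint on the variables' values is then tested in linear time.

```python
import re

# Step 1: a regular (finite-state) pattern recognizes the shape a+ b+ c+
# and stores each block in a named variable.
SHAPE = re.compile(r"(?P<A>a+)(?P<B>b+)(?P<C>c+)")


def recognize(sequence):
    """Return True if `sequence` belongs to the language a^n b^n c^n (n >= 1)."""
    match = SHAPE.fullmatch(sequence)
    if match is None:
        return False
    # Step 2: test the constraint on the variables' values.
    return len(match["A"]) == len(match["B"]) == len(match["C"])


for candidate in ["abc", "aabbcc", "aabbc", "abcabc"]:
    print(candidate, recognize(candidate))
```

The first step alone would accept "aabbc"; it is the constraint check in the second step that rejects it.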
NooJ can process texts written in over 20 languages, including some Romance, Germanic, Slavic, Semitic and Asian languages, as well as Hungarian. All NooJ grammars/machines are compatible, i.e. one can insert parts of a Regular Grammar in a Context-Free Grammar, in a Context-Sensitive Grammar, and use them in a loop to simulate a Turing Machine.
NooJ dictionaries are extremely simple objects and can describe orthographical and synonymous variants, as well as inflectional and derivational forms. NooJ includes tools to check, debug, adapt, maintain, and share dictionaries and grammars.
The Book
I wrote a book that provides the theoretical and methodological framework needed to create a successful linguistic project. If you are a teacher, please contact max.silberztein@univ-fcomte.fr to get the solutions to the exercises proposed at the end of each chapter.
Errata
Page Number | In the context | Should Be |
43 | UTF uses either one 1 byte | UTF uses either one byte |
48 | The letter s is coded 19 | is coded 27 |
83 | 250,000 | 350,000 |
85 | Member of the working classis not a … | working class is not a |
117 | ..using anparser… | …using a parser… |
132 | "G3=G1 | G1= eat | eats" | "G3=G1 | G2 = eat | eats" |
134 | …the following language LGN | …the following language LNP |
149 | [MES 08] | [MES 08b] |
149 | The following grammar... | The grammar shown in Figure 6.18… |
150 | (Grammar recognizes the text “Have evening”) | (In the graph, replace node "evening" with “a nice evening”) |
162 | In Figure 7.2: Main = :NP :VG :NP. | (Replace the full stop “.” character with a semi-colon “;”) Main = :NP :VG :NP ; |
163 | (The first sentence on this page ends with “above:”) | Replace with “which is presented in Figure 7.2.” |
163 | The highlighted nodes… (One cannot see which nodes are highlighted because of the poor contrast of the print) | The NP, VG and NP nodes... |
164 | In figure 7.4. the graph VG recognizes the texts ‘be going’ and ‘be not going’ | Replace |
165 | We can also produce the sequence ‘abab’...’ | We can also produce the sequence 'aabb', |
168 | Missing reference [MOO 88] | Moore, Robert C. (May 2000). "Removing Left Recursion from Context-Free Grammars" (PDF). In 6th Applied Natural Language Processing Conference: 249–255. |
174 | Missing reference [DON 13] | Donnelly Ch. Stallman R. The Bison Manual. https://jdcqivvcr.updog.co/amRjcWl2dmNyMTg4MjExNDIzWA.pdf |
176 | …: does not mean that context-free languages are themselves ‘smaller’ than context-free languages. | … does not mean that context-free languages are themselves ‘smaller’ than context-sensitive languages. |
Thanks
I wish to express many thanks to my colleagues and students, as well as to all the INTEX users who have contributed to help enhance INTEX, and now NooJ, with their patience, criticisms, creative ideas and ambitious expectations.