From c3a7be5d6c2d4c5ea42e4284d9fba2158f95351e Mon Sep 17 00:00:00 2001 From: Jim Martens Date: Sun, 19 Jan 2014 13:52:21 +0100 Subject: [PATCH] Prosem: Abstract von Russel & Norvig verbessert. --- prosem/prosem-ki.bib | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/prosem/prosem-ki.bib b/prosem/prosem-ki.bib index 58bd3a3..0f1b828 100755 --- a/prosem/prosem-ki.bib +++ b/prosem/prosem-ki.bib @@ -157,7 +157,7 @@ Edition = {Third}, Series = {Prentice-Hall series in artificial intelligence}, - Abstract = {The first method to understanding natural language is syntactic analysis or parsing. The goal is to find the phrase structure of a sequence of words according to the rules of the applied grammar. A strict top-to-bottom or bottom-to-top parsing can be inefficient. Given two sentences with the same first 10 words and a difference only from the 11th word on, parsing from left-to-right would force the parser to make a guess about the nature of the sentence. But it doesn't know if it's right until the 11th word. From there it had to backtrack and reanalyze the sentence. To prevent that dynamic programming is used. Every analyzed substring gets stored for later. Once it is discovered that for example "the students in section 2 of Computer Science 101" is a noun phrase, this information can be stored in a structure known as chart. Algorithms that do such storing are called chart parsers. One of this chart parsers is a bottom-up version called CYK algorithm after its inventors John Cocke, Daniel Younger and Tadeo Kasami. This algorithm requires a grammar in the Chomsky Normal Form. The algorithm takes O(n²m) space for the P table with n being the number of words in the sentence and m the number of nonterminal symbols in the grammar. It takes O(n³m) time whereas m is constant for a particular grammar. That's why it is commonly described as O(n³). There is no faster algorithm for general context-free grammars. The CYK algorithm only co mputes the probability of the most probable tree. The subtrees are all represented in P table. PCFGs (Probabilistic context free grammars) have many rules with a probability for each one of them. Learning the grammar from data is better than a knowledge engineering approach. Learning is easiest if we are given a corpus of correctly parsed sentences; commonly known as a treebank. The best known treebank is the Penn Treebank as it consists of 3 million words which have been annotated with part of speech and parse-tree structure. Given an amount of trees, a PCFG can be created just by counting and smoothing. If no treebank is given it is still possible to learn the grammar but it is more difficult. In such a case there are actually two problems: First learning the structure of the grammar rules and second learning the probabilities associated with them. PCFGs have the problem that they are context-free. Combining a PCFG and Markov model will get the best of both. This leads ultimately to lexicalized PCFGs. But another problem of PCFGs is there preference for short sentences. Lexicalized PCFGs introduce so called head words. Such words are the most important words in a phrase and the probabilities are calculated between the head words. Example: "eat a banana" "eat" is the head of the verb phrase "eat a banana", whereas "banana" is the head of the noun phrase "a banana". Probability P1 now depends on "eat" and "banana" and the result would be very high. If the head of the noun phrase were "bandanna", the result would be significantly lower. 
The next step are definite clause grammars. They can be used to parse in a way of logical inference and makes it possible to reason about languages and strings in many different ways. Furthermore augmentations allow for distinctions in a single subphrase. For example the noun phrase (NP) depends on the subject case and the person and number of persons. A real world example would be "to smell". It is "I smell", "you smell", "we smell", "you smell" and "they smell" but "he/she/it smells". It depends on the person what version is taken. Semantic interpretation is used to give sentences a meaning. This is achieved through logical sentences. The semantics can be added to an already augmented grammar (created during the previous step), resulting in multiple augmentations at the same time. Chill is an inductive logic programming program that can learn to achieve 70% to 85% accuracy on various database query tasks. But there are several complications as English is endlessly complex. First there is the time at which things happened (present, past, future). Second you have the so called speech act which is the speaker's action that has to be deciphered by the hearer. The hearer has to find out what type of action it is (a statement, a question, an order, a warning, a promise and so on). Then there are so called long-distance dependencies and ambiguity. The ambiguity can reach from lexical ambiguity where a word has multiple usages, over syntactic ambiguity where a sentence has multiple parses up to semantic ambiguity where the meaning of and the same sentence can be different. Last there is ambiguity between literal meaning and figurative meanings. Finally there are four models that need to be combined to do disambiguation properly: the world model, the mental model, the language model and the acoustic model. -- not so much an abstract of the specific content of that section as an abstract about speech recognition in general -- The second method is speech recognition. It has the added difficulty that the words are not clearly separated and every speaker can pronounce the same sentence with the same meaning different. An example is "The train is approaching". Another written form would be "The train's approaching". Both convey the same meaning in the written language. But if a BBC, a CNN and a german news anchor speeks this sentence it will sound dramatically different. Speech recognition has to deal with that problem to get the written text associated with the spoken words. From the text the first method can than be used to analyze the words and find a meaning. Finally this meaning can be used to create some kind of action in a dialog system. -- Some problems of speech recognition are segmentation, coarticulation and homophones. Two used models are the acoustic model and the language model. Another major model is the noisy channel model, named after Claude Shannon (1948). He showed that the original message can always be recovered in a noisy channel if the original message is encoded in a redundant enough way. The acoustic model in particular is used to get to the really interesting parts. It is not interesting how words were spoken but more what words where spoken. That means that not all available information needs to be stored and a relative low sample rate is enough. 80 samples at 8kHz with a frame length of about 10 milliseconds is enough for that matter. To distinguish words so called phones are used. There are 49 phones used in English. 
A phoneme is the smallest unit of sound that has a distinct meaning to speakers of a particular language. Back to the frames: every frame is summarized by a vector of features. Features are important aspects of a speech signal. It can be compared to listening to an orchestra and saying "here the French horns are playing loudly and the violins are playing softly". Yet another difficulty are dialect variations. The language model should be learned from a corpus of transcripts of spoken language. But such a thing is more difficult than building an n-gram model of text, because it requires a hidden Markov model. All in all speech recognition is most effective when used for a specific task against a restricted set of options. A general purpose system can only work accurately if it creates one model for every speaker. Prominent examples like Apple's siri are therefore not very accurate.}, + Abstract = {The first method for understanding natural language is syntactic analysis, or parsing. The goal is to find the phrase structure of a sequence of words according to the rules of the applied grammar. Strict top-down or bottom-up parsing can be inefficient. Given two sentences with the same first 10 words and a difference only from the 11th word on, parsing from left to right would force the parser to make a guess about the nature of the sentence. But it doesn't know if the guess is right until the 11th word, and from there it would have to backtrack and reanalyze the sentence. To prevent that, dynamic programming is used: every analyzed substring gets stored for later. Once it is discovered that, for example, "the students in section 2 of Computer Science 101" is a noun phrase, this information can be stored in a structure known as a chart. Algorithms that do such storing are called chart parsers. One of these chart parsers is a bottom-up version called the CYK algorithm, after its inventors John Cocke, Daniel Younger and Tadao Kasami. This algorithm requires a grammar in Chomsky normal form. The algorithm takes O(n²m) space for the P table, with n being the number of words in the sentence and m the number of nonterminal symbols in the grammar. It takes O(n³m) time, where m is constant for a particular grammar; that is why it is commonly described as O(n³). There is no faster algorithm for general context-free grammars. The CYK algorithm only computes the probability of the most probable tree; the subtrees are all represented in the P table. PCFGs (probabilistic context-free grammars) have many rules, with a probability for each one of them. Learning the grammar from data is better than a knowledge engineering approach. Learning is easiest if we are given a corpus of correctly parsed sentences, commonly known as a treebank. The best known treebank is the Penn Treebank; it consists of 3 million words which have been annotated with part of speech and parse-tree structure. Given a collection of trees, a PCFG can be created just by counting and smoothing. If no treebank is given, it is still possible to learn the grammar, but it is more difficult. In such a case there are actually two problems: first, learning the structure of the grammar rules, and second, learning the probabilities associated with them. PCFGs have the problem that they are context-free. Combining a PCFG and a Markov model gets the best of both; this leads ultimately to lexicalized PCFGs. Another problem of PCFGs is their preference for short sentences. Lexicalized PCFGs introduce so-called head words.
Such words are the most important words in a phrase, and the probabilities are calculated between the head words. For example, in "eat a banana", "eat" is the head of the verb phrase "eat a banana", whereas "banana" is the head of the noun phrase "a banana". The probability P1 now depends on "eat" and "banana", and the result would be very high. If the head of the noun phrase were "bandanna", the result would be significantly lower. The next step is definite clause grammars. They can be used to parse by means of logical inference and make it possible to reason about languages and strings in many different ways. Furthermore, augmentations allow for distinctions within a single subphrase. For example, the noun phrase (NP) depends on the case as well as the person and number of the subject. A real-world example is "to smell": it is "I smell", "you smell", "we smell", "you smell" and "they smell", but "he/she/it smells". Which form is used depends on the person. Semantic interpretation is used to give sentences a meaning. This is achieved through logical sentences. The semantics can be added to an already augmented grammar (created during the previous step), resulting in multiple augmentations at the same time. Chill is an inductive logic programming program that can learn to achieve 70% to 85% accuracy on various database query tasks. But there are several complications, as English is endlessly complex. First, there is the time at which things happened (present, past, future). Second, there is the so-called speech act, the speaker's action that has to be deciphered by the hearer. The hearer has to find out what type of action it is (a statement, a question, an order, a warning, a promise and so on). Then there are so-called long-distance dependencies and ambiguity. Ambiguity ranges from lexical ambiguity, where a word has multiple usages, through syntactic ambiguity, where a sentence has multiple parses, to semantic ambiguity, where one and the same sentence can have different meanings. Last, there is ambiguity between literal and figurative meanings. Finally, there are four models that need to be combined to do disambiguation properly: the world model, the mental model, the language model and the acoustic model. -- not so much an abstract of the specific content of that section as an abstract about speech recognition in general -- The second method is speech recognition. It has the added difficulty that the words are not clearly separated and every speaker can pronounce the same sentence, with the same meaning, differently. An example is "The train is approaching". Another written form would be "The train's approaching". Both convey the same meaning in the written language. But if a BBC, a CNN and a German news anchor speak this sentence, it will sound dramatically different. Speech recognition has to deal with that problem in order to get the written text associated with the spoken words. From the text, the first method can then be used to analyze the words and find a meaning. Finally, this meaning can be used to create some kind of action in a dialogue system. -- Some problems of speech recognition are segmentation, coarticulation and homophones. Two models used are the acoustic model and the language model. Another major model is the noisy channel model, named after Claude Shannon (1948). He showed that the original message can always be recovered over a noisy channel if it is encoded in a redundant enough way.
The acoustic model in particular is used to get to the really interesting parts. It is not interesting how the words were spoken, but rather which words were spoken. That means that not all available information needs to be stored and a relatively low sample rate is enough: a sampling rate of 8 kHz with a frame length of about 10 milliseconds (80 samples per frame) is sufficient for that matter. To distinguish words, so-called phones are used; there are 49 phones used in English. A phoneme is the smallest unit of sound that has a distinct meaning to speakers of a particular language. Back to the frames: every frame is summarized by a vector of features. Features are important aspects of a speech signal. It can be compared to listening to an orchestra and saying "here the French horns are playing loudly and the violins are playing softly". Yet another difficulty is dialect variation. The language model should be learned from a corpus of transcripts of spoken language. But this is more difficult than building an n-gram model of text, because it requires a hidden Markov model. All in all, speech recognition is most effective when used for a specific task against a restricted set of options. A general-purpose system can only work accurately if it creates one model for every speaker. Prominent examples like Apple's Siri are therefore not very accurate.}, Bookauthor = {Russel, Stuart J. and Norvig, Peter}, Booktitle = {Artificial intelligence: A Modern Approach}, Date = {December 11},