ABSTRACT

Most research in machine translation is about having computers bear the entire load of translating one human language into another. This paper looks at the machine translation problem afresh and observes that there is a need to share the load between man and machine, to distinguish ‘reliable’ knowledge from ‘heuristics’, to provide a spectrum of outputs serving different strata of readers, and to make use of existing resources instead of reinventing the wheel.
This paper describes an architecture and design based on the fundamental premise of sharing the load, producing results that are “good enough” for the needs of the reader. The architecture differs from the conventional one in three major ways:
1. Reversal of the order of operations as compared to conventional machine translation systems
2. Introduction of interfaces that act as glue and improve the modularity of the system
3. Development of a GUI that provides the ‘right’ amount of information at the right time
The paper attempts to show that this new architecture is a better approach: it makes the machine translation process transparent to the user-cum-developer, and it leads to machine translation in stages, thus ensuring robustness.

INTRODUCTION

Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another.
At its basic level, MT performs simple substitution of words in one natural language for words in another. Using corpus techniques, more complex translations may be attempted, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies. Current machine translation software often allows for customization by domain or profession (such as weather reports) — improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal or formulaic language is used.
It follows then that machine translation of government and legal documents more readily produces usable output than conversation or less standardized text. Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has unambiguously identified which words in the text are names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators and, in a very limited number of cases, can even produce output that can be used as is (e.g., weather reports). The idea of machine translation may be traced back to the 17th century.
In 1629, René Descartes proposed a universal language, with equivalent ideas in different tongues sharing one symbol. In the 1950s, the Georgetown experiment (1954) involved fully automatic translation of over sixty Russian sentences into English. The experiment was a great success and ushered in an era of substantial funding for machine-translation research. The authors claimed that within three to five years, machine translation would be a solved problem. Real progress was much slower, however, and after the ALPAC report (1966), which found that the ten-year-long research had failed to fulfill expectations, funding was greatly reduced.
Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for machine translation. The idea of using digital computers for translation of natural languages was proposed as early as 1946 by A. D. Booth and possibly others. The Georgetown experiment was by no means the first such application; a demonstration of rudimentary translation of English into French was made in 1954 on the APEXC machine at Birkbeck College (University of London).
Several papers on the topic were published at the time, and even articles in popular journals (see, for example, Wireless World, Sept. 1955, Cleave and Zacharov). A similar application, also pioneered at Birkbeck College at the time, was reading and composing Braille texts by computer. The translation process may be stated as:
1. Decoding the meaning of the source text; and
2. Re-encoding this meaning in the target language.
Behind this ostensibly simple procedure lies a complex cognitive operation.
To decode the meaning of the source text in its entirety, the translator must interpret and analyse all the features of the text, a process that requires in-depth knowledge of the grammar, semantics, syntax, idioms, etc., of the source language, as well as the culture of its speakers. The translator needs the same in-depth knowledge to re-encode the meaning in the target language. Therein lies the challenge in machine translation: how to program a computer that will “understand” a text as a person does, and that will “create” a new text in the target language that “sounds” as if it had been written by a person.
This problem may be approached in a number of ways.

Approaches

Machine translation can use a method based on linguistic rules, which means that words are translated in a linguistic way: the most suitable words of the target language replace those in the source language. It is often argued that the success of machine translation requires the problem of natural language understanding to be solved first. Generally, rule-based methods parse a text, usually creating an intermediary, symbolic representation, from which the text in the target language is generated.
Depending on the nature of the intermediary representation, an approach is described as interlingual machine translation or transfer-based machine translation. These methods require extensive lexicons with morphological, syntactic, and semantic information, and large sets of rules. Given enough data, machine translation programs often work well enough for a native speaker of one language to get the approximate meaning of what was written by a native speaker of the other. The difficulty is getting enough data of the right kind to support the particular method.
For example, the large multilingual corpus of data needed for statistical methods to work is not necessary for grammar-based methods. Grammar-based methods, however, need a skilled linguist to carefully design the grammar that they use.

[Figure: Pyramid showing comparative depths of intermediary representation, with interlingual machine translation at the peak, followed by transfer-based, then direct translation.]

Rule-based

The rule-based machine translation paradigm includes the transfer-based, interlingual, and dictionary-based machine translation paradigms.
Transfer-based machine translation

In a rule-based machine translation system the original text is first analyzed morphologically and syntactically in order to obtain a syntactic representation. This representation can then be refined to a more abstract level, putting emphasis on the parts relevant for translation and ignoring other types of information. The transfer process then converts this final representation (still in the original language) to a representation at the same level of abstraction in the target language. These two representations are referred to as “intermediate” representations. From the target-language representation, the stages are then applied in reverse to generate the output text.
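To make the transfer pipeline concrete, here is a minimal, purely illustrative sketch of the analyze, transfer, and generate stages. The tiny lexicon, tag set, and English-to-French glosses are invented for the example; a real system would use a full morphological analyzer, parser, and transfer-rule base.

```python
# Toy transfer-based pipeline: analysis -> transfer -> generation.
# All data and rules below are invented for illustration.

def analyze(sentence):
    """Source analysis: produce a (toy) syntactic representation."""
    lexicon = {"the": "DET", "cat": "N", "sleeps": "V"}
    return [(tok, lexicon.get(tok, "UNK")) for tok in sentence.split()]

def transfer(source_repr):
    """Convert the source representation to a target-language
    representation at the same level of abstraction."""
    bilingual = {"the": "le", "cat": "chat", "sleeps": "dort"}
    return [(bilingual.get(tok, tok), tag) for tok, tag in source_repr]

def generate(target_repr):
    """Generation: apply the analysis stages in reverse to emit text."""
    return " ".join(tok for tok, _ in target_repr)

print(generate(transfer(analyze("the cat sleeps"))))  # -> le chat dort
```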
Interlingual

Interlingual machine translation is one instance of rule-based machine-translation approaches. In this approach, the source language (i.e., the text to be translated) is transformed into an interlingua, i.e., a source- and target-language-independent representation. The target language is then generated from the interlingua.

Dictionary-based

Machine translation can use a method based on dictionary entries, which means that words are translated as they are by a dictionary.
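The sketch below shows dictionary-based, word-for-word substitution; the romanized Hindi glosses are invented for illustration. Note that the naive output preserves English word order (“paanee hai thanda” rather than the natural “paanee thanda hai”), which is precisely the deficiency that later stages of the architecture described in this paper address.

```python
# Minimal dictionary-based (word-for-word) translation sketch.
# The glossary is a stand-in for a large bilingual lexicon.

glossary = {
    "water": "paanee",  # illustrative romanized Hindi glosses
    "is": "hai",
    "cold": "thanda",
}

def translate_word_for_word(text):
    # Unknown words are passed through unchanged.
    return " ".join(glossary.get(w.lower(), w) for w in text.split())

print(translate_word_for_word("Water is cold"))  # -> paanee hai thanda
```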
Statistical

Statistical machine translation tries to generate translations using statistical methods based on bilingual text corpora, such as the Canadian Hansard corpus (the English-French record of the Canadian parliament) and EUROPARL (the record of the European Parliament). Where such corpora are available, impressive results can be achieved translating texts of a similar kind, but such corpora are still very rare. The first statistical machine translation software was CANDIDE from IBM. Google used SYSTRAN for several years but switched to a statistical translation method in October 2007. Recently, Google improved its translation capabilities by feeding approximately 200 billion words of United Nations material into the system for training, and the accuracy of the translation has improved.
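At the heart of classical statistical MT is the noisy-channel decision rule: choose the target sentence e that maximizes P(e) * P(f | e), where P(e) is a language model and P(f | e) a translation model, both estimated from corpora such as those named above. The probabilities below are invented purely to show the scoring step.

```python
# Toy noisy-channel scoring: pick the candidate translation e that
# maximizes P(e) * P(f | e). All probabilities are made up.

language_model = {"the house is small": 0.4,
                  "small the house is": 0.01}
translation_model = {
    ("das haus ist klein", "the house is small"): 0.5,
    ("das haus ist klein", "small the house is"): 0.5,
}

def best_translation(f, candidates):
    return max(candidates,
               key=lambda e: language_model.get(e, 0.0)
                             * translation_model.get((f, e), 0.0))

print(best_translation("das haus ist klein",
                       ["the house is small", "small the house is"]))
# -> the house is small (the language model breaks the tie)
```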
Example-based

The example-based machine translation (EBMT) approach is often characterized by its use of a bilingual corpus as its main knowledge base at run-time. It is essentially translation by analogy and can be viewed as an implementation of the case-based reasoning approach of machine learning.
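The following sketch illustrates translation by analogy under strong simplifying assumptions: a one-sentence example base and a small glossary, both invented (with romanized Hindi). The input is matched against a stored source sentence, and the stored translation is reused with the single differing word swapped.

```python
# Toy example-based translation: adapt the closest stored example.

example_base = {"he buys a book": "vah ek kitaab khareedata hai"}
glossary = {"book": "kitaab", "pen": "kalam"}

def translate_by_analogy(sentence):
    src = sentence.split()
    for ex_src, ex_tgt in example_base.items():
        ex = ex_src.split()
        if len(ex) != len(src):
            continue
        diffs = [i for i, (a, b) in enumerate(zip(ex, src)) if a != b]
        # Reuse the stored translation if exactly one word differs
        # and both words have known glosses.
        if len(diffs) == 1:
            i = diffs[0]
            if ex[i] in glossary and src[i] in glossary:
                return ex_tgt.replace(glossary[ex[i]], glossary[src[i]])
    return None

print(translate_by_analogy("he buys a pen"))
# -> vah ek kalam khareedata hai
```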
Hybrid MT

Hybrid machine translation (HMT) leverages the strengths of statistical and rule-based translation methodologies.

ARCHITECTURE AND DESIGN OF MACHINE TRANSLATION SYSTEM

The system has two major components:
• Core engine
• User-cum-Developer Interface
The core engine is the main engine of the system. It produces the output in different layers, making the process of machine translation transparent to the user. The core engine has four components:
• Word Level Substitution
• Word Sense Disambiguation
• Preposition placement
• Hindi word order generation

Word Level Substitution

• At this level the ‘gloss’ of each source-language word in the target language is provided.
• However, polysemous words (words having more than one related meaning) create problems: when there is no one-to-one mapping, it is not practical to list all the meanings. On the other hand, the system claims ‘faithfulness’ to the original text.

Word Sense Disambiguation

English is a very rich source of systematic ambiguities: the majority of nouns in English can potentially be used as verbs. The WSD task for English can therefore be split into two classes:
1. WSD across POS
2. WSD within POS
POS taggers can help with WSD when the ambiguity is across POS. Consider two sentences:
1. He chairs the session.
2. The chairs in this room are comfortable.
POS taggers mark the words with the appropriate POS tags. These taggers use heuristic rules and hence may sometimes go wrong; however, they are still useful since they substantially reduce the search space of meanings. Disambiguating a polysemous word within a POS, on the other hand, requires dedicated disambiguation rules, and framing such rules manually would take impractically long. Is it possible to automate this process? The WASP workbench is a good example of how, with the help of a small seed data set, machines can learn from a corpus and produce disambiguation rules. The system uses the WASP workbench to semi-automatically generate the disambiguation rules. The output produced at this stage is irreversible, since the machine makes choices based on heuristics.
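As an illustration of disambiguation across POS, the sketch below runs the two “chairs” sentences through an off-the-shelf tagger. It assumes NLTK is installed and its tokenizer and tagger data packages have been downloaded; like the taggers discussed above, it is heuristic and can occasionally err.

```python
# POS tagging the two "chairs" examples with NLTK.
import nltk

for sentence in ["He chairs the session.",
                 "The chairs in this room are comfortable."]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Expected: "chairs" tagged as a verb (VBZ) in the first sentence
# and as a plural noun (NNS) in the second.
```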
Preposition placement

• English has prepositions, whereas Hindi has postpositions. Hence it is necessary to move prepositions to their proper positions in Hindi before substituting their meanings.
• While moving prepositions from their English positions to the proper Hindi positions, a record of the movements must be stored so that, should the need arise, they can be reverted to their original positions. The transformations performed by this module are therefore reversible. (A sketch of this step, together with word order generation, appears at the end of the paper.)

Hindi word order generation

• Hindi is a free-word-order language; therefore the output of the previous layer already makes sense to a Hindi reader.
• However, this output, not being in natural Hindi order, is not enjoyed as much as output in the natural order; moreover, it would not be treated as a translation. This module therefore attempts to generate the correct Hindi word order.

Interface for different linguistic tools

• Machine translation requires language resources such as POS taggers, morphological analyzers, and parsers. More than one kind of each of these tools exists; hence it is wise to reuse these existing tools rather than build new ones.

Output and User interface

• A Java-based user interface will be developed to display the outputs produced by the system.
• The user interface provides the flexibility to control the display.
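As promised in the preposition placement section, the toy sketch below illustrates the two ordering stages of the core engine: reversible preposition movement and (very crude) SOV word order generation. Preposition positions are hard-coded here rather than obtained from a parser, and the single-word-hop rule is an invented simplification; the point is only to show how recording movements keeps the preposition step reversible.

```python
# Toy reversible preposition movement and SVO -> SOV reordering.

def move_prepositions(tokens, prep_positions):
    """Move each preposition after its (one-word) object, recording
    (from_index, to_index) so the move can be undone later."""
    toks, moves = list(tokens), []
    for i in sorted(prep_positions, reverse=True):
        prep = toks.pop(i)
        j = i + 1  # invented rule: hop over a one-word object
        toks.insert(j, prep)
        moves.append((i, j))
    return toks, moves

def undo_moves(tokens, moves):
    """Revert recorded movements, restoring the original order."""
    toks = list(tokens)
    for i, j in reversed(moves):
        toks.insert(i, toks.pop(j))
    return toks

def to_sov(subject, obj, verb):
    """Crude SVO -> SOV reordering: the verb moves to the end."""
    return subject + obj + verb

moved, log = move_prepositions(["Ram", "sat", "on", "chair"], [2])
print(moved)                    # -> ['Ram', 'sat', 'chair', 'on']
print(undo_moves(moved, log))   # -> ['Ram', 'sat', 'on', 'chair']
print(to_sov(["Ram"], ["chair", "on"], ["sat"]))
# -> ['Ram', 'chair', 'on', 'sat'] (cf. romanized Hindi:
# "Ram kursee par baitha")
```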