Suzanne Briet (1894–1989) is recognized for her pioneering role in laying the foundations of modern information science. In her manifesto titled Qu’est-ce que la documentation? (“What is documentation?”), she defines a document as evidence in support of a fact; a physical or symbolic sign, preserved or recorded, for representing, reconstructing or demonstrating a physical or a conceptual phenomenon. Thus, a document documents something; it is a representation serving as evidence for some purpose. We can also say that the information it provides is presented in such a way that the producer of the document is able to convey this information to its users/readers in an intelligible way. This in turn means that producer and reader share the same rules for representing the information.
Compared to oral communication, the main difference is the preservation of the representation, so that the consumer of the information does not need to communicate directly and immediately with its producer. The most ancient and well-known document type is of course the written document; the introduction of writing was of such importance for mankind that historians usually consider that history starts at that moment, when knowledge of the past can be recovered from written records.
In Chap. 1 (A Brief History of Documents and Writing Systems), Henry Baird introduces us to writing systems and briefly reviews the history of written documents. Throughout the centuries, these documents have been physically produced on various materials, using many different writing or printing processes, up to the documents of this century, which are very often originally available in electronic format. Elisa Barney Smith reviews the evolution of these creation processes and equipment in Chap. 2 (Document Creation, Image Acquisition and Document Quality). She also introduces us to the numerous acquisition processes for converting physical documents into images for further processing by appropriate software tools.
When people hear about document image processing, they probably think first and foremost of optical character recognition (OCR). But just as children do not only learn to decipher characters, but progressively learn to understand complete documents, document analysis systems have also gone much further than just OCR. This evolution from characters to full and complex documents is presented by Henry Baird and Karl Tombre in Chap. 3 (The Evolution of Document Image Analysis).
A common, fundamental toolbox for all these document analysis systems is that of image processing algorithms applied to document images. Image processing as a whole would deserve a full handbook on its own, but in Chap. 4 (Imaging Techniques in Document Analysis Processes), Basilis Gatos presents an overview of the most fundamental image processing methods needed in any document processing and recognition system.
1. A Brief History of Documents and Writing Systems
Contents
Introduction
The Origins of Writing
Writing System Terminology
Reading Order and Segmentation
Types of Writing Systems
Origins of Writing Media
Punctuation
Conclusion
Cross-References
References
Further Reading
Abstract
This chapter provides a review of the history of written language, with emphasis on the origins and evolution of characteristics which have been found to affect – and in some cases continue to challenge – the automated recognition and processing of document images.
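Introduction
Gaur's richly illustrated history of writing [2] makes clear that, in addition to characters handwritten or machine-printed on flat surfaces, human communication has embraced rock paintings (by many prehistoric peoples), message sticks (Australian aborigines), beans marked with dots and lines (Incas), quipu (knotted cords of hair or cotton; Inca, China, Africa, Polynesia, etc.), wampum belts (indigenous peoples of North America), sets of cowrie shells (Yoruba of Nigeria), and strings of tally sticks (Torres Strait Islands). Although none of these have yet been read automatically by computers, it is conceivable that the document image analysis R&D community will someday attempt them.
The Origins of Writing
Gaur distinguishes between “thought writing” (which “transmits an idea directly”, e.g., “the drawing of a tree means ‘tree’”) and “sound writing” (in which speech is made visible through a conventional set of signs; Sampson calls these “glottographic” systems [7]). Parkes [6] describes the early relationship between language and writing in the West as follows:
In Antiquity the written word was regarded as a record of the spoken word, and texts were usually read aloud. But from the sixth century onward, attitudes to the written word changed: writing came to be regarded as conveying information directly to the mind through the eye, and Isidore of Seville (c. 560–636 CE) could state a preference for silent reading, which subsequently became established as the norm.
Daniels and Bright's 1996 survey of the world's writing systems lists more than 50 major families, some of which have a dozen or more sub-families. The great diversity of writing conventions (especially phonetic ones) suggests that many of them are largely arbitrary cultural inventions; this diversity certainly remains one of the most perplexing technical challenges facing automatic recognition. The key stages in the evolution of writing systems are also not fully understood, and perhaps never will be. Perhaps for this reason, present-day writing systems appear very chaotic: few broadly applicable rules are apparent. Even systems with a long historical record, notably Chinese, tend to defy fully systematic analysis. A similar pessimism, no doubt qualified by advances in modern linguistics, might be extended to the thousands of known languages, of which only a small fraction enjoy writing systems. The 2009 SIL Ethnologue [4] lists 6,909 living human languages, and it is estimated that between 7,000 and 10,000 distinct living languages exist.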
Part A Introduction, Background, Fundamentals:
Original post: https://www.cnblogs.com/2008nmj/p/12187047.html