rwdgutenberg 0.09 → 0.12

Files changed (131)
  1. data/Readme.txt +20 -7
  2. data/code/01rwdcore/01rwdcore.rb +3 -0
  3. data/code/01rwdcore/openhelpwindow.rb +1 -1
  4. data/code/01rwdcore/runopentinkerdocument.rb +1 -1
  5. data/code/01rwdcore/rwdtinkerversion.rb +1 -1
  6. data/code/superant.com.gutenberg/0uninstallapplet.rb +17 -11
  7. data/code/superant.com.gutenberg/changegutenbergname.rb +2 -2
  8. data/code/superant.com.gutenberg/clearbookscreendisplay.rb +5 -5
  9. data/code/superant.com.gutenberg/cleargutenbergfiles.rb +0 -0
  10. data/code/superant.com.gutenberg/cleargutrecordfiles.rb +0 -0
  11. data/code/superant.com.gutenberg/copyfilename.rb +2 -2
  12. data/code/superant.com.gutenberg/createnewnote.rb +13 -12
  13. data/code/superant.com.gutenberg/deletegutenbergrecord.rb +9 -8
  14. data/code/superant.com.gutenberg/gutenbergcreatefile.rb +22 -10
  15. data/code/superant.com.gutenberg/helptexthashload.rb +21 -0
  16. data/code/superant.com.gutenberg/launchurl.rb +13 -0
  17. data/code/superant.com.gutenberg/listdirectories.rb +32 -0
  18. data/code/superant.com.gutenberg/listnamerecord.rb +10 -10
  19. data/code/superant.com.gutenberg/listnotedirshtml3.rb +57 -0
  20. data/code/superant.com.gutenberg/listtextfilesgutenberg.rb +72 -47
  21. data/code/superant.com.gutenberg/loadbookrecord.rb +55 -17
  22. data/code/superant.com.gutenberg/loadconfigurationrecord.rb +4 -4
  23. data/code/superant.com.gutenberg/loadconfigurationvariables.rb +19 -9
  24. data/code/superant.com.gutenberg/loadhtmlnoterecord.rb +31 -0
  25. data/code/superant.com.gutenberg/openhelpwindowgutenberg.rb +8 -2
  26. data/code/superant.com.gutenberg/resetdir.rb +7 -0
  27. data/code/superant.com.gutenberg/runbackwindow.rb +16 -10
  28. data/code/superant.com.gutenberg/rungutenbergwindow.rb +89 -71
  29. data/code/superant.com.gutenberg/rwdgutenbergbackward.rb +27 -27
  30. data/code/superant.com.gutenberg/rwdtinkerversion.rb +10 -10
  31. data/code/superant.com.gutenberg/saveconfigurationrecord.rb +4 -4
  32. data/code/superant.com.gutenberg/savegutenbergrecord.rb +13 -11
  33. data/code/superant.com.gutenberg/updir.rb +7 -0
  34. data/code/superant.com.rwdtinkerbackwindow/initiateapplets.rb +110 -108
  35. data/code/superant.com.rwdtinkerbackwindow/installgemapplet.rb +10 -8
  36. data/code/superant.com.rwdtinkerbackwindow/listzips.rb +8 -2
  37. data/code/superant.com.rwdtinkerbackwindow/removeappletvariables.rb +6 -6
  38. data/code/superant.com.rwdtinkerbackwindow/viewappletcontents.rb +1 -1
  39. data/code/superant.com.rwdtinkerbackwindow/viewgemappletcontents.rb +1 -1
  40. data/code/superant.com.rwdtinkerbackwindow/viewlogfile.rb +13 -0
  41. data/configuration/rwdtinker.dist +4 -8
  42. data/configuration/rwdwgutenberg.dist +23 -0
  43. data/configuration/tinkerwin2variables.dist +17 -7
  44. data/gui/00coreguibegin/applicationguitop.rwd +1 -1
  45. data/gui/frontwindow0/{viewlogo/cc0openphoto.rwd → cc0openphoto.rwd} +0 -0
  46. data/gui/{frontwindowselectionbegin/selectiontabbegin → frontwindowselections}/00selectiontabbegin.rwd +0 -0
  47. data/gui/frontwindowselections/jumplinkcommands.rwd +15 -0
  48. data/gui/{frontwindowselectionzend/viewselectionzend → frontwindowselections}/wwselectionend.rwd +0 -0
  49. data/gui/{frontwindowselectionzend/viewselectionzend/zzdocumentbegin.rwd → frontwindowtdocuments/00documentbegin.rwd} +0 -0
  50. data/gui/frontwindowtdocuments/{superant.com.documents/tinkerdocuments.rwd → tinkerdocuments.rwd} +0 -0
  51. data/gui/{helpaboutbegin/superant.com.helpaboutbegin → frontwindowtdocuments}/zzdocumentend.rwd +0 -0
  52. data/gui/helpaboutbegin/{superant.com.helpaboutbegin/zzzrwdlasttab.rwd → zzzrwdlasttab.rwd} +0 -0
  53. data/gui/helpaboutbegin/{superant.com.helpaboutbegin/zzzzhelpscreenstart.rwd → zzzzhelpscreenstart.rwd} +0 -0
  54. data/gui/{helpaboutinstalled/superant.com.tinkerhelpabout/helpabouttab.rwd → helpaboutbegin/zzzzzzhelpabouttab.rwd} +0 -0
  55. data/gui/helpaboutzend/{superant.com.helpaboutend/helpscreenend.rwd → helpscreenend.rwd} +0 -0
  56. data/gui/helpaboutzend/{superant.com.helpaboutend/zhelpscreenstart2.rwd → zhelpscreenstart2.rwd} +0 -0
  57. data/gui/helpaboutzend/{superant.com.helpaboutend/zzzzhelpabout2.rwd → zzzzhelpabout2.rwd} +0 -0
  58. data/gui/helpaboutzend/{superant.com.helpaboutend/zzzzhelpscreen2end.rwd → zzzzhelpscreen2end.rwd} +0 -0
  59. data/gui/tinkerbackwindows/superant.com.backgutenberg/10appletbegin.rwd +4 -0
  60. data/gui/tinkerbackwindows/{superant.com.gutenberg → superant.com.backgutenberg}/1tabfirst.rwd +0 -0
  61. data/gui/tinkerbackwindows/{superant.com.gutenberg → superant.com.backgutenberg}/20listfiles.rwd +5 -4
  62. data/gui/tinkerbackwindows/{superant.com.gutenberg → superant.com.backgutenberg}/30booklistutilities.rwd +0 -0
  63. data/gui/tinkerbackwindows/superant.com.backgutenberg/35displaytab.rwd +26 -0
  64. data/gui/tinkerbackwindows/{superant.com.gutenberg → superant.com.backgutenberg}/67viewconfiguration.rwd +0 -0
  65. data/gui/{frontwindowselections/superant.com.rwdtinkerwin2selectiontab/jumplinkcommands.rwd → tinkerbackwindows/superant.com.backgutenberg/81jumplinkcommands.rwd} +2 -0
  66. data/gui/tinkerbackwindows/superant.com.backgutenberg/9end.rwd +6 -0
  67. data/gui/tinkerbackwindows/superant.com.gutenberg/10htmlnote.rwd +46 -0
  68. data/gui/tinkerbackwindows/superant.com.gutenberg/12tabfirst.rwd +39 -0
  69. data/gui/tinkerbackwindows/superant.com.gutenberg/35displaytab.rwd +4 -1
  70. data/gui/tinkerbackwindows/superant.com.gutenberg/50listfiles.rwd +37 -0
  71. data/gui/tinkerbackwindows/superant.com.gutenberg/81jumplinkcommands.rwd +1 -1
  72. data/gui/tinkerbackwindows/superant.com.tinkerbackwindow/75rwdlogfile.rwd +20 -0
  73. data/gui/tinkerbackwindows/superant.com.tinkerbackwindow/81jumplinkcommands.rwd +1 -1
  74. data/gui/zzcoreguiend/{tinkerapplicationguiend/yy9rwdend.rwd → yy9rwdend.rwd} +0 -0
  75. data/init.rb +15 -10
  76. data/installed/gutenbergdata02.inf +2 -2
  77. data/installed/{rwdwgutenberg-0.09.inf → rwdwgutenberg.inf} +3 -2
  78. data/lang/en/rwdcore/languagefile.rb +4 -3
  79. data/lang/es/rwdcore/languagefile-es.rb +1 -0
  80. data/lang/fr/rwdcore/languagefile.rb +1 -0
  81. data/lang/jp/rwdcore/languagefile.rb +1 -0
  82. data/lang/nl/rwdcore/languagefile.rb +1 -0
  83. data/{extras → lib}/rconftool.rb +13 -6
  84. data/{ev → lib/rwd}/browser.rb +2 -2
  85. data/{ev → lib/rwd}/ftools.rb +0 -0
  86. data/{ev → lib/rwd}/mime.rb +0 -0
  87. data/{ev → lib/rwd}/net.rb +18 -7
  88. data/{ev → lib/rwd}/ruby.rb +1 -1
  89. data/{ev → lib/rwd}/rwd.rb +108 -625
  90. data/{ev → lib/rwd}/sgml.rb +1 -1
  91. data/{ev → lib/rwd}/thread.rb +1 -1
  92. data/{ev → lib/rwd}/tree.rb +2 -2
  93. data/{ev → lib/rwd}/xml.rb +1 -1
  94. data/lib/rwdthemes/default.rwd +317 -0
  95. data/lib/rwdthemes/pda.rwd +72 -0
  96. data/lib/rwdthemes/windowslike.rwd +171 -0
  97. data/lib/rwdtinker/rwdtinkertools.rb +24 -0
  98. data/{extras → lib}/zip/ioextras.rb +0 -0
  99. data/{extras → lib}/zip/stdrubyext.rb +0 -0
  100. data/{extras → lib}/zip/tempfile_bugfixed.rb +0 -0
  101. data/{extras → lib}/zip/zip.rb +2 -2
  102. data/{extras → lib}/zip/zipfilesystem.rb +0 -0
  103. data/{extras → lib}/zip/ziprequire.rb +0 -0
  104. data/rwd_files/Books/marip10.lnk +6 -0
  105. data/{Books → rwd_files/Books}/marip10.txt +0 -0
  106. data/{Books → rwd_files/Books}/shannon1948.html +0 -0
  107. data/{Books/Shannon.gut → rwd_files/Books/shannon1948.lnk} +1 -1
  108. data/rwd_files/Books/shannon1948.txt +2667 -0
  109. data/rwd_files/HowTo_Gutenberg.txt +21 -1
  110. data/rwd_files/HowTo_Tinker.txt +58 -1
  111. data/rwd_files/log/rwdtinker.log +2082 -0
  112. data/{code/superant.com.gutenberg/helptexthashrwdgutenberg.rb → rwd_files/rwdgutenberghelpfiles.txt} +26 -19
  113. data/rwdconfig.dist +14 -13
  114. data/tests/makedist-rwdwgutenberg.rb +9 -7
  115. data/tests/makedist.rb +2 -2
  116. data/zips/rwdwcalc-0.63.zip +0 -0
  117. data/zips/rwdwfoldeditor-0.05.zip +0 -0
  118. data/zips/rwdwgutenberg-0.12.zip +0 -0
  119. data/zips/rwdwruby-1.08.zip +0 -0
  120. data/zips/wrubyslippers-1.07.zip +0 -0
  121. metadata +74 -59
  122. data/Books/Mariposa.gut +0 -6
  123. data/code/superant.com.gutenberg/rwdhypernotehelpabout.rb +0 -14
  124. data/code/superant.com.rwdtinkerbackwindow/installapplet.rb +0 -27
  125. data/configuration/language.dist +0 -8
  126. data/configuration/rwdapplicationidentity.dist +0 -3
  127. data/configuration/rwdwgutenberg-0.09.dist +0 -20
  128. data/gui/tinkerbackwindows/superant.com.gutenberg/36displaytab.rwd +0 -15
  129. data/gui/tinkerbackwindows/superant.com.gutenberg/40rwdgutenberg.rwd +0 -16
  130. data/gui/tinkerbackwindows/superant.com.gutenberg/40rwdgutenberghtml.rwd +0 -16
  131. data/lib/temp.rb +0 -1
@@ -0,0 +1,24 @@
+
+
+ module RwdtinkerTools
+
+ # tools to use in rwdtinker
+
+ def RwdtinkerTools.tail(filename, lines=12)
+
+
+ begin
+ tmpFile = File.open(filename, 'r')
+
+ return tmpFile.readlines.reverse!.slice(0,lines)
+
+ tmpFile.close
+ rescue
+ return "error in opening log"
+ $rwdtinkerlog.error "RwdtinkerTools.tail: file open error"
+ end
+ end
+
+ end
+
+
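As committed, this helper has two unreachable statements: tmpFile.close sits after the return, and the $rwdtinkerlog.error call sits after the return in the rescue branch. A minimal corrected sketch with the same interface (it assumes $rwdtinkerlog is the logger rwdtinker sets up elsewhere):

    module RwdtinkerTools
      # Return the last `lines` lines of the file, newest first.
      # Sketch only -- same names as the helper above, with the file closed
      # via the block form of File.open and the error actually logged.
      def RwdtinkerTools.tail(filename, lines = 12)
        File.open(filename, 'r') do |f|
          return f.readlines.reverse!.slice(0, lines)
        end
      rescue StandardError
        # assumes rwdtinker's global logger; guard in case it is not set up
        $rwdtinkerlog.error "RwdtinkerTools.tail: file open error" if $rwdtinkerlog
        "error in opening log"
      end
    end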
@@ -4,8 +4,8 @@ require 'singleton'
  require 'tempfile'
  require 'ftools'
  require 'zlib'
- require 'extras/zip/stdrubyext'
- require 'extras/zip/ioextras'
+ require 'lib/zip/stdrubyext'
+ require 'lib/zip/ioextras'
 
  if Tempfile.superclass == SimpleDelegator
  require 'zip/tempfile_bugfixed'
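The only change in this hunk is the require prefix, following the extras/ → lib/ move shown in the file list. These are still plain load-path requires, so they resolve only if the directory containing lib/ is on $LOAD_PATH; a sketch of that assumption (the path handling here is illustrative, not code from the gem):

    # Illustrative only: put the directory that holds lib/ on the load path
    # so the renamed requires resolve; the gem's own init.rb may do this differently.
    data_dir = File.expand_path(File.dirname(__FILE__))
    $LOAD_PATH.unshift(data_dir) unless $LOAD_PATH.include?(data_dir)

    require 'lib/zip/stdrubyext'
    require 'lib/zip/ioextras'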
@@ -0,0 +1,6 @@
+ rwd_files/Books/marip10.txt
+ #Their Mariposa Legend
+
+
+
+
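This .lnk record, like the shannon1948.lnk record further down, is a short pointer file: line 1 is the path of the book file under rwd_files/Books/, line 2 is a '#'-prefixed display title, and the remaining lines are blank. The diff does not show how the applet reads these, so the reader below is only a hypothetical sketch of that two-line layout:

    # Hypothetical reader for the .lnk records shown in these hunks:
    #   line 1 -> path of the book text/HTML file
    #   line 2 -> "#<display title>"
    # Method name and return shape are illustrative, not taken from the applet code.
    def read_book_link(lnk_path)
      lines = File.readlines(lnk_path).map { |l| l.strip }
      { 'file' => lines[0], 'title' => lines[1].to_s.sub(/\A#/, '') }
    end

    record = read_book_link('rwd_files/Books/marip10.lnk')
    # e.g. { "file" => "rwd_files/Books/marip10.txt", "title" => "Their Mariposa Legend" }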
@@ -1,4 +1,4 @@
- Books/shannon1948.html
+ rwd_files/Books/shannon1948.txt
  #Theory of Communications
 
 
@@ -0,0 +1,2667 @@
1
+ Reprinted with corrections from The Bell System Technical Journal,
2
+ Vol. 27, pp. 379�423, 623�656, July, October, 1948.
3
+ A Mathematical Theory of Communication
4
+ By C. E. SHANNON
5
+ INTRODUCTION The recent development of various methods of modulation such as
6
+ PCM and PPM which exchange bandwidth for signal-to-noise ratio has intensified
7
+ the interest in a general theory of communication. A T basis for such a theory
8
+ is contained in the important papers of Nyquist1 and Hartley2 on this subject.
9
+ In thepresent paper we will extend the theory to include a number of new
10
+ factors, in particular the effect of noisein the channel, and the savings
11
+ possible due to the statistical structure of the original message and due to
12
+ thenature of the final destination of the information. The fundamental problem
13
+ of communication is that of reproducing at one point either exactly or ap-
14
+ proximately a message selected at another point. Frequently the messages have
15
+ meaning; that is they referto or are correlated according to some system with
16
+ certain physical or conceptual entities. These semanticaspects of communication
17
+ are irrelevant to the engineering problem. The significant aspect is that the
18
+ actualmessage is one selected from a setof possible messages. The system must
19
+ be designed to operate for eachpossible selection, not just the one which will
20
+ actually be chosen since this is unknown at the time of design. If the number
21
+ of messages in the set is finite then this number or any monotonic function of
22
+ this number can be regarded as a measure of the information produced when one
23
+ message is chosen from the set, allchoices being equally likely. As was pointed
24
+ out by Hartley the most natural choice is the logarithmicfunction. Although
25
+ this definition must be generalized considerably when we consider the influence
26
+ of thestatistics of the message and when we have a continuous range of
27
+ messages, we will in all cases use anessentially logarithmic measure. The
28
+ logarithmic measure is more convenient for various reasons: 1. It is
29
+ practically more useful. Parameters of engineering importance such as time,
30
+ bandwidth, number of relays, etc., tend to vary linearly with the logarithm of
31
+ the number of possibilities. For example,adding one relay to a group doubles
32
+ the number of possible states of the relays. It adds 1 to the base 2logarithm
33
+ of this number. Doubling the time roughly squares the number of possible
34
+ messages, ordoubles the logarithm, etc. 2. It is nearer to our intuitive
35
+ feeling as to the proper measure. This is closely related to (1) since we in-
36
+ tuitively measures entities by linear comparison with common standards. One
37
+ feels, for example, thattwo punched cards should have twice the capacity of one
38
+ for information storage, and two identicalchannels twice the capacity of one
39
+ for transmitting information. 3. It is mathematically more suitable. Many of
40
+ the limiting operations are simple in terms of the loga- rithm but would
41
+ require clumsy restatement in terms of the number of possibilities. The choice
42
+ of a logarithmic base corresponds to the choice of a unit for measuring
43
+ information. If the base 2 is used the resulting units may be called binary
44
+ digits, or more briefly bits,a word suggested byJ. W. Tukey. A device with two
45
+ stable positions, such as a relay or a flip-flop circuit, can store one bit
46
+ ofinformation. Nsuch devices can store Nbits, since the total number of
47
+ possible states is 2Nand log2 2N N. = If the base 10 is used the units may be
48
+ called decimal digits. Since log2 M log log = 10 M= 10 2 3 32 log = : 10 M;
49
+
50
+ 1Nyquist, H., "Certain Factors Affecting Telegraph Speed," Bell System
51
+ Technical Journal,April 1924, p. 324;
52
+
53
+ "Certain Topics in Telegraph Transmission Theory," A.I.E.E. Trans.,v. 47, April
54
+ 1928, p. 617. 2Hartley, R. V. L., "Transmission of Information," Bell System
55
+ Technical Journal,July 1928, p. 535. 1
56
+ ===============================================================================
57
+ INFORMATION SOURCE TRANSMITTER RECEIVER DESTINATION SIGNAL RECEIVED SIGNAL
58
+ MESSAGE MESSAGE NOISE SOURCE Fig. 1 -- Schematic diagram of a general
59
+ communication system. a decimal digit is about 3 1 bits. A digit wheel on a
60
+ desk computing machine has ten stable positions and 3 therefore has a storage
61
+ capacity of one decimal digit. In analytical work where integration and
62
+ differentiationare involved the base eis sometimes useful. The resulting units
63
+ of information will be called natural units.Change from the base ato base
64
+ bmerely requires multiplication by logb a. By a communication system we will
65
+ mean a system of the type indicated schematically in Fig. 1. It consists of
66
+ essentially five parts: 1. An information sourcewhich produces a message or
67
+ sequence of messages to be communicated to the receiving terminal. The message
68
+ may be of various types: (a) A sequence of letters as in a telegraphof teletype
69
+ system;
70
+
71
+ (b) A single function of time f tas in radio or telephony;
72
+
73
+ (c) A function of time and other variables as in black and white television -
74
+ - here the message may be thought of as afunction f x y tof two space
75
+ coordinates and time, the light intensity at point x yand time ton a ;
76
+
77
+ ;
78
+
79
+ ;
80
+
81
+ pickup tube plate;
82
+
83
+ (d) Two or more functions of time, say f t, g t, h t-- this is the case in
84
+ "three- dimensional" sound transmission or if the system is intended to service
85
+ several individual channels inmultiplex;
86
+
87
+ (e) Several functions of several variables -- in color television the message
88
+ consists of threefunctions f x y t, g x y t, h x y tdefined in a three-
89
+ dimensional continuum -- we may also think ;
90
+
91
+ ;
92
+
93
+ ;
94
+
95
+ ;
96
+
97
+ ;
98
+
99
+ ;
100
+
101
+ of these three functions as components of a vector field defined in the region
102
+ -- similarly, severalblack and white television sources would produce
103
+ "messages" consisting of a number of functionsof three variables;
104
+
105
+ (f) Various combinations also occur, for example in television with an
106
+ associatedaudio channel. 2. A transmitterwhich operates on the message in some
107
+ way to produce a signal suitable for trans- mission over the channel. In
108
+ telephony this operation consists merely of changing sound pressureinto a
109
+ proportional electrical current. In telegraphy we have an encoding operation
110
+ which producesa sequence of dots, dashes and spaces on the channel
111
+ corresponding to the message. In a multiplexPCM system the different speech
112
+ functions must be sampled, compressed, quantized and encoded,and finally
113
+ interleaved properly to construct the signal. Vocoder systems, television and
114
+ frequencymodulation are other examples of complex operations applied to the
115
+ message to obtain the signal. 3. The channelis merely the medium used to
116
+ transmit the signal from transmitter to receiver. It may be a pair of wires, a
117
+ coaxial cable, a band of radio frequencies, a beam of light, etc. 4. The
118
+ receiverordinarily performs the inverse operation of that done by the
119
+ transmitter, reconstructing the message from the signal. 5. The destinationis
120
+ the person (or thing) for whom the message is intended. We wish to consider
121
+ certain general problems involving communication systems. To do this it is
122
+ first necessary to represent the various elements involved as mathematical
123
+ entities, suitably idealized from their 2
124
+ ===============================================================================
125
+ physical counterparts. We may roughly classify communication systems into three
126
+ main categories: discrete,continuous and mixed. By a discrete system we will
127
+ mean one in which both the message and the signalare a sequence of discrete
128
+ symbols. A typical case is telegraphy where the message is a sequence of
129
+ lettersand the signal a sequence of dots, dashes and spaces. A continuous
130
+ system is one in which the message andsignal are both treated as continuous
131
+ functions, e.g., radio or television. A mixed system is one in whichboth
132
+ discrete and continuous variables appear, e.g., PCM transmission of speech. We
133
+ first consider the discrete case. This case has applications not only in
134
+ communication theory, but also in the theory of computing machines, the design
135
+ of telephone exchanges and other fields. In additionthe discrete case forms a
136
+ foundation for the continuous and mixed cases which will be treated in the
137
+ secondhalf of the paper. PART I: DISCRETE NOISELESS SYSTEMS 1. THE DISCRETE
138
+ NOISELESS CHANNEL Teletype and telegraphy are two simple examples of a discrete
139
+ channel for transmitting information. Gen-erally, a discrete channel will mean
140
+ a system whereby a sequence of choices from a finite set of elementarysymbols
141
+ S1 Sncan be transmitted from one point to another. Each of the symbols Siis
142
+ assumed to have ;
143
+
144
+ : : : ;
145
+
146
+ a certain duration in time tiseconds (not necessarily the same for different
147
+ Si, for example the dots anddashes in telegraphy). It is not required that all
148
+ possible sequences of the Sibe capable of transmission onthe system;
149
+
150
+ certain sequences only may be allowed. These will be possible signals for the
151
+ channel. Thusin telegraphy suppose the symbols are: (1) A dot, consisting of
152
+ line closure for a unit of time and then lineopen for a unit of time;
153
+
154
+ (2) A dash, consisting of three time units of closure and one unit open;
155
+
156
+ (3) A letterspace consisting of, say, three units of line open;
157
+
158
+ (4) A word space of six units of line open. We might placethe restriction on
159
+ allowable sequences that no spaces follow each other (for if two letter spaces
160
+ are adjacent,it is identical with a word space). The question we now consider
161
+ is how one can measure the capacity ofsuch a channel to transmit information.
162
+ In the teletype case where all symbols are of the same duration, and any
163
+ sequence of the 32 symbols is allowed the answer is easy. Each symbol
164
+ represents five bits of information. If the system transmits nsymbols per
165
+ second it is natural to say that the channel has a capacity of 5nbits per
166
+ second. This does notmean that the teletype channel will always be transmitting
167
+ information at this rate -- this is the maximumpossible rate and whether or not
168
+ the actual rate reaches this maximum depends on the source of informationwhich
169
+ feeds the channel, as will appear later. In the more general case with
170
+ different lengths of symbols and constraints on the allowed sequences, we make
171
+ the following definition:Definition: The capacity Cof a discrete channel is
172
+ given by log N T C Lim = T T ! where N Tis the number of allowed signals of
173
+ duration T. It is easily seen that in the teletype case this reduces to the
174
+ previous result. It can be shown that the limit in question will exist as a
175
+ finite number in most cases of interest. Suppose all sequences of the symbolsS1
176
+ Snare allowed and these symbols have durations t1 tn. What is the channel
177
+ capacity? If N t ;
178
+
179
+ : : : ;
180
+
181
+ ;
182
+
183
+ : : : ;
184
+
185
+ represents the number of sequences of duration twe have N t N t t1 N t t2 N t
186
+ tn = , + , + + , : The total number is equal to the sum of the numbers of
187
+ sequences ending in S1 S2 Snand these are ;
188
+
189
+ ;
190
+
191
+ : : : ;
192
+
193
+ N t t1 N t t2 N t tn, respectively. According to a well-known result in finite
194
+ differences, N t , ;
195
+
196
+ , ;
197
+
198
+ : : : ;
199
+
200
+ , is then asymptotic for large tto Xtwhere X 0 0 is the largest real solution
201
+ of the characteristic equation: X t t tn , 1 X, 2 X, 1 + + + = 3
202
+ ===============================================================================
203
+ and therefore C log X0 = : In case there are restrictions on allowed sequences
204
+ we may still often obtain a difference equation of this type and find Cfrom the
205
+ characteristic equation. In the telegraphy case mentioned above N t N t 2 N t 4
206
+ N t 5 N t 7 N t 8 N t 10 = , + , + , + , + , + , as we see by counting
207
+ sequences of symbols according to the last or next to the last symbol
208
+ occurring.Hence Cis
209
+ log
210
+ 2
211
+ 4
212
+ 5
213
+ 7
214
+ 8
215
+ 10
216
+ 0 where 0 is the positive root of 1 . Solving this we find , = + + + + + C 0
217
+ 539. = : A very general type of restriction which may be placed on allowed
218
+ sequences is the following: We imagine a number of possible states a1 a2 am.
219
+ For each state only certain symbols from the set S1 Sn ;
220
+
221
+ ;
222
+
223
+ : : : ;
224
+
225
+ ;
226
+
227
+ : : : ;
228
+
229
+ can be transmitted (different subsets for the different states). When one of
230
+ these has been transmitted thestate changes to a new state depending both on
231
+ the old state and the particular symbol transmitted. Thetelegraph case is a
232
+ simple example of this. There are two states depending on whether or not a
233
+ space wasthe last symbol transmitted. If so, then only a dot or a dash can be
234
+ sent next and the state always changes.If not, any symbol can be transmitted
235
+ and the state changes if a space is sent, otherwise it remains the same.The
236
+ conditions can be indicated in a linear graph as shown in Fig. 2. The junction
237
+ points correspond to the DASH DOT DOT LETTER SPACE DASH WORD SPACE Fig. 2 -
238
+ - Graphical representation of the constraints on telegraph symbols. states and
239
+ the lines indicate the symbols possible in a state and the resulting state. In
240
+ Appendix 1 it is shownthat if the conditions on allowed sequences can be
241
+ described in this form Cwill exist and can be calculatedin accordance with the
242
+ following result: s Theorem 1:Let b be the duration of the sth symbol which is
243
+ allowable in state iand leads to state j. i j Then the channel capacity Cis
244
+ equal to logWwhere Wis the largest real root of the determinant equation: s W b
245
+ , i j i j 0 , = s where i j 1 if i jand is zero otherwise. = = For example, in
246
+ the telegraph case (Fig. 2) the determinant is: 1 W2 4 , W, , + 0 W3 6 2 4 = :
247
+ , W, W, W, 1 + + , On expansion this leads to the equation given above for this
248
+ case. 2. THE DISCRETE SOURCE OF INFORMATION We have seen that under very
249
+ general conditions the logarithm of the number of possible signals in a
250
+ discretechannel increases linearly with time. The capacity to transmit
251
+ information can be specified by giving thisrate of increase, the number of bits
252
+ per second required to specify the particular signal used. We now consider the
253
+ information source. How is an information source to be described
254
+ mathematically, and how much information in bits per second is produced in a
255
+ given source? The main point at issue is theeffect of statistical knowledge
256
+ about the source in reducing the required capacity of the channel, by the use 4
257
+ ===============================================================================
258
+ of proper encoding of the information. In telegraphy, for example, the messages
259
+ to be transmitted consist ofsequences of letters. These sequences, however, are
260
+ not completely random. In general, they form sentencesand have the statistical
261
+ structure of, say, English. The letter E occurs more frequently than Q, the
262
+ sequenceTH more frequently than XP, etc. The existence of this structure allows
263
+ one to make a saving in time (orchannel capacity) by properly encoding the
264
+ message sequences into signal sequences. This is already doneto a limited
265
+ extent in telegraphy by using the shortest channel symbol, a dot, for the most
266
+ common Englishletter E;
267
+
268
+ while the infrequent letters, Q, X, Z are represented by longer sequences of
269
+ dots and dashes. Thisidea is carried still further in certain commercial codes
270
+ where common words and phrases are representedby four- or five-letter code
271
+ groups with a considerable saving in average time. The standardized greetingand
272
+ anniversary telegrams now in use extend this to the point of encoding a
273
+ sentence or two into a relativelyshort sequence of numbers. We can think of a
274
+ discrete source as generating the message, symbol by symbol. It will choose
275
+ succes- sive symbols according to certain probabilities depending, in general,
276
+ on preceding choices as well as theparticular symbols in question. A physical
277
+ system, or a mathematical model of a system which producessuch a sequence of
278
+ symbols governed by a set of probabilities, is known as a stochastic process.3
279
+ We mayconsider a discrete source, therefore, to be represented by a stochastic
280
+ process. Conversely, any stochasticprocess which produces a discrete sequence
281
+ of symbols chosen from a finite set may be considered a discretesource. This
282
+ will include such cases as: 1. Natural written languages such as English,
283
+ German, Chinese. 2. Continuous information sources that have been rendered
284
+ discrete by some quantizing process. For example, the quantized speech from a
285
+ PCM transmitter, or a quantized television signal. 3. Mathematical cases where
286
+ we merely define abstractly a stochastic process which generates a se- quence
287
+ of symbols. The following are examples of this last type of source. (A) Suppose
288
+ we have five letters A, B, C, D, E which are chosen each with probability .2,
289
+ successive choices being independent. This would lead to a sequence of which
290
+ the following is a typicalexample. B D C B C E C C C A D C B D D A A E C E E AA
291
+ B B D A E E C A C E E B A E E C B C E A D. This was constructed with the use of
292
+ a table of random numbers.4 (B) Using the same five letters let the
293
+ probabilities be .4, .1, .2, .2, .1, respectively, with successive choices
294
+ independent. A typical message from this source is then: A A A C D C B D C E A
295
+ A D A D A C E D AE A D C A B E D A D D C E C A A A A A D. (C) A more
296
+ complicated structure is obtained if successive symbols are not chosen
297
+ independently but their probabilities depend on preceding letters. In the
298
+ simplest case of this type a choicedepends only on the preceding letter and not
299
+ on ones before that. The statistical structure canthen be described by a set of
300
+ transition probabilities pi j, the probability that letter iis followed by
301
+ letter j. The indices iand jrange over all the possible symbols. A second
302
+ equivalent way ofspecifying the structure is to give the "digram" probabilities
303
+ p i j, i.e., the relative frequency of ;
304
+
305
+ the digram i j. The letter frequencies p i, (the probability of letter i), the
306
+ transition probabilities 3See, for example, S. Chandrasekhar, "Stochastic
307
+ Problems in Physics and Astronomy," Reviews of Modern Physics, v. 15, No. 1,
308
+ January 1943, p. 1. 4Kendall and Smith, Tables of Random Sampling
309
+ Numbers,Cambridge, 1939. 5
310
+ ===============================================================================
311
+ pi jand the digram probabilities p i jare related by the following formulas: ;
312
+
313
+ p i p i jp j ip j pj i = ;
314
+
315
+ = ;
316
+
317
+ = j j j p i j p i pi j ;
318
+
319
+ = pi jp ip i j1 = = ;
320
+
321
+ = : j i i j ;
322
+
323
+ As a specific example suppose there are three letters A, B, C with the
324
+ probability tables: pi j j i p i p i j j ;
325
+
326
+ A B C A B C A 0 4 1 A 9 A 0 4 1 5 5 27 15 15 i B 1 1 0 B 16 i B 8 8 0 2 2 27 27
327
+ 27 C 1 2 1 C 2 C 1 4 1 2 5 10 27 27 135 135 A typical message from this source
328
+ is the following: A B B A B A B A B A B A B A B B B A B B B B B A B A B A B A B
329
+ A B B B A C A C A BB A B B B B A B B A B A C B B B A B A. The next increase in
330
+ complexity would involve trigram frequencies but no more. The choice ofa letter
331
+ would depend on the preceding two letters but not on the message before that
332
+ point. Aset of trigram frequencies p i j kor equivalently a set of transition
333
+ probabilities pi j kwould ;
334
+
335
+ ;
336
+
337
+ be required. Continuing in this way one obtains successively more complicated
338
+ stochastic pro-cesses. In the general n-gram case a set of n-gram probabilities
339
+ p i1 i2 inor of transition ;
340
+
341
+ ;
342
+
343
+ : : : ;
344
+
345
+ probabilities pi i is required to specify the statistical structure. 1 i i n ;
346
+
347
+ 2;
348
+
349
+ :::;
350
+
351
+ n1 , (D) Stochastic processes can also be defined which produce a text
352
+ consisting of a sequence of "words." Suppose there are five letters A, B, C, D,
353
+ E and 16 "words" in the language withassociated probabilities: .10 A .16 BEBE
354
+ .11 CABED .04 DEB .04 ADEB .04 BED .05 CEED .15 DEED .05 ADEE .02 BEED .08 DAB
355
+ .01 EAB .01 BADD .05 CA .04 DAD .05 EE Suppose successive "words" are chosen
356
+ independently and are separated by a space. A typicalmessage might be: DAB EE A
357
+ BEBE DEED DEB ADEE ADEE EE DEB BEBE BEBE BEBE ADEE BED DEEDDEED CEED ADEE A
358
+ DEED DEED BEBE CABED BEBE BED DAB DEED ADEB. If all the words are of finite
359
+ length this process is equivalent to one of the preceding type, butthe
360
+ description may be simpler in terms of the word structure and probabilities. We
361
+ may alsogeneralize here and introduce transition probabilities between words,
362
+ etc. These artificial languages are useful in constructing simple problems and
363
+ examples to illustrate vari- ous possibilities. We can also approximate to a
364
+ natural language by means of a series of simple artificiallanguages. The zero-
365
+ order approximation is obtained by choosing all letters with the same
366
+ probability andindependently. The first-order approximation is obtained by
367
+ choosing successive letters independently buteach letter having the same
368
+ probability that it has in the natural language.5 Thus, in the first-order ap-
369
+ proximation to English, E is chosen with probability .12 (its frequency in
370
+ normal English) and W withprobability .02, but there is no influence between
371
+ adjacent letters and no tendency to form the preferred 5Letter, digram and
372
+ trigram frequencies are given in Secret and Urgentby Fletcher Pratt, Blue
373
+ Ribbon Books, 1939. Word frequen- cies are tabulated in Relative Frequency of
374
+ English Speech Sounds,G. Dewey, Harvard University Press, 1923. 6
375
+ ===============================================================================
376
+ digrams such as TH, ED, etc. In the second-order approximation, digram
377
+ structure is introduced. After aletter is chosen, the next one is chosen in
378
+ accordance with the frequencies with which the various lettersfollow the first
379
+ one. This requires a table of digram frequencies pi j. In the third-order
380
+ approximation, trigram structure is introduced. Each letter is chosen with
381
+ probabilities which depend on the preceding twoletters. 3. THE SERIES OF
382
+ APPROXIMATIONS TO ENGLISH To give a visual idea of how this series of processes
383
+ approaches a language, typical sequences in the approx-imations to English have
384
+ been constructed and are given below. In all cases we have assumed a 27-
385
+ symbol"alphabet," the 26 letters and a space. 1. Zero-order approximation
386
+ (symbols independent and equiprobable). XFOML RXKHRJFFJUJ ZLPWCFWKCYJ
387
+ FFJEYVKCQSGHYD QPAAMKBZAACIBZL-HJQD. 2. First-order approximation (symbols
388
+ independent but with frequencies of English text). OCRO HLI RGWR NMIELWIS EU LL
389
+ NBNESEBYA TH EEI ALHENHTTPA OOBTTVANAH BRL. 3. Second-order approximation
390
+ (digram structure as in English). ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY
391
+ ACHIN D ILONASIVE TU-COOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE. 4.
392
+ Third-order approximation (trigram structure as in English). IN NO IST LAT WHEY
393
+ CRATICT FROURE BIRS GROCID PONDENOME OF DEMONS-TURES OF THE REPTAGIN IS
394
+ REGOACTIONA OF CRE. 5. First-order word approximation. Rather than continue
395
+ with tetragram, , n-gram structure it is easier : : : and better to jump at
396
+ this point to word units. Here words are chosen independently but with
397
+ theirappropriate frequencies. REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME
398
+ CAN DIFFERENT NAT-URAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO
399
+ FURNISHESTHE LINE MESSAGE HAD BE THESE. 6. Second-order word approximation. The
400
+ word transition probabilities are correct but no further struc- ture is
401
+ included. THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHAR-
402
+ ACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THATTHE TIME OF
403
+ WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED. The resemblance to ordinary
404
+ English text increases quite noticeably at each of the above steps. Note that
405
+ these samples have reasonably good structure out to about twice the range that
406
+ is taken into account in theirconstruction. Thus in (3) the statistical process
407
+ insures reasonable text for two-letter sequences, but four-letter sequences
408
+ from the sample can usually be fitted into good sentences. In (6) sequences of
409
+ four or morewords can easily be placed in sentences without unusual or strained
410
+ constructions. The particular sequenceof ten words "attack on an English writer
411
+ that the character of this" is not at all unreasonable. It appears thenthat a
412
+ sufficiently complex stochastic process will give a satisfactory representation
413
+ of a discrete source. The first two samples were constructed by the use of a
414
+ book of random numbers in conjunction with (for example 2) a table of letter
415
+ frequencies. This method might have been continued for (3), (4) and (5),since
416
+ digram, trigram and word frequency tables are available, but a simpler
417
+ equivalent method was used. 7
418
+ ===============================================================================
419
+ To construct (3) for example, one opens a book at random and selects a letter
420
+ at random on the page. Thisletter is recorded. The book is then opened to
421
+ another page and one reads until this letter is encountered.The succeeding
422
+ letter is then recorded. Turning to another page this second letter is searched
423
+ for and thesucceeding letter recorded, etc. A similar process was used for (4),
424
+ (5) and (6). It would be interesting iffurther approximations could be
425
+ constructed, but the labor involved becomes enormous at the next stage. 4.
426
+ GRAPHICAL REPRESENTATION OF A MARKOFF PROCESS Stochastic processes of the type
427
+ described above are known mathematically as discrete Markoff processesand have
428
+ been extensively studied in the literature.6 The general case can be described
429
+ as follows: Thereexist a finite number of possible "states" of a system;
430
+
431
+ S1 S2 Sn. In addition there is a set of transition ;
432
+
433
+ ;
434
+
435
+ : : : ;
436
+
437
+ probabilities;
438
+
439
+ pi jthe probability that if the system is in state Siit will next go to state S
440
+ j. To make this Markoff process into an information source we need only assume
441
+ that a letter is produced for each transitionfrom one state to another. The
442
+ states will correspond to the "residue of influence" from preceding letters.
443
+ The situation can be represented graphically as shown in Figs. 3, 4 and 5. The
444
+ "states" are the junction A .1 .4 B E .2 .1 C D .2 Fig. 3 -- A graph
445
+ corresponding to the source in example B. points in the graph and the
446
+ probabilities and letters produced for a transition are given beside the
447
+ correspond-ing line. Figure 3 is for the example B in Section 2, while Fig. 4
448
+ corresponds to the example C. In Fig. 3 B C A A .8 .2 .5 .5 B C B .4 .5 .1 Fig.
449
+ 4 -- A graph corresponding to the source in example C. there is only one state
450
+ since successive letters are independent. In Fig. 4 there are as many states as
451
+ letters.If a trigram example were constructed there would be at most n2 states
452
+ corresponding to the possible pairsof letters preceding the one being chosen.
453
+ Figure 5 is a graph for the case of word structure in example D.Here S
454
+ corresponds to the "space" symbol. 5. ERGODIC AND MIXED SOURCES As we have
455
+ indicated above a discrete source for our purposes can be considered to be
456
+ represented by aMarkoff process. Among the possible discrete Markoff processes
457
+ there is a group with special propertiesof significance in communication
458
+ theory. This special class consists of the "ergodic" processes and weshall call
459
+ the corresponding sources ergodic sources. Although a rigorous definition of an
460
+ ergodic process issomewhat involved, the general idea is simple. In an ergodic
461
+ process every sequence produced by the process 6For a detailed treatment see M.
462
+ Fr�echet, M�ethode des fonctions arbitraires. Th�eorie des �ev�enements en
463
+ cha^ine dans le cas d'un nombre fini d'�etats possibles. Paris, Gauthier-
464
+ Villars, 1938. 8
465
+ ===============================================================================
466
+ is the same in statistical properties. Thus the letter frequencies, digram
467
+ frequencies, etc., obtained fromparticular sequences, will, as the lengths of
468
+ the sequences increase, approach definite limits independentof the particular
469
+ sequence. Actually this is not true of every sequence but the set for which it
470
+ is false hasprobability zero. Roughly the ergodic property means statistical
471
+ homogeneity. All the examples of artificial languages given above are ergodic.
472
+ This property is related to the structure of the corresponding graph. If the
473
+ graph has the following two properties7 the corresponding process willbe
474
+ ergodic: 1. The graph does not consist of two isolated parts A and B such that
475
+ it is impossible to go from junction points in part A to junction points in
476
+ part B along lines of the graph in the direction of arrows and alsoimpossible
477
+ to go from junctions in part B to junctions in part A. 2. A closed series of
478
+ lines in the graph with all arrows on the lines pointing in the same
479
+ orientation will be called a "circuit." The "length" of a circuit is the number
480
+ of lines in it. Thus in Fig. 5 series BEBESis a circuit of length 5. The second
481
+ property required is that the greatest common divisor of the lengthsof all
482
+ circuits in the graph be one. D E B E S A B E E D A B D E S B D E C A E E B B D
483
+ E A D B E E A S Fig. 5 -- A graph corresponding to the source in example D. If
484
+ the first condition is satisfied but the second one violated by having the
485
+ greatest common divisor equal to d 1, the sequences have a certain type of
486
+ periodic structure. The various sequences fall into ddifferent classes which
487
+ are statistically the same apart from a shift of the origin (i.e., which letter
488
+ in the sequence iscalled letter 1). By a shift of from 0 up to d 1 any sequence
489
+ can be made statistically equivalent to any , other. A simple example with d 2
490
+ is the following: There are three possible letters a b c. Letter ais = ;
491
+
492
+ ;
493
+
494
+ followed with either bor cwith probabilities 1 and 2 respectively. Either bor
495
+ cis always followed by letter 3 3 a. Thus a typical sequence is a b a c a c a c
496
+ a b a c a b a b a c a c: This type of situation is not of much importance for
497
+ our work. If the first condition is violated the graph may be separated into a
498
+ set of subgraphs each of which satisfies the first condition. We will assume
499
+ that the second condition is also satisfied for each subgraph. We have inthis
500
+ case what may be called a "mixed" source made up of a number of pure
501
+ components. The componentscorrespond to the various subgraphs. If L1, L2, L3
502
+ are the component sources we may write ;
503
+
504
+ : : : L p1L1 p2L2 p3L3 = + + + 7These are restatements in terms of the graph of
505
+ conditions given in Fr�echet. 9
506
+ ===============================================================================
507
+ where piis the probability of the component source Li. Physically the situation
508
+ represented is this: There are several different sources L1, L2, L3 which are ;
509
+
510
+ : : : each of homogeneous statistical structure (i.e., they are ergodic). We do
511
+ not know a prioriwhich is to beused, but once the sequence starts in a given
512
+ pure component Li, it continues indefinitely according to thestatistical
513
+ structure of that component. As an example one may take two of the processes
514
+ defined above and assume p1 2 and p2 8. A = : = : sequence from the mixed
515
+ source L 2L1 8L2 = : + : would be obtained by choosing first L1 or L2 with
516
+ probabilities .2 and .8 and after this choice generating asequence from
517
+ whichever was chosen. Except when the contrary is stated we shall assume a
518
+ source to be ergodic. This assumption enables one to identify averages along a
519
+ sequence with averages over the ensemble of possible sequences (the
520
+ probabilityof a discrepancy being zero). For example the relative frequency of
521
+ the letter A in a particular infinitesequence will be, with probability one,
522
+ equal to its relative frequency in the ensemble of sequences. If Piis the
523
+ probability of state iand pi jthe transition probability to state j, then for
524
+ the process to be stationary it is clear that the Pimust satisfy equilibrium
525
+ conditions: Pj Pipi j = : i In the ergodic case it can be shown that with any
526
+ starting conditions the probabilities Pj Nof being in state jafter Nsymbols,
527
+ approach the equilibrium values as N . ! 6. CHOICE, UNCERTAINTY AND ENTROPY We
528
+ have represented a discrete information source as a Markoff process. Can we
529
+ define a quantity whichwill measure, in some sense, how much information is
530
+ "produced" by such a process, or better, at what rateinformation is produced?
531
+ Suppose we have a set of possible events whose probabilities of occurrence are
532
+ p1 p2 pn. These ;
533
+
534
+ ;
535
+
536
+ : : : ;
537
+
538
+ probabilities are known but that is all we know concerning which event will
539
+ occur. Can we find a measureof how much "choice" is involved in the selection
540
+ of the event or of how uncertain we are of the outcome? If there is such a
541
+ measure, say H p1 p2 pn, it is reasonable to require of it the following
542
+ properties: ;
543
+
544
+ ;
545
+
546
+ : : : ;
547
+
548
+ 1. Hshould be continuous in the pi. 2. If all the p 1 iare equal, pi , then
549
+ Hshould be a monotonic increasing function of n. With equally = n likely events
550
+ there is more choice, or uncertainty, when there are more possible events. 3.
551
+ If a choice be broken down into two successive choices, the original Hshould be
552
+ the weighted sum of the individual values of H. The meaning of this is
553
+ illustrated in Fig. 6. At the left we have three 1 2 1 2 1 2 1 3 2 3 1 3 1 2 1
554
+ 6 1 3 1 6 Fig. 6 -- Decomposition of a choice from three possibilities.
555
+ possibilities p 1 1 1 1 , p2 , p3 . On the right we first choose between two
556
+ possibilities each with = 2 = 3 = 6 probability 1 , and if the second occurs
557
+ make another choice with probabilities 2 , 1 . The final results 2 3 3 have the
558
+ same probabilities as before. We require, in this special case, that H1 1 1 H1
559
+ 1 1 H2 1 2 ;
560
+
561
+ 3 ;
562
+
563
+ 6 = 2 ;
564
+
565
+ 2 + 2 3 ;
566
+
567
+ 3 : The coefficient 1 is because this second choice only occurs half the time.
568
+ 2 10
569
+ ===============================================================================
570
+ In Appendix 2, the following result is established: Theorem 2:The only
571
+ Hsatisfying the three above assumptions is of the form: n H K pilog pi = , i1 =
572
+ where Kis a positive constant. This theorem, and the assumptions required for
573
+ its proof, are in no way necessary for the present theory. It is given chiefly
574
+ to lend a certain plausibility to some of our later definitions. The real
575
+ justification of thesedefinitions, however, will reside in their implications.
576
+ Quantities of the form H pilog pi(the constant Kmerely amounts to a choice of a
577
+ unit of measure) = , play a central role in information theory as measures of
578
+ information, choice and uncertainty. The form of Hwill be recognized as that of
579
+ entropy as defined in certain formulations of statistical mechanics8 where
580
+ piisthe probability of a system being in cell iof its phase space. His then,
581
+ for example, the Hin Boltzmann'sfamous Htheorem. We shall call H pilog pithe
582
+ entropy of the set of probabilities p1 pn. If xis a = , ;
583
+
584
+ : : : ;
585
+
586
+ chance variable we will write H xfor its entropy;
587
+
588
+ thus xis not an argument of a function but a label for a number, to
589
+ differentiate it from H ysay, the entropy of the chance variable y. The entropy
590
+ in the case of two possibilities with probabilities pand q 1 p, namely = , H
591
+ plog p qlogq = , + is plotted in Fig. 7 as a function of p. 1.0 .9 .8 .7 H BITS
592
+ .6 .5 .4 .3 .2 .1 0 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1.0 p Fig. 7 -- Entropy in the
593
+ case of two possibilities with probabilities pand 1 p. , The quantity Hhas a
594
+ number of interesting properties which further substantiate it as a reasonable
595
+ measure of choice or information. 1. H 0 if and only if all the pibut one are
596
+ zero, this one having the value unity. Thus only when we = are certain of the
597
+ outcome does Hvanish. Otherwise His positive. 2. For a given n, His a maximum
598
+ and equal to log nwhen all the piare equal (i.e., 1 ). This is also n
599
+ intuitively the most uncertain situation. 8See, for example, R. C. Tolman,
600
+ Principles of Statistical Mechanics,Oxford, Clarendon, 1938. 11
601
+ ===============================================================================
602
+ 3. Suppose there are two events, xand y, in question with mpossibilities for
603
+ the first and nfor the second. Let p i jbe the probability of the joint
604
+ occurrence of ifor the first and jfor the second. The entropy of the ;
605
+
606
+ joint event is H x y p i jlog p i j ;
607
+
608
+ = , ;
609
+
610
+ ;
611
+
612
+ i j ;
613
+
614
+ while H x p i jlogp i j = , ;
615
+
616
+ ;
617
+
618
+ i j j ;
619
+
620
+ H y p i jlogp i j = , ;
621
+
622
+ ;
623
+
624
+ : i j i ;
625
+
626
+ It is easily shown that H x y H x H y ;
627
+
628
+ + with equality only if the events are independent (i.e., p i j p i p j). The
629
+ uncertainty of a joint event is ;
630
+
631
+ = less than or equal to the sum of the individual uncertainties. 4. Any change
632
+ toward equalization of the probabilities p1 p2 pnincreases H. Thus if p1 p2 and
633
+ ;
634
+
635
+ ;
636
+
637
+ : : : ;
638
+
639
+ we increase p1, decreasing p2 an equal amount so that p1 and p2 are more nearly
640
+ equal, then Hincreases.More generally, if we perform any "averaging" operation
641
+ on the piof the form p0 i ai j p j = j where i ai j 1, and all ai j 0, then
642
+ Hincreases (except in the special case where this transfor- = j ai j= mation
643
+ amounts to no more than a permutation of the p jwith Hof course remaining the
644
+ same). 5. Suppose there are two chance events xand yas in 3, not necessarily
645
+ independent. For any particular value ithat xcan assume there is a conditional
646
+ probability pi jthat yhas the value j. This is given by p i j ;
647
+
648
+ pi j = : j p i j ;
649
+
650
+ We define the conditional entropyof y, Hx yas the average of the entropy of
651
+ yfor each value of x, weighted according to the probability of getting that
652
+ particular x. That is Hx y p i jlogpi j = , ;
653
+
654
+ : i j ;
655
+
656
+ This quantity measures how uncertain we are of yon the average when we know x.
657
+ Substituting the value of pi jwe obtain Hx y p i jlog p i jp i jlogp i j = , ;
658
+
659
+ ;
660
+
661
+ + ;
662
+
663
+ ;
664
+
665
+ i j i j j ;
666
+
667
+ ;
668
+
669
+ H x y H x = ;
670
+
671
+ , or H x y H x Hx y ;
672
+
673
+ = + : The uncertainty (or entropy) of the joint event x yis the uncertainty of
674
+ xplus the uncertainty of ywhen xis ;
675
+
676
+ known. 6. From 3 and 5 we have H x H y H x y H x Hx y + ;
677
+
678
+ = + : Hence H y Hx y : The uncertainty of yis never increased by knowledge of
679
+ x. It will be decreased unless xand yare independentevents, in which case it is
680
+ not changed. 12
681
+ ===============================================================================
682
+ 7. THE ENTROPY OF AN INFORMATION SOURCE Consider a discrete source of the
683
+ finite state type considered above. For each possible state ithere will be aset
684
+ of probabilities pi jof producing the various possible symbols j. Thus there is
685
+ an entropy Hifor each state. The entropy of the source will be defined as the
686
+ average of these Hiweighted in accordance with theprobability of occurrence of
687
+ the states in question: H PiHi = iPipi jlogpi j = , : i j ;
688
+
689
+ This is the entropy of the source per symbol of text. If the Markoff process is
690
+ proceeding at a definite timerate there is also an entropy per second H0 fiHi =
691
+ i where fiis the average frequency (occurrences per second) of state i. Clearly
692
+ H0 mH = where mis the average number of symbols produced per second. Hor H0
693
+ measures the amount of informa-tion generated by the source per symbol or per
694
+ second. If the logarithmic base is 2, they will represent bitsper symbol or per
695
+ second. If successive symbols are independent then His simply pilog piwhere
696
+ piis the probability of sym- , bol i. Suppose in this case we consider a long
697
+ message of Nsymbols. It will contain with high probabilityabout p1Noccurrences
698
+ of the first symbol, p2Noccurrences of the second, etc. Hence the probability
699
+ of thisparticular message will be roughly p pp1N pp2N ppnN = 1 2 n or log p: N
700
+ pilog pi = i log p: NH = , log 1 p H = : = : N His thus approximately the
701
+ logarithm of the reciprocal probability of a typical long sequence divided by
702
+ thenumber of symbols in the sequence. The same result holds for any source.
703
+ Stated more precisely we have(see Appendix 3): Theorem 3:Given any 0 and 0, we
704
+ can find an N0 such that the sequences of any length N N0 fall into two
705
+ classes: 1. A set whose total probability is less than . 2. The remainder, all
706
+ of whose members have probabilities satisfying the inequality log p1 , H , : N
707
+ log p1 , In other words we are almost certain to have very close to Hwhen Nis
708
+ large. N A closely related result deals with the number of sequences of various
709
+ probabilities. Consider again the sequences of length Nand let them be arranged
710
+ in order of decreasing probability. We define n qto be the number we must take
711
+ from this set starting with the most probable one in order to accumulate a
712
+ totalprobability qfor those taken. 13
713
+ ===============================================================================
714
+ Theorem 4: log n q Lim H = N N ! when qdoes not equal 0 or 1. We may interpret
715
+ log n qas the number of bits required to specify the sequence when we consider
716
+ only log n q the most probable sequences with a total probability q. Then is
717
+ the number of bits per symbol for N the specification. The theorem says that
718
+ for large Nthis will be independent of qand equal to H. The rateof growth of
719
+ the logarithm of the number of reasonably probable sequences is given by H,
720
+ regardless of ourinterpretation of "reasonably probable." Due to these results,
721
+ which are proved in Appendix 3, it is possiblefor most purposes to treat the
722
+ long sequences as though there were just 2HNof them, each with a probability2
723
+ HN , . The next two theorems show that Hand H0 can be determined by limiting
724
+ operations directly from the statistics of the message sequences, without
725
+ reference to the states and transition probabilities betweenstates. Theorem 5:
726
+ Let p Bibe the probability of a sequence Biof symbols from the source. Let 1 GN
727
+ p Bilogp Bi = , N i where the sum is over all sequences Bicontaining Nsymbols.
728
+ Then GNis a monotonic decreasing functionof Nand Lim GN H = : N ! Theorem 6:Let
729
+ p Bi S jbe the probability of sequence Bifollowed by symbol S jand pB S j ;
730
+
731
+ i = p Bi S j p Bibe the conditional probability of S jafter Bi. Let ;
732
+
733
+ = FN p Bi SjlogpB Sj = , ;
734
+
735
+ i i j ;
736
+
737
+ where the sum is over all blocks Biof N 1 symbols and over all symbols S j.
738
+ Then FNis a monotonic , decreasing function of N, FN NGN N 1 GN1 = , , ;
739
+
740
+ , 1 N GN Fn = ;
741
+
742
+ N n1 = FN GN ;
743
+
744
+ and LimN FN H. = ! These results are derived in Appendix 3. They show that a
745
+ series of approximations to Hcan be obtained by considering only the
746
+ statistical structure of the sequences extending over 1 2 Nsymbols. FNis the ;
747
+
748
+ ;
749
+
750
+ : : : ;
751
+
752
+ better approximation. In fact FNis the entropy of the Nth order approximation
753
+ to the source of the typediscussed above. If there are no statistical
754
+ influences extending over more than Nsymbols, that is if theconditional
755
+ probability of the next symbol knowing the preceding N 1 is not changed by a
756
+ knowledge of , any before that, then FN H. FNof course is the conditional
757
+ entropy of the next symbol when the N 1 = , preceding ones are known, while
758
+ GNis the entropy per symbol of blocks of Nsymbols. The ratio of the entropy of
759
The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols will be called its relative entropy. This is the maximum compression possible when we encode into the same alphabet. One minus the relative entropy is the redundancy. The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%. This means that when we write English half of what we write is determined by the structure of the language and half is chosen freely. The figure 50% was found by several independent methods which all gave results in this neighborhood. One is by calculation of the entropy of the approximations to English. A second method is to delete a certain fraction of the letters from a sample of English text and then let someone attempt to restore them. If they can be restored when 50% are deleted the redundancy must be greater than 50%.
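A crude version of the first of these methods can be sketched in Ruby. It uses only single-letter frequencies (the first-order approximation), so it will report a much smaller redundancy than the 50% figure, which depends on structure extending over many letters; the sample string is an arbitrary stand-in for a corpus:

  # First-order entropy of letter frequencies, relative entropy and redundancy.
  sample = "this is only a short illustrative sample of ordinary english text"
  letters = sample.downcase.delete("^a-z")
  counts = Hash.new(0)
  letters.each_char { |c| counts[c] += 1 }
  total = letters.length.to_f
  h1 = -counts.values.sum { |c| (c / total) * Math.log2(c / total) }
  max_h = Math.log2(26)                      # all 26 letters equally likely
  puts "H1 = #{h1.round(3)} bits/letter"
  puts "relative entropy ~ #{(h1 / max_h).round(2)}, redundancy ~ #{(1 - h1 / max_h).round(2)}"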
A third method depends on certain known results in cryptography.

Two extremes of redundancy in English prose are represented by Basic English and by James Joyce's book "Finnegans Wake". The Basic English vocabulary is limited to 850 words and the redundancy is very high. This is reflected in the expansion that occurs when a passage is translated into Basic English. Joyce on the other hand enlarges the vocabulary and is alleged to achieve a compression of semantic content.

The redundancy of a language is related to the existence of crossword puzzles. If the redundancy is zero any sequence of letters is a reasonable text in the language and any two-dimensional array of letters forms a crossword puzzle. If the redundancy is too high the language imposes too many constraints for large crossword puzzles to be possible. A more detailed analysis shows that if we assume the constraints imposed by the language are of a rather chaotic and random nature, large crossword puzzles are just possible when the redundancy is 50%. If the redundancy is 33%, three-dimensional crossword puzzles should be possible, etc.

8. REPRESENTATION OF THE ENCODING AND DECODING OPERATIONS

We have yet to represent mathematically the operations performed by the transmitter and receiver in encoding and decoding the information. Either of these will be called a discrete transducer. The input to the transducer is a sequence of input symbols and its output a sequence of output symbols. The transducer may have an internal memory so that its output depends not only on the present input symbol but also on the past history. We assume that the internal memory is finite, i.e., there exist a finite number $m$ of possible states of the transducer and that its output is a function of the present state and the present input symbol. The next state will be a second function of these two quantities. Thus a transducer can be described by two functions:

$$y_n = f(x_n, \alpha_n)$$
$$\alpha_{n+1} = g(x_n, \alpha_n)$$

where $x_n$ is the $n$th input symbol, $\alpha_n$ is the state of the transducer when the $n$th input symbol is introduced, and $y_n$ is the output symbol (or sequence of output symbols) produced when $x_n$ is introduced if the state is $\alpha_n$.
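A minimal Ruby sketch of such a transducer follows; the class name and the particular f and g are invented for illustration, not taken from the paper. The example instance XORs each input bit with the previous one, and it is non-singular in the sense discussed next, since the original input can be recovered by a running XOR of the output:

  # A finite-state transducer: y_n = f(x_n, state), state_{n+1} = g(x_n, state).
  class DiscreteTransducer
    def initialize(output_fn, next_state_fn, initial_state)
      @f = output_fn
      @g = next_state_fn
      @state = initial_state
    end

    def transduce(inputs)
      inputs.map do |x|
        y = @f.call(x, @state)       # output from present input and state
        @state = @g.call(x, @state)  # update the internal state
        y
      end
    end
  end

  f = ->(x, s) { x ^ s }             # output is input XOR previous input
  g = ->(x, _s) { x }                # state remembers the last input bit
  t = DiscreteTransducer.new(f, g, 0)
  p t.transduce([1, 1, 0, 1, 0, 0])  # => [1, 0, 1, 1, 1, 0]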
If the output symbols of one transducer can be identified with the input symbols of a second, they
+ can be connected in tandem and the result is also a transducer. If there exists
806
+ a second transducer which operateson the output of the first and recovers the
807
+ original input, the first transducer will be called non-singular andthe second
808
+ will be called its inverse. Theorem 7:The output of a finite state transducer
809
+ driven by a finite state statistical source is a finite state statistical
810
+ source, with entropy (per unit time) less than or equal to that of the input.
811
+ If the transduceris non-singular they are equal. Let represent the state of the
812
+ source, which produces a sequence of symbols xi;
813
+
814
+ and let be the state of the transducer, which produces, in its output, blocks
815
+ of symbols y j. The combined system can be representedby the "product state
816
+ space" of pairs . Two points in the space 1 1 and 2 2 , are connected by ;
817
+
818
+ ;
819
+
820
+ ;
821
+
822
+ a line if 1 can produce an xwhich changes 1 to 2, and this line is given the
823
+ probability of that xin this case. The line is labeled with the block of y
824
+ jsymbols produced by the transducer. The entropy of the outputcan be calculated
825
+ as the weighted sum over the states. If we sum first on each resulting term is
826
+ less than or equal to the corresponding term for , hence the entropy is not
827
+ increased. If the transducer is non-singularlet its output be connected to the
828
+ inverse transducer. If H0 , H0 and H0 are the output entropies of the source, 1
829
+ 2 3 the first and second transducers respectively, then H0 H0 H0 H0 and
830
+ therefore H0 H0 . 1 2 3 = 1 1 = 2 15
831
+ ===============================================================================
832
Suppose we have a system of constraints on possible sequences of the type which can be represented by a linear graph as in Fig. 2. If probabilities $p_{ij}^{(s)}$ were assigned to the various lines connecting state $i$ to state $j$ this would become a source. There is one particular assignment which maximizes the resulting entropy (see Appendix 4).

Theorem 8: Let the system of constraints considered as a channel have a capacity $C = \log W$. If we assign

$$p_{ij}^{(s)} = \frac{B_j}{B_i} W^{-\ell_{ij}^{(s)}}$$

where $\ell_{ij}^{(s)}$ is the duration of the $s$th symbol leading from state $i$ to state $j$ and the $B_i$ satisfy

$$B_i = \sum_{s,j} B_j W^{-\ell_{ij}^{(s)}},$$

then $H$ is maximized and equal to $C$.

By proper assignment of the transition probabilities the entropy of symbols on a channel can be maximized at the channel capacity.
9. THE FUNDAMENTAL THEOREM FOR A NOISELESS CHANNEL

We will now justify our interpretation of $H$ as the rate of generating information by proving that $H$ determines the channel capacity required with most efficient coding.

Theorem 9: Let a source have entropy $H$ (bits per symbol) and a channel have a capacity $C$ (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate $\frac{C}{H} - \epsilon$ symbols per second over the channel where $\epsilon$ is arbitrarily small. It is not possible to transmit at an average rate greater than $\frac{C}{H}$.

The converse part of the theorem, that $\frac{C}{H}$ cannot be exceeded, may be proved by noting that the entropy of the channel input per second is equal to that of the source, since the transmitter must be non-singular, and also this entropy cannot exceed the channel capacity. Hence $H' \le C$ and the number of symbols per second $= H'/H \le C/H$.

The first part of the theorem will be proved in two different ways. The first method is to consider the set of all sequences of $N$ symbols produced by the source. For $N$ large we can divide these into two groups, one containing less than $2^{(H+\eta)N}$ members and the second containing less than $2^{RN}$ members (where $R$ is the logarithm of the number of different symbols) and having a total probability less than $\mu$. As $N$ increases $\eta$ and $\mu$ approach zero. The number of signals of duration $T$ in the channel is greater than $2^{(C-\theta)T}$ with $\theta$ small when $T$ is large. If we choose

$$T = \left(\frac{H}{C} + \lambda\right) N$$

then there will be a sufficient number of sequences of channel symbols for the high probability group when $N$ and $T$ are sufficiently large (however small $\lambda$) and also some additional ones. The high probability group is coded in an arbitrary one-to-one way into this set. The remaining sequences are represented by larger sequences, starting and ending with one of the sequences not used for the high probability group. This special sequence acts as a start and stop signal for a different code. In between a sufficient time is allowed to give enough different sequences for all the low probability messages. This will require

$$T_1 = \left(\frac{R}{C} + \varphi\right) N$$

where $\varphi$ is small. The mean rate of transmission in message symbols per second will then be greater than

$$\left[(1-\delta)\frac{T}{N} + \delta\frac{T_1}{N}\right]^{-1} = \left[(1-\delta)\left(\frac{H}{C}+\lambda\right) + \delta\left(\frac{R}{C}+\varphi\right)\right]^{-1}.$$

As $N$ increases $\delta$, $\lambda$ and $\varphi$ approach zero and the rate approaches $\frac{C}{H}$.

Another method of performing this coding and thereby proving the theorem can be described as follows: Arrange the messages of length $N$ in order of decreasing probability and suppose their probabilities are $p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_n$. Let $P_s = \sum_{i=1}^{s-1} p_i$; that is $P_s$ is the cumulative probability up to, but not including, $p_s$. We first encode into a binary system. The binary code for message $s$ is obtained by expanding $P_s$ as a binary number. The expansion is carried out to $m_s$ places, where $m_s$ is the integer satisfying:

$$\log_2 \frac{1}{p_s} \le m_s < 1 + \log_2 \frac{1}{p_s}.$$

Thus the messages of high probability are represented by short codes and those of low probability by long codes. From these inequalities we have

$$\frac{1}{2^{m_s}} \le p_s < \frac{1}{2^{m_s - 1}}.$$

The code for $P_s$ will differ from all succeeding ones in one or more of its $m_s$ places, since all the remaining $P_i$ are at least $\frac{1}{2^{m_s}}$ larger and their binary expansions therefore differ in the first $m_s$ places. Consequently all the codes are different and it is possible to recover the message from its code. If the channel sequences are not already sequences of binary digits, they can be ascribed binary numbers in an arbitrary fashion and the binary code thus translated into signals suitable for the channel.

The average number $H'$ of binary digits used per symbol of original message is easily estimated. We have

$$H' = \frac{1}{N} \sum m_s p_s.$$

But,

$$\frac{1}{N} \sum \left(\log_2 \frac{1}{p_s}\right) p_s \le \frac{1}{N} \sum m_s p_s < \frac{1}{N} \sum \left(1 + \log_2 \frac{1}{p_s}\right) p_s$$

and therefore,

$$G_N \le H' < G_N + \frac{1}{N}.$$

As $N$ increases $G_N$ approaches $H$, the entropy of the source, and $H'$ approaches $H$.

We see from this that the inefficiency in coding, when only a finite delay of $N$ symbols is used, need not be greater than $\frac{1}{N}$ plus the difference between the true entropy $H$ and the entropy $G_N$ calculated for sequences of length $N$. The per cent excess time needed over the ideal is therefore less than

$$\frac{G_N}{H} + \frac{1}{HN} - 1.$$
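The encoding rule of this second proof can be written out directly. The Ruby sketch below uses an illustrative message set: it sorts the messages by decreasing probability, forms the cumulative probability P_s, and expands it to m_s = ceil(log2(1/p_s)) binary places. For the probabilities 1/2, 1/4, 1/8, 1/8 it reproduces the code A -> 0, B -> 10, C -> 110, D -> 111 used in a later example of the text:

  # Binary expansion of cumulative probabilities, carried to m_s places.
  def shannon_code(probabilities)
    sorted = probabilities.sort_by { |_, p| -p }
    cumulative = 0.0
    sorted.map do |msg, p|
      m = Math.log2(1.0 / p).ceil            # number of binary places for this message
      bits = ""
      frac = cumulative
      m.times { frac *= 2; bit = frac.floor; bits << bit.to_s; frac -= bit }
      cumulative += p
      [msg, bits]
    end.to_h
  end

  p shannon_code("A" => 0.5, "B" => 0.25, "C" => 0.125, "D" => 0.125)
  # => {"A"=>"0", "B"=>"10", "C"=>"110", "D"=>"111"}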
This method of encoding is substantially the same as one found independently by R. M. Fano.⁹ His method is to arrange the messages of length $N$ in order of decreasing probability. Divide this series into two groups of as nearly equal probability as possible. If the message is in the first group its first binary digit will be 0, otherwise 1. The groups are similarly divided into subsets of nearly equal probability and the particular subset determines the second binary digit. This process is continued until each subset contains only one message. It is easily seen that apart from minor differences (generally in the last digit) this amounts to the same thing as the arithmetic process described above.

10. DISCUSSION AND EXAMPLES

In order to obtain the maximum
+ power transfer from a generator to a load, a transformer must in general
910
+ beintroduced so that the generator as seen from the load has the load
911
+ resistance. The situation here is roughlyanalogous. The transducer which does
912
+ the encoding should match the source to the channel in a statisticalsense. The
913
+ source as seen from the channel through the transducer should have the same
914
+ statistical structure 9Technical Report No. 65, The Research Laboratory of
915
+ Electronics, M.I.T., March 17, 1949. 17
916
+ ===============================================================================
917
+ as the source which maximizes the entropy in the channel. The content of
918
+ Theorem 9 is that, although anexact match is not in general possible, we can
919
+ approximate it as closely as desired. The ratio of the actualrate of
920
+ transmission to the capacity Cmay be called the efficiency of the coding
921
+ system. This is of courseequal to the ratio of the actual entropy of the
922
+ channel symbols to the maximum possible entropy. In general, ideal or nearly
923
+ ideal encoding requires a long delay in the transmitter and receiver. In the
924
+ noiseless case which we have been considering, the main function of this delay
925
+ is to allow reasonably goodmatching of probabilities to corresponding lengths
926
of sequences. With a good code the logarithm of the reciprocal probability of a long message must be proportional to the duration of the corresponding signal; in fact

$$\left| \frac{\log p^{-1}}{T} - C \right|$$

must be small for all but a small fraction of the long
+ messages. If a source can produce only one particular message its entropy is
930
+ zero, and no channel is required. For example, a computing machine set up to
931
+ calculate the successive digits of produces a definite sequence with no chance
932
+ element. No channel is required to "transmit" this to another point. One could
933
+ construct asecond machine to compute the same sequence at the point. However,
934
+ this may be impractical. In such a casewe can choose to ignore some or all of
935
+ the statistical knowledge we have of the source. We might considerthe digits of
936
+ to be a random sequence in that we construct a system capable of sending any
937
+ sequence of digits. In a similar way we may choose to use some of our
938
+ statistical knowledge of English in constructinga code, but not all of it. In
939
+ such a case we consider the source with the maximum entropy subject to
940
+ thestatistical conditions we wish to retain. The entropy of this source
941
+ determines the channel capacity whichis necessary and sufficient. In the
942
+ example the only information retained is that all the digits are chosen from
943
+ the set 0 1 9. In the case of English one might wish to use the statistical
944
+ saving possible due to ;
945
+
946
+ ;
947
+
948
+ : : : ;
949
+
950
+ letter frequencies, but nothing else. The maximum entropy source is then the
951
+ first approximation to Englishand its entropy determines the required channel
952
capacity.

As a simple example of some of these results consider a source which produces a sequence of letters chosen from among A, B, C, D with probabilities $\frac12, \frac14, \frac18, \frac18$, successive symbols being chosen independently. We have

$$H = -\left(\tfrac12 \log \tfrac12 + \tfrac14 \log \tfrac14 + \tfrac28 \log \tfrac18\right) = \tfrac74 \text{ bits per symbol.}$$

Thus we can approximate a coding system to encode messages from this source into binary digits with an average of $\frac74$ binary digits per symbol. In this case we can actually achieve the limiting value by the following code (obtained by the method of the second proof of Theorem 9):

A  0
B  10
C  110
D  111

The average number of binary digits used in encoding a sequence of $N$ symbols will be

$$N\left(\tfrac12 \times 1 + \tfrac14 \times 2 + \tfrac28 \times 3\right) = \tfrac74 N.$$

It is easily seen that the binary digits 0, 1 have probabilities $\frac12$, $\frac12$ so the $H$ for the coded sequences is one bit per symbol. Since, on the average, we have $\frac74$ binary symbols per original letter, the entropies on a time basis are the same. The maximum possible entropy for the original set is $\log 4 = 2$, occurring when A, B, C, D have probabilities $\frac14, \frac14, \frac14, \frac14$. Hence the relative entropy is $\frac78$. We can translate the binary sequences into the original set of symbols on a two-to-one basis by the following table:

00  A'
01  B'
10  C'
11  D'

This double process then encodes the original message into the same symbols but with an average compression ratio $\frac78$.
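A quick numerical check of this example in Ruby, using the same source and the same code as above (nothing new is assumed):

  probs = { "A" => 0.5, "B" => 0.25, "C" => 0.125, "D" => 0.125 }
  code  = { "A" => "0", "B" => "10", "C" => "110", "D" => "111" }

  h = -probs.values.sum { |p| p * Math.log2(p) }          # 7/4 bits per symbol
  avg_len = probs.sum { |sym, p| p * code[sym].length }   # 7/4 binary digits per symbol
  relative_entropy = h / Math.log2(4)                     # maximum entropy is log 4 = 2
  puts "H = #{h}  average code length = #{avg_len}  relative entropy = #{relative_entropy}"
  # H = 1.75  average code length = 1.75  relative entropy = 0.875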
As a second example consider a source which produces a sequence of A's and B's with probability $p$ for A and $q$ for B. If $p \ll q$ we have

$$H = -\log p^p (1-p)^{1-p} = -p \log p(1-p)^{(1-p)/p} \doteq p \log \frac{e}{p}.$$

In such a case one can construct a fairly good coding of the message on a 0, 1 channel by sending a special sequence, say 0000, for the infrequent symbol A and then a sequence indicating the number of B's following it. This could be indicated by the binary representation with all numbers containing the special sequence deleted. All numbers up to 16 are represented as usual; 16 is represented by the next binary number after 16 which does not contain four zeros, namely 17 = 10001, etc. It can be shown that as $p \to 0$ the coding approaches ideal provided the length of the special sequence is properly
+ adjusted. PART II: THE DISCRETE CHANNEL WITH NOISE 11. REPRESENTATION OF A
984
+ NOISY DISCRETE CHANNEL We now consider the case where the signal is perturbed
985
+ by noise during transmission or at one or the otherof the terminals. This means
986
+ that the received signal is not necessarily the same as that sent out by
987
+ thetransmitter. Two cases may be distinguished. If a particular transmitted
988
+ signal always produces the samereceived signal, i.e., the received signal is a
989
+ definite function of the transmitted signal, then the effect may becalled
990
+ distortion. If this function has an inverse -- no two transmitted signals
991
+ producing the same receivedsignal -- distortion may be corrected, at least in
992
+ principle, by merely performing the inverse functionaloperation on the received
993
+ signal. The case of interest here is that in which the signal does not always
994
+ undergo the same change in trans- mission. In this case we may assume the
995
+ received signal Eto be a function of the transmitted signal Sand asecond
996
+ variable, the noise N. E f S N = ;
997
+
998
+ The noise is considered to be a chance variable just as the message was above.
999
+ In general it may be repre-sented by a suitable stochastic process. The most
1000
+ general type of noisy discrete channel we shall consideris a generalization of
1001
+ the finite state noise-free channel described previously. We assume a finite
1002
+ number ofstates and a set of probabilities p i j ;
1003
+
1004
+ : ;
1005
+
1006
+ This is the probability, if the channel is in state and symbol iis transmitted,
1007
+ that symbol jwill be received and the channel left in state . Thus and range
1008
+ over the possible states, iover the possible transmitted signals and jover the
1009
+ possible received signals. In the case where successive symbols are
1010
+ independently per-turbed by the noise there is only one state, and the channel
1011
+ is described by the set of transition probabilities pi j, the probability of
1012
+ transmitted symbol ibeing received as j. If a noisy channel is fed by a source
1013
there are two statistical processes at work: the source and the noise. Thus there are a number of entropies that can be calculated. First there is the entropy $H(x)$ of the source or of the input to the channel (these will be equal if the transmitter is non-singular). The entropy of the output of the channel, i.e., the received signal, will be denoted by $H(y)$. In the noiseless case $H(y) = H(x)$. The joint entropy of input and output will be $H(x,y)$. Finally there are two conditional entropies $H_x(y)$ and $H_y(x)$, the entropy of the output when the input is known and conversely. Among these quantities we have the relations

$$H(x,y) = H(x) + H_x(y) = H(y) + H_y(x).$$

All of these entropies can be measured on a per-second or a per-
+ symbol basis. 19
1025
+ ===============================================================================
1026
12. EQUIVOCATION AND CHANNEL CAPACITY

If the channel is noisy it is not in general possible to reconstruct the original message or the transmitted signal with certainty by any operation on the received signal $E$. There are, however, ways of transmitting the information which are optimal in combating noise. This is the problem which we now consider.

Suppose there are two possible symbols 0 and 1, and we are transmitting at a rate of 1000 symbols per second with probabilities $p_0 = p_1 = \frac12$. Thus our source is producing information at the rate of 1000 bits per second. During transmission the noise introduces errors so that, on the average, 1 in 100 is received incorrectly (a 0 as 1, or 1 as 0). What is the rate of transmission of information? Certainly less than 1000 bits per second since about 1% of the received symbols are incorrect. Our first impulse might be to say the rate is 990 bits per second, merely subtracting the expected number of errors. This is not satisfactory since it fails to take into account the recipient's lack of knowledge of where the errors occur. We may carry it to an extreme case and suppose the noise so great that the received symbols are entirely independent of the transmitted symbols. The probability of receiving 1 is $\frac12$ whatever was transmitted and similarly for 0. Then about half of the received symbols are correct due to chance alone, and we would be giving the system credit for transmitting 500 bits per second while actually no information is being transmitted at all. Equally "good" transmission would be obtained by dispensing with the channel entirely and flipping a coin at the receiving point.

Evidently the proper correction to apply to the amount of information transmitted is the amount of this information which is missing in the received signal, or alternatively the uncertainty when we have received a signal of what was actually sent. From our previous discussion of entropy as a measure of uncertainty it seems reasonable to use the conditional entropy of the message, knowing the received signal, as a measure of this missing information. This is indeed the proper definition, as we shall see later. Following this idea the rate of actual transmission, $R$, would be obtained by subtracting from the rate of production (i.e., the entropy of the source) the average rate of conditional entropy.

$$R = H(x) - H_y(x)$$

The conditional entropy $H_y(x)$ will, for convenience, be called the equivocation. It measures the average ambiguity of the received signal.

In the example considered above, if a 0 is received the a posteriori probability that a 0 was transmitted is .99, and that a 1 was transmitted is .01. These figures are reversed if a 1 is received. Hence

$$H_y(x) = -[.99 \log .99 + .01 \log .01] = .081 \text{ bits/symbol}$$

or 81 bits per second. We may say that the system is transmitting at a rate $1000 - 81 = 919$ bits per second. In the extreme case where a 0 is equally likely to be received as a 0 or 1 and similarly for 1, the a posteriori probabilities are $\frac12$, $\frac12$ and

$$H_y(x) = -\left[\tfrac12 \log \tfrac12 + \tfrac12 \log \tfrac12\right] = 1 \text{ bit per symbol}$$

or 1000 bits per second. The rate of transmission is then 0 as it should be.
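The two calculations above are easy to reproduce. A small Ruby sketch (the helper name h_binary is ours, not the paper's); since H(x) here is 1 bit per symbol, the per-second rate is simply 1000 times (1 minus the equivocation):

  def h_binary(p)                      # entropy of a {p, 1-p} choice, in bits
    return 0.0 if p == 0.0 || p == 1.0
    -(p * Math.log2(p) + (1 - p) * Math.log2(1 - p))
  end

  symbols_per_second = 1000
  [0.01, 0.5].each do |error_prob|
    equivocation = h_binary(error_prob)                 # H_y(x), bits per symbol
    rate = symbols_per_second * (1 - equivocation)      # R = H(x) - H_y(x), per second
    printf("error prob %.2f: equivocation %.3f bits/symbol, rate %.0f bits/s\n",
           error_prob, equivocation, rate)
  end
  # error prob 0.01: equivocation 0.081 bits/symbol, rate 919 bits/s
  # error prob 0.50: equivocation 1.000 bits/symbol, rate 0 bits/s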
The following theorem gives a direct intuitive interpretation of the equivocation and also serves to justify
+ it as the unique appropriate measure. We consider a communication system and an
1069
+ observer (or auxiliarydevice) who can see both what is sent and what is
1070
+ recovered (with errors due to noise). This observer notesthe errors in the
1071
+ recovered message and transmits data to the receiving point over a "correction
1072
+ channel" toenable the receiver to correct the errors. The situation is
1073
+ indicated schematically in Fig. 8. Theorem 10:If the correction channel has a
1074
+ capacity equal to Hy xit is possible to so encode the correction data as to
1075
+ send it over this channel and correct all but an arbitrarily small fraction of
1076
+ the errors.This is not possible if the channel capacity is less than Hy x. 20
1077
+ ===============================================================================
1078
Fig. 8 -- Schematic diagram of a correction system (source, transmitter, receiver and correcting device; an observer comparing the original message M with the recovered message M' supplies correction data).

Roughly then, $H_y(x)$ is the
+ amount of additional information that must be supplied per second at the
1081
+ receiving point to correct the received message. To prove the first part,
1082
+ consider long sequences of received message M0 and corresponding original
1083
+ message M. There will be logarithmically T Hy xof the M's which could
1084
+ reasonably have produced each M0. Thus we have T Hy xbinary digits to send each
1085
+ Tseconds. This can be done with frequency of errors on a channel of capacity Hy
1086
+ x. The second part can be proved by noting, first, that for any discrete chance
1087
+ variables x, y, z Hy x z Hy x ;
1088
+
1089
+ : The left-hand side can be expanded to give Hy z Hyz x Hy x + Hyz x Hy x Hy z
1090
+ Hy x H z , , : If we identify xas the output of the source, yas the received
1091
+ signal and zas the signal sent over the correctionchannel, then the right-hand
1092
+ side is the equivocation less the rate of transmission over the correction
1093
+ channel.If the capacity of this channel is less than the equivocation the
1094
+ right-hand side will be greater than zero andHyz x 0. But this is the
1095
+ uncertainty of what was sent, knowing both the received signal and the
1096
+ correction signal. If this is greater than zero the frequency of errors cannot
1097
+ be arbitrarily small. Example: Suppose the errors occur at random in a sequence
1098
+ of binary digits: probability pthat a digit is wrongand q 1 pthat it is right.
1099
+ These errors can be corrected if their position is known. Thus the = ,
1100
+ correction channel need only send information as to these positions. This
1101
+ amounts to transmittingfrom a source which produces binary digits with
1102
+ probability pfor 1 (incorrect) and qfor 0 (correct).This requires a channel of
1103
+ capacity plog p qlogq , + which is the equivocation of the original system. The
1104
rate of transmission $R$ can be written in two other forms due to the identities noted above. We have

$$R = H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x,y).$$

The first defining expression has already been interpreted as the amount of information sent less the uncertainty of what was sent. The second measures the amount received less the part of this which is due to noise. The third is the sum of the two amounts less the joint entropy and therefore in a sense is the number of bits per second common to the two. Thus all three expressions have a certain intuitive significance.
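These identities can be verified numerically for any joint distribution. A small Ruby sketch (the joint distribution below is an arbitrary illustrative choice) computes the conditional entropies directly from the conditional distributions and checks that the three expressions for R agree:

  joint = { [0, 0] => 0.4, [0, 1] => 0.1, [1, 0] => 0.05, [1, 1] => 0.45 }

  ent = ->(probs) { -probs.sum { |p| p.zero? ? 0.0 : p * Math.log2(p) } }

  # marginal distributions of x and y
  px = joint.group_by { |(x, _), _| x }.transform_values { |pairs| pairs.sum { |_, p| p } }
  py = joint.group_by { |(_, y), _| y }.transform_values { |pairs| pairs.sum { |_, p| p } }

  h_x  = ent.call(px.values)
  h_y  = ent.call(py.values)
  h_xy = ent.call(joint.values)

  # H_y(x): average over y of the entropy of x given y (and symmetrically for H_x(y))
  hy_x = py.sum do |y, p_y|
    cond = joint.select { |(_, yy), _| yy == y }.values.map { |p| p / p_y }
    p_y * ent.call(cond)
  end
  hx_y = px.sum do |x, p_x|
    cond = joint.select { |(xx, _), _| xx == x }.values.map { |p| p / p_x }
    p_x * ent.call(cond)
  end

  r1 = h_x - hy_x
  r2 = h_y - hx_y
  r3 = h_x + h_y - h_xy
  puts [r1, r2, r3].map { |r| r.round(6) }.inspect   # all three agree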
+ the maximum possible rate of transmission, i.e., the rate when the source is
1116
+ properly matched to the channel. We therefore define the channel capacity by ,
1117
+ C Max H x Hy x = , where the maximum is with respect to all possible
1118
+ information sources used as input to the channel. If thechannel is noiseless,
1119
+ Hy x 0. The definition is then equivalent to that already given for a noiseless
1120
+ channel = since the maximum entropy for the channel is its capacity. 13. THE
1121
+ FUNDAMENTAL THEOREM FOR A DISCRETE CHANNEL WITH NOISE It may seem surprising
1122
+ that we should define a definite capacity Cfor a noisy channel since we can
1123
+ neversend certain information in such a case. It is clear, however, that by
1124
+ sending the information in a redundantform the probability of errors can be
1125
+ reduced. For example, by repeating the message many times and by astatistical
1126
+ study of the different received versions of the message the probability of
1127
+ errors could be made verysmall. One would expect, however, that to make this
1128
+ probability of errors approach zero, the redundancyof the encoding must
1129
+ increase indefinitely, and the rate of transmission therefore approach zero.
1130
+ This is byno means true. If it were, there would not be a very well defined
1131
+ capacity, but only a capacity for a givenfrequency of errors, or a given
1132
+ equivocation;
1133
+
1134
+ the capacity going down as the error requirements are mademore stringent.
1135
+ Actually the capacity Cdefined above has a very definite significance. It is
1136
+ possible to sendinformation at the rate Cthrough the channel with as small a
1137
+ frequency of errors or equivocation as desiredby proper encoding. This
1138
+ statement is not true for any rate greater than C. If an attempt is made to
1139
+ transmitat a higher rate than C, say C R1, then there will necessarily be an
1140
+ equivocation equal to or greater than the + excess R1. Nature takes payment by
1141
+ requiring just that much uncertainty, so that we are not actually gettingany
1142
+ more than Cthrough correctly. The situation is indicated in Fig. 9. The rate of
1143
+ information into the channel is plotted horizontally and the equivocation
1144
+ vertically. Any point above the heavy line in the shaded region can be attained
1145
+ and thosebelow cannot. The points on the line cannot in general be attained,
1146
+ but there will usually be two points onthe line that can. These results are the
1147
+ main justification for the definition of Cand will now be proved. Theorem 11:
1148
+ Let a discrete channel have the capacity Cand a discrete source the entropy per
1149
+ second H. If H Cthere exists a coding system such that the output of the source
1150
+ can be transmitted over the channel with an arbitrarily small frequency of
1151
+ errors (or an arbitrarily small equivocation). If H Cit is possible to encode
1152
+ the source so that the equivocation is less than H C where is arbitrarily
1153
+ small. There is no , + method of encoding which gives an equivocation less than
1154
+ H C. , The method of proving the first part of this theorem is not by
1155
+ exhibiting a coding method having the desired properties, but by showing that
1156
+ such a code must exist in a certain group of codes. In fact we will ATTAINABLE
1157
+ Hy x REGION 1.0 = OPE SL C H x Fig. 9 -- The equivocation possible for a given
1158
+ input entropy to a channel. 22
1159
+ ===============================================================================
1160
+ average the frequency of errors over this group and show that this average can
1161
+ be made less than . If theaverage of a set of numbers is less than there must
1162
+ exist at least one in the set which is less than . This will establish the
1163
+ desired result. The capacity Cof a noisy channel has been defined as , C Max H
1164
+ x Hy x = , where xis the input and ythe output. The maximization is over all
1165
+ sources which might be used as input tothe channel. Let S0 be a source which
1166
+ achieves the maximum capacity C. If this maximum is not actually achieved by
1167
+ any source let S0 be a source which approximates to giving the maximum rate.
1168
+ Suppose S0 is used asinput to the channel. We consider the possible transmitted
1169
+ and received sequences of a long duration T. Thefollowing will be true: 1. The
1170
+ transmitted sequences fall into two classes, a high probability group with
1171
+ about 2T H x members and the remaining sequences of small total probability. 2.
1172
+ Similarly the received sequences have a high probability set of about 2T H y
1173
+ members and a low probability set of remaining sequences. 3. Each high
1174
+ probability output could be produced by about 2THy x inputs. The probability of
1175
+ all other cases has a small total probability. All the 's and 's implied by the
1176
+ words "small" and "about" in these statements approach zero as we allow Tto
1177
+ increase and S0 to approach the maximizing source. The situation is summarized
1178
+ in Fig. 10 where the input sequences are points on the left and output
1179
+ sequences points on the right. The fan of cross lines represents the range of
1180
possible causes for a typical output.

Fig. 10 -- Schematic representation of the relations between inputs and outputs in a channel: $2^{H(x)T}$ high probability messages $M$, $2^{H(y)T}$ high probability received signals $E$, about $2^{H_y(x)T}$ reasonable causes for each $E$, and about $2^{H_x(y)T}$ reasonable effects for each $M$.

Now suppose we have
+ the relations between inputs and outputs in a channel. Now suppose we have
1184
+ another source producing information at rate Rwith R C. In the period Tthis
1185
+ source will have 2TRhigh probability messages. We wish to associate these with
1186
+ a selection of the possiblechannel inputs in such a way as to get a small
1187
+ frequency of errors. We will set up this association in all 23
1188
+ ===============================================================================
1189
+ possible ways (using, however, only the high probability group of inputs as
1190
+ determined by the source S0)and average the frequency of errors for this large
1191
+ class of possible coding systems. This is the same ascalculating the frequency
1192
+ of errors for a random association of the messages and channel inputs of
1193
+ durationT. Suppose a particular output y1 is observed. What is the probability
1194
+ of more than one message in the setof possible causes of y x 1? There are 2T
1195
+ Rmessages distributed at random in 2T H points. The probability of a particular
1196
+ point being a message is thus 2T R H x , : The probability that none of the
1197
+ points in the fan is a message (apart from the actual originating message) is x
1198
+ 2T Hy P 1 2T R H x , = , : Now R H x Hy xso R H x Hy x with positive.
1199
+ Consequently , , = , , x 2T Hy P 1 2 THy x T , , = , approaches (as T ) ! 1 2 T
1200
+ , , : Hence the probability of an error approaches zero and the first part of
1201
+ the theorem is proved. The second part of the theorem is easily shown by noting
1202
+ that we could merely send Cbits per second from the source, completely
1203
+ neglecting the remainder of the information generated. At the receiver
1204
+ theneglected part gives an equivocation H x Cand the part transmitted need only
1205
+ add . This limit can also , be attained in many other ways, as will be shown
1206
+ when we consider the continuous case. The last statement of the theorem is a
1207
+ simple consequence of our definition of C. Suppose we can encode a source with
1208
+ H x C ain such a way as to obtain an equivocation Hy x a with positive. Then =
1209
+ + = , R H x C aand = = + H x Hy x C , = + with positive. This contradicts the
1210
+ definition of Cas the maximum of H x Hy x. , Actually more has been proved than
1211
+ was stated in the theorem. If the average of a set of numbers is p p within of
1212
+ of their maximum, a fraction of at most can be more than below the maximum.
1213
+ Since is arbitrarily small we can say that almost all the systems are
1214
+ arbitrarily close to the ideal. 14. DISCUSSION The demonstration of Theorem 11,
1215
+ while not a pure existence proof, has some of the deficiencies of suchproofs.
1216
+ An attempt to obtain a good approximation to ideal coding by following the
1217
+ method of the proof isgenerally impractical. In fact, apart from some rather
1218
+ trivial cases and certain limiting situations, no explicitdescription of a
1219
+ series of approximation to the ideal has been found. Probably this is no
1220
+ accident but isrelated to the difficulty of giving an explicit construction for
1221
+ a good approximation to a random sequence. An approximation to the ideal would
1222
+ have the property that if the signal is altered in a reasonable way by the
1223
+ noise, the original can still be recovered. In other words the alteration will
1224
+ not in general bring itcloser to another reasonable signal than the original.
1225
+ This is accomplished at the cost of a certain amount ofredundancy in the
1226
+ coding. The redundancy must be introduced in the proper way to combat the
1227
+ particularnoise structure involved. However, any redundancy in the source will
1228
+ usually help if it is utilized at thereceiving point. In particular, if the
1229
+ source already has a certain redundancy and no attempt is made toeliminate it
1230
+ in matching to the channel, this redundancy will help combat noise. For
1231
+ example, in a noiselesstelegraph channel one could save about 50% in time by
1232
+ proper encoding of the messages. This is not doneand most of the redundancy of
1233
+ English remains in the channel symbols. This has the advantage, however,of
1234
+ allowing considerable noise in the channel. A sizable fraction of the letters
1235
+ can be received incorrectlyand still reconstructed by the context. In fact this
1236
+ is probably not a bad approximation to the ideal in manycases, since the
1237
+ statistical structure of English is rather involved and the reasonable English
1238
+ sequences arenot too far (in the sense required for the theorem) from a random
1239
+ selection. 24
1240
+ ===============================================================================
1241
+ As in the noiseless case a delay is generally required to approach the ideal
1242
+ encoding. It now has the additional function of allowing a large sample of
1243
+ noise to affect the signal before any judgment is madeat the receiving point as
1244
+ to the original message. Increasing the sample size always sharpens the
1245
+ possiblestatistical assertions. The content of Theorem 11 and its proof can be
1246
+ formulated in a somewhat different way which exhibits the connection with the
1247
+ noiseless case more clearly. Consider the possible signals of duration Tand
1248
+ supposea subset of them is selected to be used. Let those in the subset all be
1249
+ used with equal probability, and supposethe receiver is constructed to select,
1250
+ as the original signal, the most probable cause from the subset, when
1251
+ aperturbed signal is received. We define N T qto be the maximum number of
1252
+ signals we can choose for the ;
1253
+
1254
+ subset such that the probability of an incorrect interpretation is less than or
1255
+ equal to q. log N T q Theorem 12:Lim ;
1256
+
1257
+ C, where Cis the channel capacity, provided that qdoes not equal 0 or = T T !
1258
+ 1. In other words, no matter how we set out limits of reliability, we can
1259
+ distinguish reliably in time T enough messages to correspond to about CTbits,
1260
+ when Tis sufficiently large. Theorem 12 can be comparedwith the definition of
1261
+ the capacity of a noiseless channel given in Section 1. 15. EXAMPLE OF A
1262
+ DISCRETE CHANNEL AND ITS CAPACITY A simple example of a discrete channel is
1263
+ indicated in Fig. 11. There are three possible symbols. The first isnever
1264
+ affected by noise. The second and third each have probability pof coming
1265
+ through undisturbed, andqof being changed into the other of the pair. We have
1266
+ (letting plog p qlogqand Pand Qbe the = , + p q TRANSMITTED RECEIVED SYMBOLS
1267
+ SYMBOLS q p Fig. 11 -- Example of a discrete channel. probabilities of using
1268
+ the first and second symbols) H x Plog P 2QlogQ = , , Hy x 2Q = : We wish to
1269
+ choose Pand Qin such a way as to maximize H x Hy x, subject to the constraint P
1270
+ 2Q 1. , + = Hence we consider U Plog P 2QlogQ 2Q P 2Q = , , , + + U 1 logP 0 =
1271
+ , , + = P U 2 2 logQ 2 2 0 = , , , + = : Q Eliminating log P log Q = + P Qe Q =
1272
+ = 25
1273
+ ===============================================================================
1274
+ 1 P Q = = : 2 2 + + The channel capacity is then 2 C log + = : Note how this
1275
+ checks the obvious values in the cases p 1 and p 1 . In the first, 1 and C log
1276
+ 3, = = 2 = = which is correct since the channel is then noiseless with three
1277
+ possible symbols. If p 1 , 2 and = 2 = C log 2. Here the second and third
1278
+ symbols cannot be distinguished at all and act together like one = symbol. The
1279
+ first symbol is used with probability P 1 and the second and third together
1280
+ with probability = 2 1 . This may be distributed between them in any desired
1281
+ way and still achieve the maximum capacity. 2 For intermediate values of pthe
1282
+ channel capacity will lie between log 2 and log 3. The distinction between the
1283
+ second and third symbols conveys some information but not as much as in the
1284
+ noiseless case.The first symbol is used somewhat more frequently than the other
1285
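The capacity formula just derived can be evaluated numerically. The Ruby sketch below works in bits, so alpha is computed with log2 and beta = 2**alpha; the three calls check the two limiting cases mentioned in the text and one intermediate value:

  def capacity(p)
    q = 1 - p
    alpha = [p, q].sum { |x| x.zero? ? 0.0 : -x * Math.log2(x) }   # -[p log p + q log q]
    beta = 2**alpha
    Math.log2((beta + 2) / beta)
  end

  puts capacity(1.0)    # noiseless: log 3 ~ 1.585
  puts capacity(0.5)    # second and third symbols indistinguishable: log 2 = 1.0
  puts capacity(0.9)    # an intermediate value, between log 2 and log 3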
+ two because of its freedom from noise. 16. THE CHANNEL CAPACITY IN CERTAIN
1286
+ SPECIAL CASES If the noise affects successive channel symbols independently it
1287
+ can be described by a set of transitionprobabilities pi j. This is the
1288
+ probability, if symbol iis sent, that jwill be received. The maximum
1289
+ channelrate is then given by the maximum of PipijlogPipijPipijlogpij , + i j i
1290
+ i j ;
1291
+
1292
+ ;
1293
+
1294
+ where we vary the Pisubject to Pi 1. This leads by the method of Lagrange to
1295
+ the equations, = ps j ps jlog s 1 2 = = ;
1296
+
1297
+ ;
1298
+
1299
+ : : : : j i Pi pi j Multiplying by Psand summing on sshows that C. Let the
1300
+ inverse of ps j(if it exists) be hstso that = s hst psj t j. Then: =
1301
+ hstpsjlogpsjlogPipit Chst , = : s j i s ;
1302
+
1303
+ Hence: h i Pi pit exp Chsthst psjlog psj = , + i s s j ;
1304
+
1305
+ or, h i Pi hitexp Chsthstpsjlogpsj = , + : t s s j ;
1306
+
1307
+ This is the system of equations for determining the maximizing values of Pi,
1308
+ with Cto be determined so that Pi 1. When this is done Cwill be the channel
1309
+ capacity, and the Pithe proper probabilities for the = channel symbols to
1310
+ achieve this capacity. If each input symbol has the same set of probabilities
1311
+ on the lines emerging from it, and the same is true of each output symbol, the
1312
+ capacity can be easily calculated. Examples are shown in Fig. 12. In such a
1313
+ caseHx yis independent of the distribution of probabilities on the input
1314
+ symbols, and is given by pilog pi , where the piare the values of the
1315
+ transition probabilities from any input symbol. The channel capacity is Max H y
1316
+ Hx y Max H y pilogpi , = + : The maximum of H yis clearly log mwhere mis the
1317
+ number of output symbols, since it is possible to make them all equally
1318
+ probable by making the input symbols equally probable. The channel capacity is
1319
+ therefore C log m pilogpi = + : 26
1320
+ ===============================================================================
1321
+ 1 2 1 2 1 3 1 2 1 3 1 6 1 3 1 2 1 6 1 6 1 6 1 2 1 2 1 6 1 2 1 6 1 3 1 3 1 2 1 3
1322
+ 1 2 1 6 1 3 1 2 1 2 a b c Fig. 12 -- Examples of discrete channels with the
1323
+ same transition probabilities for each input and for each output. In Fig. 12a
1324
+ it would be C log 4 log2 log 2 = , = : This could be achieved by using only the
1325
+ 1st and 3d symbols. In Fig. 12b C log 4 2 log3 1 log6 = , 3 , 3 log 4 log3 1
1326
+ log2 = , , 3 5 log 1 2 3 = 3 : In Fig. 12c we have C log 3 1 log2 1 log3 1 log6
1327
+ = , 2 , 3 , 6 3 log = 1 1 1 : 2 2 3 3 6 6 Suppose the symbols fall into several
1328
+ groups such that the noise never causes a symbol in one group to be mistaken
1329
+ for a symbol in another group. Let the capacity for the nth group be Cn(in bits
1330
+ per second)when we use only the symbols in this group. Then it is easily shown
1331
+ that, for best use of the entire set, thetotal probability Pnof all symbols in
1332
+ the nth group should be 2Cn Pn= : 2Cn Within a group the probability is
1333
+ distributed just as it would be if these were the only symbols being used.The
1334
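Both formulas are easy to evaluate. In the Ruby sketch below the Fig. 12a and 12c transition probabilities follow directly from the capacities quoted above; for Fig. 12b a set of transition probabilities consistent with the quoted value (1/3, 1/3, 1/6, 1/6 over four output symbols) is assumed, since the figure itself is not reproduced here. The grouped-channel call shows that one noise-free symbol plus a noiseless binary pair gives log 3, as in the earlier three-symbol example with p = 1:

  # Uniform channel: C = log m + sum p_i log p_i (bits).
  def uniform_capacity(m, transition_probs)
    Math.log2(m) + transition_probs.sum { |p| p * Math.log2(p) }
  end

  puts uniform_capacity(4, [0.5, 0.5])                    # Fig. 12a: log 2 = 1
  puts uniform_capacity(4, [1/3.0, 1/3.0, 1/6.0, 1/6.0])  # Fig. 12b (assumed probs)
  puts uniform_capacity(3, [0.5, 1/3.0, 1/6.0])           # Fig. 12c

  # Non-interfering groups: C = log sum 2^{C_n}, with C_n in bits.
  def grouped_capacity(group_capacities)
    Math.log2(group_capacities.sum { |c| 2**c })
  end
  puts grouped_capacity([0.0, 1.0])   # one noise-free symbol + noiseless binary pair: log 3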
+ channel capacity is C log 2Cn = : 17. AN EXAMPLE OF EFFICIENT CODING The
1335
+ following example, although somewhat unrealistic, is a case in which exact
1336
+ matching to a noisy channelis possible. There are two channel symbols, 0 and 1,
1337
+ and the noise affects them in blocks of seven symbols.A block of seven is
1338
+ either transmitted without error, or exactly one symbol of the seven is
1339
+ incorrect. Theseeight possibilities are equally likely. We have C Max H y Hx y
1340
+ = , 1 7 8 log 1 = 7 + 8 8 4 bits/symbol = 7 : An efficient code, allowing
1341
+ complete correction of errors and transmitting at the rate C, is the following
1342
+ (found by a method due to R. Hamming): 27
1343
+ ===============================================================================
1344
+ Let a block of seven symbols be X1 X2 X7. Of these X3, X5, X6 and X7 are
1345
+ message symbols and ;
1346
+
1347
+ ;
1348
+
1349
+ : : : ;
1350
+
1351
+ chosen arbitrarily by the source. The other three are redundant and calculated
1352
+ as follows: X4 is chosen to make X4 X5 X6 X7 even = + + + X2 " " " " X2 X3 X6
1353
+ X7 " = + + + X1 " " " " X1 X3 X5 X7 " = + + + When a block of seven is received
1354
+ and are calculated and if even called zero, if odd called one. The ;
1355
+
1356
+ binary number then gives the subscript of the Xithat is incorrect (if 0 there
1357
+ was no error). APPENDIX 1 THE GROWTH OF THE NUMBER OF BLOCKS OF SYMBOLS WITH A
1358
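The code just described is the (7,4) Hamming code, and the parity rules translate directly into Ruby. The sketch below (function names are ours) encodes four message symbols, flips one symbol, and recovers the original block by computing alpha, beta and gamma as described:

  def hamming_encode(m3, m5, m6, m7)
    x = Array.new(8, 0)                     # index 0 unused; x[1]..x[7] are the block
    x[3], x[5], x[6], x[7] = m3, m5, m6, m7
    x[4] = (x[5] + x[6] + x[7]) % 2         # makes X4+X5+X6+X7 even
    x[2] = (x[3] + x[6] + x[7]) % 2         # makes X2+X3+X6+X7 even
    x[1] = (x[3] + x[5] + x[7]) % 2         # makes X1+X3+X5+X7 even
    x[1..7]
  end

  def hamming_correct(block)
    x = [0] + block                         # restore 1-based indexing as in the text
    alpha = (x[4] + x[5] + x[6] + x[7]) % 2
    beta  = (x[2] + x[3] + x[6] + x[7]) % 2
    gamma = (x[1] + x[3] + x[5] + x[7]) % 2
    pos = alpha * 4 + beta * 2 + gamma      # binary number alpha beta gamma
    x[pos] ^= 1 if pos > 0                  # flip the symbol named by the parities
    x[1..7]
  end

  sent = hamming_encode(1, 0, 1, 1)
  noisy = sent.dup
  noisy[4] ^= 1                             # corrupt X5 (array index 4)
  p hamming_correct(noisy) == sent          # => true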
+ FINITE STATE CONDITION Let Ni Lbe the number of blocks of symbols of length
1359
+ Lending in state i. Then we have , s N j L Ni L b = , i j i s ;
1360
+
1361
+ where b1 b2 bmare the length of the symbols which may be chosen in state iand
1362
+ lead to state j. These i j;
1363
+
1364
+ i j;
1365
+
1366
+ : : : ;
1367
+
1368
+ i j are linear difference equations and the behavior as L must be of the type !
1369
+ Nj A jW L = : Substituting in the difference equation s A bij jW L AiWL, = i s
1370
+ ;
1371
+
1372
+ or s A bij j AiW, = i s ;
1373
+
1374
+ s W b , i j i j Ai 0 , = : i s For this to be possible the determinant s D W a
1375
+ bij i j W, i j = j j = , s must vanish and this determines W, which is, of
1376
+ course, the largest real root of D 0. = The quantity Cis then given by log A jW
1377
+ L C Lim logW = L L = ! and we also note that the same growth properties result
1378
+ if we require that all blocks start in the same (arbi-trarily chosen) state.
1379
+ APPENDIX 2 DERIVATION OF H pilog pi = , 1 1 1 Let H A n. From condition (3) we
1380
+ can decompose a choice from smequally likely possi- ;
1381
+
1382
+ ;
1383
+
1384
+ : : : ;
1385
+
1386
+ = n n n bilities into a series of mchoices from sequally likely possibilities
1387
+ and obtain A sm mA s = : 28
1388
+ ===============================================================================
1389
+ Similarly A tn nA t = : We can choose narbitrarily large and find an mto
1390
+ satisfy sm tn s m1 + : Thus, taking logarithms and dividing by nlog s, m log t
1391
+ m 1 m log t or + , n log s n n n log s where is arbitrarily small. Now from the
1392
+ monotonic property of A n, A sm A tn A sm1 + mA s nA t m 1 A s + : Hence,
1393
+ dividing by nA s, m A t m 1 m A t or + , n A s n n n A s A t logt 2 A t Klogt ,
1394
+ = A s log s where Kmust be positive to satisfy (2). ni Now suppose we have a
1395
+ choice from npossibilities with commeasurable probabilities pi where = ni the
1396
+ niare integers. We can break down a choice from nipossibilities into a choice
1397
+ from npossibilitieswith probabilities p1 pnand then, if the ith was chosen, a
1398
+ choice from niwith equal probabilities. Using ;
1399
+
1400
+ : : : ;
1401
+
1402
+ condition (3) again, we equate the total choice from nias computed by two
1403
+ methods Klog ni H p1 pn K pilogni = ;
1404
+
1405
+ : : : ;
1406
+
1407
+ + : Hence h i H K pilogni pilogni = , ni K pilog K pilog pi = , = , : ni If the
1408
+ piare incommeasurable, they may be approximated by rationals and the same
1409
+ expression must holdby our continuity assumption. Thus the expression holds in
1410
+ general. The choice of coefficient Kis a matterof convenience and amounts to
1411
+ the choice of a unit of measure. APPENDIX 3 THEOREMS ON ERGODIC SOURCES If it
1412
+ is possible to go from any state with P 0 to any other along a path of
1413
+ probability p 0, the system is ergodic and the strong law of large numbers can
1414
+ be applied. Thus the number of times a given path pi jinthe network is
1415
+ traversed in a long sequence of length Nis about proportional to the
1416
+ probability of being ati, say Pi, and then choosing this path, Pi pi jN. If Nis
1417
+ large enough the probability of percentage error in this is less than so that
1418
+ for all but a set of small probability the actual numbers lie within the limits
1419
+ Pi pi j N : Hence nearly all sequences have a probability pgiven by P N p p
1420
+ ipij = i j 29
1421
+ ===============================================================================
1422
+ log p and is limited by N log p Pipij log pi j = N or log p Pipijlogpij , : N
1423
+ This proves Theorem 3. Theorem 4 follows immediately from this on calculating
1424
+ upper and lower bounds for n qbased on the possible range of values of pin
1425
+ Theorem 3. In the mixed (not ergodic) case if L piLi = and the entropies of the
1426
+ components are H1 H2 Hnwe have the Theorem:Lim logn q qis a decreasing step
1427
+ function, N N = ' ! s1 s , q Hs in the interval i q i ' = : 1 1 To prove
1428
+ Theorems 5 and 6 first note that FNis monotonic decreasing because increasing
1429
+ Nadds a subscript to a conditional entropy. A simple substitution for pB S in
1430
+ the definition of F i j Nshows that FN NGN N 1 GN1 = , , , 1 and summing this
1431
+ for all Ngives GN Fn. Hence GN FNand GNmonotonic decreasing. Also they = N must
1432
+ approach the same limit. By using Theorem 3 we see that Lim GN H. = N !
1433
+ APPENDIX 4 MAXIMIZING THE RATE FOR A SYSTEM OF CONSTRAINTS Suppose we have a
1434
+ set of constraints on sequences of symbols that is of the finite state type and
1435
+ can be s represented therefore by a linear graph. Let be the lengths of the
1436
+ various symbols that can occur in `i j s passing from state ito state j. What
1437
+ distribution of probabilities P ifor the different states and p for i j
1438
+ choosing symbol sin state iand going to state jmaximizes the rate of generating
1439
+ information under theseconstraints? The constraints define a discrete channel
1440
+ and the maximum rate must be less than or equal tothe capacity Cof this
1441
+ channel, since if all blocks of large length were equally likely, this rate
1442
+ would result,and if possible this would be best. We will show that this rate
1443
+ can be achieved by proper choice of the Piand s p . i j The rate in question is
1444
+ s s P i p log p N , i j i j = : s s P M i pij`i j s s s Let i j . Evidently for
1445
+ a maximum p kexp . The constraints on maximization are Pi ` = s`i j i j= `i j =
1446
+ 1, j pi j 1, Pi pi j i j 0. Hence we maximize = , = Pipijlog pij , U Pi ipij
1447
+ jPi pij ij = P + + + , i pi j i j ` i U MPi1 log pi j NPi i j + + ` i iPi 0 = ,
1448
+ + = : pi j M2 + + 30
1449
+ ===============================================================================
1450
+ Solving for pi j pi j AiB jD,`ij = : Since p 1 i j 1 A, BjD,`ij = ;
1451
+
1452
+ i = j j B jD,`ij pi j= : s BsD,`is The correct value of Dis the capacity Cand
1453
+ the B jare solutions of B i j i BjC,` = for then B j pi j C,`ij = Bi Bj Pi
1454
+ C,`ij Pj = Bi or Pi Pj C,`ij= : Bi B j So that if isatisfy iC,`ij j = Pi Bi i =
1455
+ : Both the sets of equations for Biand ican be satisfied since Cis such that
1456
+ C,`ij i j 0 j , j = : In this case the rate is B B P j j i pi jlog C,`ij P B i
1457
+ pi jlog i B C i , = , Pi pi j i j Pipij ij ` ` but
1458
+ PipijlogBjlogBiPjlogBjPilogBi0 , = , = j Hence the rate is Cand as this could
1459
+ never be exceeded this is the maximum, justifying the assumed solution. 31
1460
+ ===============================================================================
1461
+ PART III: MATHEMATICAL PRELIMINARIES In this final installment of the paper we
1462
+ consider the case where the signals or the messages or both arecontinuously
1463
+ variable, in contrast with the discrete nature assumed heretofore. To a
1464
+ considerable extent thecontinuous case can be obtained through a limiting
1465
+ process from the discrete case by dividing the continuumof messages and signals
1466
+ into a large but finite number of small regions and calculating the various
1467
+ parametersinvolved on a discrete basis. As the size of the regions is decreased
1468
+ these parameters in general approach aslimits the proper values for the
1469
+ continuous case. There are, however, a few new effects that appear and alsoa
1470
+ general change of emphasis in the direction of specialization of the general
1471
+ results to particular cases. We will not attempt, in the continuous case, to
1472
+ obtain our results with the greatest generality, or with the extreme rigor of
1473
+ pure mathematics, since this would involve a great deal of abstract measure
1474
+ theoryand would obscure the main thread of the analysis. A preliminary study,
1475
+ however, indicates that the theorycan be formulated in a completely axiomatic
1476
+ and rigorous manner which includes both the continuous anddiscrete cases and
1477
+ many others. The occasional liberties taken with limiting processes in the
1478
+ present analysiscan be justified in all cases of practical interest. 18. SETS
1479
+ AND ENSEMBLES OF FUNCTIONS We shall have to deal in the continuous case with
1480
+ sets of functions and ensembles of functions. A set offunctions, as the name
1481
+ implies, is merely a class or collection of functions, generally of one
1482
+ variable, time.It can be specified by giving an explicit representation of the
1483
+ various functions in the set, or implicitly bygiving a property which functions
1484
+ in the set possess and others do not. Some examples are: 1. The set of
1485
+ functions: f t sin t = + : Each particular value of determines a particular
1486
+ function in the set. 2. The set of all functions of time containing no
1487
+ frequencies over Wcycles per second. 3. The set of all functions limited in
1488
+ band to Wand in amplitude to A. 4. The set of all English speech signals as
1489
+ functions of time. An ensembleof functions is a set of functions together with
1490
+ a probability measure whereby we may determine the probability of a function in
1491
+ the set having certain properties.1 For example with the set, f t sin t = + ;
1492
+
1493
+ we may give a probability distribution for , P . The set then becomes an
1494
+ ensemble. Some further examples of ensembles of functions are: 1. A finite set
1495
+ of functions fk t(k 1 2 n) with the probability of fkbeing pk. = ;
1496
+
1497
+ ;
1498
+
1499
+ : : : ;
1500
+
1501
+ 2. A finite dimensional family of functions f 1 2 n;
1502
+
1503
+ t ;
1504
+
1505
+ ;
1506
+
1507
+ : : : ;
1508
+
1509
+ with a probability distribution on the parameters i: p 1 n ;
1510
+
1511
+ : : : ;
1512
+
1513
+ : For example we could consider the ensemble defined by n f a1 an1 n;
1514
+
1515
+ t aisini t i ;
1516
+
1517
+ : : : ;
1518
+
1519
+ ;
1520
+
1521
+ ;
1522
+
1523
+ : : : ;
1524
+
1525
+ = ! + i1 = with the amplitudes aidistributed normally and independently, and
1526
+ the phases idistributed uniformly (from 0 to 2 ) and independently. 1In
1527
+ mathematical terminology the functions belong to a measure space whose total
1528
+ measure is unity. 32
1529
+ ===============================================================================
1530
3. The ensemble

$$f(a_i; t) = \sum_{n=-\infty}^{+\infty} a_n \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)}$$

with the $a_i$ normal and independent all with the same standard deviation $\sqrt{N}$. This is a representation of "white" noise, band limited to the band from 0 to $W$ cycles per second and with average power $N$.²
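A rough Ruby sketch of drawing one member function of this ensemble; the truncation of the infinite sum to a finite range of n, the Box-Muller sampler, and the particular values of W and N are implementation choices for illustration only:

  def gaussian(std_dev)
    # Box-Muller transform for a normal sample with mean 0
    std_dev * Math.sqrt(-2 * Math.log(1 - rand)) * Math.cos(2 * Math::PI * rand)
  end

  def sinc(x)
    x.zero? ? 1.0 : Math.sin(Math::PI * x) / (Math::PI * x)
  end

  w = 100.0                                     # bandwidth W in cycles per second
  power = 1.0                                   # average power N
  coeffs = (-200..200).map { gaussian(Math.sqrt(power)) }

  noise = ->(t) do
    # truncated version of the sum over n, with n = index - 200
    coeffs.each_with_index.sum { |a, i| a * sinc(2 * w * t - (i - 200)) }
  end
  puts noise.call(0.013)                        # one sample of one member function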
4. Let points be distributed on the $t$ axis according to a Poisson distribution. At each selected
+ point the function f tis placed and the different functions added, giving the
1537
+ ensemble f t tk + k =, where the tkare the points of the Poisson distribution.
1538
+ This ensemble can be considered as a type ofimpulse or shot noise where all the
1539
+ impulses are identical. 5. The set of English speech functions with the
1540
+ probability measure given by the frequency of occurrence in ordinary use. An
1541
+ ensemble of functions f tis stationaryif the same ensemble results when all
1542
+ functions are shifted any fixed amount in time. The ensemble f t sin t = + is
1543
+ stationary if is distributed uniformly from 0 to 2 . If we shift each function
1544
+ by t1 we obtain f t t1 sin t t1 + = + + sin t = + ' with distributed uniformly
1545
+ from 0 to 2 . Each function has changed but the ensemble as a whole is '
1546
+ invariant under the translation. The other examples given above are also
1547
+ stationary. An ensemble is ergodicif it is stationary, and there is no subset
1548
+ of the functions in the set with a probability different from 0 and 1 which is
1549
+ stationary. The ensemble sin t + is ergodic. No subset of these functions of
1550
+ probability 0 1 is transformed into itself under all time trans- 6= ;
1551
+
1552
+ lations. On the other hand the ensemble asin t + with adistributed normally and
1553
+ uniform is stationary but not ergodic. The subset of these functions with
1554
+ abetween 0 and 1 for example is stationary. Of the examples given, 3 and 4 are
1555
+ ergodic, and 5 may perhaps be considered so. If an ensemble is ergodic we may
1556
+ say roughly that each function in the set is typical of the ensemble. More
1557
+ precisely it isknown that with an ergodic ensemble an average of any statistic
1558
+ over the ensemble is equal (with probability1) to an average over the time
1559
+ translations of a particular function of the set.3 Roughly speaking,
1560
+ eachfunction can be expected, as time progresses, to go through, with the
1561
+ proper frequency, all the convolutionsof any of the functions in the set. 2This
1562
+ representation can be used as a definition of band limited white noise. It has
1563
+ certain advantages in that it involves fewer limiting operations than do
1564
+ definitions that have been used in the past. The name "white noise," already
1565
+ firmly entrenched in theliterature, is perhaps somewhat unfortunate. In optics
1566
+ white light means either any continuous spectrum as contrasted with a
1567
+ pointspectrum, or a spectrum which is flat with wavelength(which is not the
1568
+ same as a spectrum flat with frequency). 3This is the famous ergodic theorem or
1569
+ rather one aspect of this theorem which was proved in somewhat different
1570
+ formulations by Birkoff, von Neumann, and Koopman, and subsequently generalized
1571
+ by Wiener, Hopf, Hurewicz and others. The literature onergodic theory is quite
1572
+ extensive and the reader is referred to the papers of these writers for precise
1573
+ and general formulations;
1574
+
1575
+ e.g.,E. Hopf, "Ergodentheorie," Ergebnisse der Mathematik und ihrer
1576
+ Grenzgebiete,v. 5;
1577
+
1578
+ "On Causality Statistics and Probability," Journalof Mathematics and Physics,v.
1579
+ XIII, No. 1, 1934;
1580
+
1581
+ N. Wiener, "The Ergodic Theorem," Duke Mathematical Journal,v. 5, 1939. 33
1582
+ ===============================================================================
+ Just as we may perform various operations on numbers or functions to obtain
+ new numbers or functions, we can perform operations on ensembles to obtain new
+ ensembles. Suppose, for example, we have an ensemble of functions
+ $f_\alpha(t)$ and an operator T which gives for each function $f_\alpha(t)$ a
+ resulting function $g_\alpha(t)$:
+   $g_\alpha(t) = T f_\alpha(t).$
+ Probability measure is defined for the set $g_\alpha(t)$ by means of that for
+ the set $f_\alpha(t)$. The probability of a certain subset of the
+ $g_\alpha(t)$ functions is equal to that of the subset of the $f_\alpha(t)$
+ functions which produce members of the given subset of g functions under the
+ operation T. Physically this corresponds to passing the ensemble through some
+ device, for example, a filter, a rectifier or a modulator. The output
+ functions of the device form the ensemble $g_\alpha(t)$.
+ A device or operator T will be called invariant if shifting the input merely
+ shifts the output, i.e., if
+   $g_\alpha(t) = T f_\alpha(t)$
+ implies
+   $g_\alpha(t + t_1) = T f_\alpha(t + t_1)$
+ for all $f_\alpha(t)$ and all $t_1$. It is easily shown (see Appendix 5) that
+ if T is invariant and the input ensemble is stationary then the output
+ ensemble is stationary. Likewise if the input is ergodic the output will also
+ be ergodic.
+ A filter or a rectifier is invariant under all time translations. The
+ operation of modulation is not since the carrier phase gives a certain time
+ structure. However, modulation is invariant under all translations which are
+ multiples of the period of the carrier.
+ Wiener has pointed out the intimate relation between the invariance of
+ physical devices under time translations and Fourier theory.4 He has shown, in
+ fact, that if a device is linear as well as invariant Fourier analysis is then
+ the appropriate mathematical tool for dealing with the problem.
+ An ensemble of functions is the appropriate mathematical representation of the
+ messages produced by a continuous source (for example, speech), of the signals
+ produced by a transmitter, and of the perturbing noise. Communication theory
+ is properly concerned, as has been emphasized by Wiener, not with operations
+ on particular functions, but with operations on ensembles of functions. A
+ communication system is designed not for a particular speech function and
+ still less for a sine wave, but for the ensemble of speech functions.
+
+ 19. BAND LIMITED ENSEMBLES OF FUNCTIONS
+
+ If a function of time f(t) is limited to the band from 0 to W cycles per
+ second it is completely determined by giving its ordinates at a series of
+ discrete points spaced 1/2W seconds apart in the manner indicated by the
+ following result.5
+ Theorem 13: Let f(t) contain no frequencies over W. Then
+   $f(t) = \sum_{n=-\infty}^{\infty} X_n
+      \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)}$
+ where
+   $X_n = f\!\left(\frac{n}{2W}\right).$
+ 4 Communication theory is heavily indebted to Wiener for much of its basic
+ philosophy and theory. His classic NDRC report, The Interpolation,
+ Extrapolation and Smoothing of Stationary Time Series (Wiley, 1949), contains
+ the first clear-cut formulation of communication theory as a statistical
+ problem, the study of operations on time series. This work, although chiefly
+ concerned with the linear prediction and filtering problem, is an important
+ collateral reference in connection with the present paper. We may also refer
+ here to Wiener's Cybernetics (Wiley, 1948), dealing with the general problems
+ of communication and control.
+ 5 For a proof of this theorem and further discussion see the author's paper
+ "Communication in the Presence of Noise" published in the Proceedings of the
+ Institute of Radio Engineers, v. 37, No. 1, Jan., 1949, pp. 10-21. 34
+ ===============================================================================
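+ As an illustration of Theorem 13 (an editorial addition, not part of Shannon's
+ text), the Ruby sketch below rebuilds a band-limited signal from samples taken
+ every 1/2W seconds; the test signal, W, and the evaluation point are arbitrary
+ choices for the example.
+   # f(t) = sum_n X_n * sin(pi(2Wt - n)) / (pi(2Wt - n)), with X_n = f(n/2W).
+   def sinc(x)
+     x.abs < 1e-12 ? 1.0 : Math.sin(Math::PI * x) / (Math::PI * x)
+   end
+   def reconstruct(samples, w, t)
+     samples.each_with_index.sum { |x_n, n| x_n * sinc(2 * w * t - n) }
+   end
+   w = 1.0                                          # band limit, cycles per second
+   f = ->(t) { Math.sin(2 * Math::PI * 0.4 * t) }   # a signal inside the band
+   samples = (0..200).map { |n| f.call(n / (2.0 * w)) }
+   puts reconstruct(samples, w, 10.3)               # close to the value below
+   puts f.call(10.3)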
+ In this expansion f(t) is represented as a sum of orthogonal functions. The
+ coefficients $X_n$ of the various terms can be considered as coordinates in an
+ infinite dimensional "function space." In this space each function corresponds
+ to precisely one point and each point to one function.
+ A function can be considered to be substantially limited to a time T if all
+ the ordinates $X_n$ outside this interval of time are zero. In this case all
+ but 2TW of the coordinates will be zero. Thus functions limited to a band W
+ and duration T correspond to points in a space of 2TW dimensions.
+ A subset of the functions of band W and duration T corresponds to a region in
+ this space. For example, the functions whose total energy is less than or
+ equal to E correspond to points in a 2TW dimensional sphere with radius
+ $r = \sqrt{2WE}$.
+ An ensemble of functions of limited duration and band will be represented by a
+ probability distribution $p(x_1, \dots, x_n)$ in the corresponding
+ n dimensional space. If the ensemble is not limited in time we can consider
+ the 2TW coordinates in a given interval T to represent substantially the part
+ of the function in the interval T and the probability distribution
+ $p(x_1, \dots, x_n)$ to give the statistical structure of the ensemble for
+ intervals of that duration.
+
+ 20. ENTROPY OF A CONTINUOUS DISTRIBUTION
+
+ The entropy of a discrete set of probabilities $p_1, \dots, p_n$ has been
+ defined as:
+   $H = -\sum p_i \log p_i.$
+ In an analogous manner we define the entropy of a continuous distribution with
+ the density distribution function p(x) by:
+   $H = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx.$
+ With an n dimensional distribution $p(x_1, \dots, x_n)$ we have
+   $H = -\int \cdots \int p(x_1, \dots, x_n)
+        \log p(x_1, \dots, x_n)\, dx_1 \cdots dx_n.$
+ If we have two arguments x and y (which may themselves be multidimensional)
+ the joint and conditional entropies of p(x, y) are given by
+   $H(x, y) = -\iint p(x, y) \log p(x, y)\, dx\, dy$
+ and
+   $H_x(y) = -\iint p(x, y) \log \frac{p(x, y)}{p(x)}\, dx\, dy$
+   $H_y(x) = -\iint p(x, y) \log \frac{p(x, y)}{p(y)}\, dx\, dy$
+ where
+   $p(x) = \int p(x, y)\, dy, \qquad p(y) = \int p(x, y)\, dx.$
+ The entropies of continuous distributions have most (but not all) of the
+ properties of the discrete case. In particular we have the following:
+ 1. If x is limited to a certain volume v in its space, then H(x) is a maximum
+ and equal to log v when p(x) is constant (1/v) in the volume. 35
+ ===============================================================================
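+ The definition above can be checked numerically. The Ruby sketch that follows
+ (an editorial addition, not in the original) estimates the continuous entropy
+ on a grid; for the constant density of property 1 it returns approximately
+ log v, in natural-log units. The density and interval are assumptions for the
+ example.
+   # Estimate H = -Integral p(x) log p(x) dx by a midpoint sum.
+   def differential_entropy(density, a, b, steps = 10_000)
+     dx = (b - a).to_f / steps
+     (0...steps).sum do |i|
+       p = density.call(a + (i + 0.5) * dx)
+       p > 0 ? -p * Math.log(p) * dx : 0.0
+     end
+   end
+   v = 4.0
+   uniform = ->(_x) { 1.0 / v }                 # constant density on [0, v]
+   puts differential_entropy(uniform, 0.0, v)   # ~ log v = 1.386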
+ 2. With any two variables x, y we have
+   $H(x, y) \le H(x) + H(y)$
+ with equality if (and only if) x and y are independent, i.e.,
+ $p(x, y) = p(x)\, p(y)$ (apart possibly from a set of points of probability
+ zero).
+ 3. Consider a generalized averaging operation of the following type:
+   $p'(y) = \int a(x, y)\, p(x)\, dx$
+ with
+   $\int a(x, y)\, dx = \int a(x, y)\, dy = 1, \qquad a(x, y) \ge 0.$
+ Then the entropy of the averaged distribution $p'(y)$ is equal to or greater
+ than that of the original distribution p(x).
+ 4. We have
+   $H(x, y) = H(x) + H_x(y) = H(y) + H_y(x)$
+ and
+   $H_x(y) \le H(y).$
+ 5. Let p(x) be a one-dimensional distribution. The form of p(x) giving a
+ maximum entropy subject to the condition that the standard deviation of x be
+ fixed at $\sigma$ is Gaussian. To show this we must maximize
+   $H(x) = -\int p(x) \log p(x)\, dx$
+ with
+   $\sigma^2 = \int p(x)\, x^2\, dx \quad \text{and} \quad 1 = \int p(x)\, dx$
+ as constraints. This requires, by the calculus of variations, maximizing
+   $\int \bigl[-p(x) \log p(x) + \lambda p(x) x^2 + \mu p(x)\bigr]\, dx.$
+ The condition for this is
+   $-1 - \log p(x) + \lambda x^2 + \mu = 0$
+ and consequently (adjusting the constants to satisfy the constraints)
+   $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}.$
+ Similarly in n dimensions, suppose the second order moments of
+ $p(x_1, \dots, x_n)$ are fixed at $A_{ij}$:
+   $A_{ij} = \int \cdots \int x_i x_j\, p(x_1, \dots, x_n)\, dx_1 \cdots dx_n.$
+ Then the maximum entropy occurs (by a similar calculation) when
+ $p(x_1, \dots, x_n)$ is the n dimensional Gaussian distribution with the
+ second order moments $A_{ij}$. 36
+ ===============================================================================
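+ A small numeric check of property 5 (added editorially, not in the original):
+ among distributions with the same standard deviation the Gaussian has the
+ larger entropy. The sketch uses the Gaussian entropy $\log \sqrt{2\pi e}\,
+ \sigma$ derived in property 6 below and compares it with a uniform
+ distribution of equal standard deviation; the value of sigma is arbitrary.
+   def gaussian_entropy(sigma)
+     Math.log(Math.sqrt(2 * Math::PI * Math::E) * sigma)
+   end
+   def uniform_entropy(sigma)
+     Math.log(Math.sqrt(12.0) * sigma)   # a uniform with std dev sigma has width sqrt(12)*sigma
+   end
+   sigma = 2.0
+   puts gaussian_entropy(sigma)   # ~ 2.112
+   puts uniform_entropy(sigma)    # ~ 1.936, smaller, as property 5 requires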
+ 6. The entropy of a one-dimensional Gaussian distribution whose standard
+ deviation is $\sigma$ is given by
+   $H(x) = \log \sqrt{2\pi e}\, \sigma.$
+ This is calculated as follows:
+   $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}$
+   $-\log p(x) = \log \sqrt{2\pi}\,\sigma + \frac{x^2}{2\sigma^2}$
+   $H(x) = -\int p(x) \log p(x)\, dx
+      = \int p(x) \log \sqrt{2\pi}\,\sigma\, dx
+      + \int p(x)\, \frac{x^2}{2\sigma^2}\, dx$
+   $= \log \sqrt{2\pi}\,\sigma + \frac{\sigma^2}{2\sigma^2}
+      = \log \sqrt{2\pi}\,\sigma + \log \sqrt{e}
+      = \log \sqrt{2\pi e}\, \sigma.$
+ Similarly the n dimensional Gaussian distribution with associated quadratic
+ form $a_{ij}$ is given by
+   $p(x_1, \dots, x_n) = \frac{|a_{ij}|^{1/2}}{(2\pi)^{n/2}}
+      \exp\Bigl(-\tfrac{1}{2} \sum a_{ij} x_i x_j\Bigr)$
+ and the entropy can be calculated as
+   $H = \log (2\pi e)^{n/2} |a_{ij}|^{-1/2}$
+ where $|a_{ij}|$ is the determinant whose elements are $a_{ij}$.
+ 7. If x is limited to a half line ($p(x) = 0$ for $x \le 0$) and the first
+ moment of x is fixed at a:
+   $a = \int_0^{\infty} p(x)\, x\, dx,$
+ then the maximum entropy occurs when
+   $p(x) = \frac{1}{a}\, e^{-x/a}$
+ and is equal to log ea.
+ 8. There is one important difference between the continuous and discrete
+ entropies. In the discrete case the entropy measures in an absolute way the
+ randomness of the chance variable. In the continuous case the measurement is
+ relative to the coordinate system. If we change coordinates the entropy will
+ in general change. In fact if we change to coordinates $y_1, \dots, y_n$ the
+ new entropy is given by
+   $H(y) = -\int \cdots \int p(x_1, \dots, x_n)\, J\!\left(\frac{x}{y}\right)
+      \log p(x_1, \dots, x_n)\, J\!\left(\frac{x}{y}\right)\, dy_1 \cdots dy_n$
+ where $J(\frac{x}{y})$ is the Jacobian of the coordinate transformation. On
+ expanding the logarithm and changing the variables to $x_1, \dots, x_n$, we
+ obtain:
+   $H(y) = H(x) - \int \cdots \int p(x_1, \dots, x_n)
+      \log J\!\left(\frac{x}{y}\right)\, dx_1 \cdots dx_n.$ 37
+ ===============================================================================
+ Thus the new entropy is the old entropy less the expected logarithm of the
+ Jacobian. In the continuous case the entropy can be considered a measure of
+ randomness relative to an assumed standard, namely the coordinate system
+ chosen with each small volume element $dx_1 \cdots dx_n$ given equal weight.
+ When we change the coordinate system the entropy in the new system measures
+ the randomness when equal volume elements $dy_1 \cdots dy_n$ in the new system
+ are given equal weight.
+ In spite of this dependence on the coordinate system the entropy concept is as
+ important in the continuous case as the discrete case. This is due to the fact
+ that the derived concepts of information rate and channel capacity depend on
+ the difference of two entropies and this difference does not depend on the
+ coordinate frame, each of the two terms being changed by the same amount.
+ The entropy of a continuous distribution can be negative. The scale of
+ measurements sets an arbitrary zero corresponding to a uniform distribution
+ over a unit volume. A distribution which is more confined than this has less
+ entropy and will be negative. The rates and capacities will, however, always
+ be non-negative.
+ 9. A particular case of changing coordinates is the linear transformation
+   $y_j = \sum_i a_{ij} x_i.$
+ In this case the Jacobian is simply the determinant $|a_{ij}|^{-1}$ and
+   $H(y) = H(x) + \log |a_{ij}|.$
+ In the case of a rotation of coordinates (or any measure preserving
+ transformation) $J = 1$ and $H(y) = H(x)$.
+
+ 21. ENTROPY OF AN ENSEMBLE OF FUNCTIONS
+
+ Consider an ergodic ensemble of functions limited to a certain band of width W
+ cycles per second. Let
+   $p(x_1, \dots, x_n)$
+ be the density distribution function for amplitudes $x_1, \dots, x_n$ at n
+ successive sample points. We define the entropy of the ensemble per degree of
+ freedom by
+   $H' = -\lim_{n \to \infty} \frac{1}{n} \int \cdots \int
+      p(x_1, \dots, x_n) \log p(x_1, \dots, x_n)\, dx_1 \cdots dx_n.$
+ We may also define an entropy H per second by dividing, not by n, but by the
+ time T in seconds for n samples. Since n = 2TW, H = 2W H'.
+ With white thermal noise p is Gaussian and we have
+   $H' = \log \sqrt{2\pi e N},$
+   $H = W \log 2\pi e N.$
+ For a given average power N, white noise has the maximum possible entropy.
+ This follows from the maximizing properties of the Gaussian distribution noted
+ above.
+ The entropy for a continuous stochastic process has many properties analogous
+ to that for discrete processes. In the discrete case the entropy was related
+ to the logarithm of the probability of long sequences, and to the number of
+ reasonably probable sequences of long length. In the continuous case it is
+ related in a similar fashion to the logarithm of the probability density for a
+ long series of samples, and the volume of reasonably high probability in the
+ function space.
+ More precisely, if we assume $p(x_1, \dots, x_n)$ continuous in all the $x_i$
+ for all n, then for sufficiently large n
+   $\left| \frac{\log 1/p}{n} - H' \right| < \varepsilon$ 38
+ ===============================================================================
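+ The white-noise formulas just given are easy to evaluate; the Ruby lines below
+ (an editorial addition, not in the original) compute the entropy per degree of
+ freedom and per second for an assumed noise power and bandwidth, in natural
+ log units.
+   # H' = log sqrt(2*pi*e*N) per sample, H = 2W * H' = W log(2*pi*e*N) per second.
+   def white_noise_entropy_per_dof(n_power)
+     Math.log(Math.sqrt(2 * Math::PI * Math::E * n_power))
+   end
+   def white_noise_entropy_per_second(w, n_power)
+     w * Math.log(2 * Math::PI * Math::E * n_power)
+   end
+   puts white_noise_entropy_per_dof(1.0)            # ~ 1.419
+   puts white_noise_entropy_per_second(3000, 1.0)   # = 2 * 3000 * 1.419 ~ 8512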
+ for all choices of $x_1, \dots, x_n$ apart from a set whose total probability
+ is less than $\delta$, with $\delta$ and $\varepsilon$ arbitrarily small. This
+ follows from the ergodic property if we divide the space into a large number
+ of small cells.
+ The relation of H to volume can be stated as follows: Under the same
+ assumptions consider the n dimensional space corresponding to
+ $p(x_1, \dots, x_n)$. Let $V_n(q)$ be the smallest volume in this space which
+ includes in its interior a total probability q. Then
+   $\lim_{n \to \infty} \frac{\log V_n(q)}{n} = H'$
+ provided q does not equal 0 or 1.
+ These results show that for large n there is a rather well-defined volume (at
+ least in the logarithmic sense) of high probability, and that within this
+ volume the probability density is relatively uniform (again in the logarithmic
+ sense).
+ In the white noise case the distribution function is given by
+   $p(x_1, \dots, x_n) = \frac{1}{(2\pi N)^{n/2}}
+      \exp\Bigl(-\frac{1}{2N} \sum x_i^2\Bigr).$
+ Since this depends only on $\sum x_i^2$ the surfaces of equal probability
+ density are spheres and the entire distribution has spherical symmetry. The
+ region of high probability is a sphere of radius $\sqrt{nN}$. As
+ $n \to \infty$ the probability of being outside a sphere of radius
+ $\sqrt{n(N + \varepsilon)}$ approaches zero and $\frac{1}{n}$ times the
+ logarithm of the volume of the sphere approaches $\log \sqrt{2\pi e N}$.
+ In the continuous case it is convenient to work not with the entropy H of an
+ ensemble but with a derived quantity which we will call the entropy power.
+ This is defined as the power in a white noise limited to the same band as the
+ original ensemble and having the same entropy. In other words if H' is the
+ entropy of an ensemble its entropy power is
+   $N_1 = \frac{1}{2\pi e} \exp 2H'.$
+ In the geometrical picture this amounts to measuring the high probability
+ volume by the squared radius of a sphere having the same volume. Since white
+ noise has the maximum entropy for a given power, the entropy power of any
+ noise is less than or equal to its actual power.
+
+ 22. ENTROPY LOSS IN LINEAR FILTERS
+
+ Theorem 14: If an ensemble having an entropy $H_1$ per degree of freedom in
+ band W is passed through a filter with characteristic Y(f) the output ensemble
+ has an entropy
+   $H_2 = H_1 + \frac{1}{W} \int_W \log |Y(f)|^2\, df.$
+ The operation of the filter is essentially a linear transformation of
+ coordinates. If we think of the different frequency components as the original
+ coordinate system, the new frequency components are merely the old ones
+ multiplied by factors. The coordinate transformation matrix is thus
+ essentially diagonalized in terms of these coordinates. The Jacobian of the
+ transformation is (for n sine and n cosine components)
+   $J = \prod_{i=1}^{n} |Y(f_i)|^2$
+ where the $f_i$ are equally spaced through the band W. This becomes in the
+ limit
+   $\exp \frac{1}{W} \int_W \log |Y(f)|^2\, df.$
+ Since J is constant its average value is the same quantity and applying the
+ theorem on the change of entropy with a change of coordinates, the result
+ follows. We may also phrase it in terms of the entropy power. Thus if the
+ entropy power of the first ensemble is $N_1$ that of the second is
+   $N_2 = N_1 \exp \frac{1}{W} \int_W \log |Y(f)|^2\, df.$ 39
+ ===============================================================================
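+ The output entropy power of Theorem 14 is easy to evaluate numerically. The
+ Ruby sketch below (an editorial addition, not in the original) integrates
+ $\log |Y(f)|^2$ over the band for a sample gain characteristic; the triangular
+ gain chosen here is only an example.
+   # N2 = N1 * exp( (1/W) * Integral_0^W log|Y(f)|^2 df ), by a midpoint sum.
+   def output_entropy_power(n1, w, gain, steps = 10_000)
+     df = w.to_f / steps
+     integral = (0...steps).sum { |i| Math.log(gain.call((i + 0.5) * df)**2) * df }
+     n1 * Math.exp(integral / w)
+   end
+   w = 1.0
+   triangular = ->(f) { 1.0 - f / w }            # gain falling linearly to zero at W
+   puts output_entropy_power(1.0, w, triangular) # ~ 1/e^2 = 0.135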
+ TABLE I
+ GAIN                 ENTROPY POWER    ENTROPY POWER      IMPULSE RESPONSE
+                      FACTOR           GAIN IN DECIBELS
+ $1-\omega$           $1/e^2$          $-8.69$            $\sin^2(t/2) \big/ (\pi t^2/2)$
+ $1-\omega^2$         $(2/e)^4$        $-5.33$            $2(\sin t - t\cos t) \big/ \pi t^3$
+ $1-\omega^3$         $0.411$          $-3.87$            $6\bigl[(\cos t - 1)/t^4 - \cos t/2t^2 + \sin t/t^3\bigr]$
+ $\sqrt{1-\omega^2}$  $(2/e)^2$        $-2.67$            $(\pi/2)\, J_1(t)/t$
+ ...                  $1/e^2$          $-8.69$            ...
+ (Each gain characteristic is given over $0 \le \omega \le 1$.)
+ The final entropy power is the initial entropy power multiplied by the
+ geometric mean gain of the filter. If the gain is measured in db, then the
+ output entropy power will be increased by the arithmetic mean db gain over W.
+ In Table I the entropy power loss has been calculated (and also expressed in
+ db) for a number of ideal gain characteristics. The impulsive responses of
+ these filters are also given for $W = 2\pi$, with phase assumed to be 0.
+ The entropy loss for many other cases can be obtained from these results. For
+ example the entropy power factor $1/e^2$ for the first case also applies to
+ any gain characteristic obtained from $1-\omega$ by a measure preserving
+ transformation of the $\omega$ axis. In particular a linearly increasing gain
+ $G(\omega) = \omega$, or a "saw tooth" characteristic between 0 and 1 have the
+ same entropy loss. The reciprocal gain has the reciprocal factor. Thus
+ $1/\omega$ has the factor $e^2$. Raising the gain to any power raises the
+ factor to this power.
+
+ 23. ENTROPY OF A SUM OF TWO ENSEMBLES
+
+ If we have two ensembles of functions $f_\alpha(t)$ and $g_\beta(t)$ we can
+ form a new ensemble by "addition." Suppose the first ensemble has the
+ probability density function $p(x_1, \dots, x_n)$ and the second
+ $q(x_1, \dots, x_n)$. Then the 40
+ ===============================================================================
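+ The entropy power factors in Table I can be recomputed directly from the
+ geometric-mean-gain formula; the Ruby check below is an editorial addition,
+ not part of the original, and uses a simple midpoint integration over the
+ unit band.
+   # factor = exp( 2 * Integral_0^1 log G(w) dw )
+   def entropy_power_factor(gain, steps = 100_000)
+     dw = 1.0 / steps
+     integral = (0...steps).sum { |i| Math.log(gain.call((i + 0.5) * dw)) * dw }
+     Math.exp(2 * integral)
+   end
+   puts entropy_power_factor(->(w) { 1 - w })                 # ~ 0.135 = 1/e^2
+   puts entropy_power_factor(->(w) { 1 - w**2 })              # ~ 0.293 = (2/e)^4
+   puts entropy_power_factor(->(w) { 1 - w**3 })              # ~ 0.411
+   puts entropy_power_factor(->(w) { Math.sqrt(1 - w**2) })   # ~ 0.541 = (2/e)^2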
+ density function for the sum is given by the convolution:
+   $r(x_1, \dots, x_n) = \int \cdots \int p(y_1, \dots, y_n)\,
+      q(x_1 - y_1, \dots, x_n - y_n)\, dy_1 \cdots dy_n.$
+ Physically this corresponds to adding the noises or signals represented by the
+ original ensembles of functions.
+ The following result is derived in Appendix 6.
+ Theorem 15: Let the average power of two ensembles be $N_1$ and $N_2$ and let
+ their entropy powers be $\bar{N}_1$ and $\bar{N}_2$. Then the entropy power of
+ the sum, $\bar{N}_3$, is bounded by
+   $\bar{N}_1 + \bar{N}_2 \le \bar{N}_3 \le N_1 + N_2.$
+ White Gaussian noise has the peculiar property that it can absorb any other
+ noise or signal ensemble which may be added to it with a resultant entropy
+ power approximately equal to the sum of the white noise power and the signal
+ power (measured from the average signal value, which is normally zero),
+ provided the signal power is small, in a certain sense, compared to the noise.
+ Consider the function space associated with these ensembles having n
+ dimensions. The white noise corresponds to the spherical Gaussian distribution
+ in this space. The signal ensemble corresponds to another probability
+ distribution, not necessarily Gaussian or spherical. Let the second moments of
+ this distribution about its center of gravity be $a_{ij}$. That is, if
+ $p(x_1, \dots, x_n)$ is the density distribution function
+   $a_{ij} = \int \cdots \int p\, (x_i - \alpha_i)(x_j - \alpha_j)\,
+      dx_1 \cdots dx_n$
+ where the $\alpha_i$ are the coordinates of the center of gravity. Now
+ $a_{ij}$ is a positive definite quadratic form, and we can rotate our
+ coordinate system to align it with the principal directions of this form.
+ $a_{ij}$ is then reduced to diagonal form $b_{ii}$. We require that each
+ $b_{ii}$ be small compared to N, the squared radius of the spherical
+ distribution.
+ In this case the convolution of the noise and signal produce approximately a
+ Gaussian distribution whose corresponding quadratic form is
+   $N + b_{ii}.$
+ The entropy power of this distribution is
+   $\Bigl[\prod (N + b_{ii})\Bigr]^{1/n}$
+ or approximately
+   $= \Bigl[(N)^n + \sum b_{ii} (N)^{n-1}\Bigr]^{1/n}
+      \doteq N + \frac{1}{n} \sum b_{ii}.$
+ The last term is the signal power, while the first is the noise power.
+
+ PART IV: THE CONTINUOUS CHANNEL
+
+ 24. THE CAPACITY OF A CONTINUOUS CHANNEL
+
+ In a continuous channel the input or transmitted signals will be continuous
+ functions of time f(t) belonging to a certain set, and the output or received
+ signals will be perturbed versions of these. We will consider only the case
+ where both transmitted and received signals are limited to a certain band W.
+ They can then be specified, for a time T, by 2TW numbers, and their
+ statistical structure by finite dimensional distribution functions. Thus the
+ statistics of the transmitted signal will be determined by
+   $P(x_1, \dots, x_n) = P(x)$ 41
+ ===============================================================================
+ and those of the noise by the conditional probability distribution
+   $P_{x_1, \dots, x_n}(y_1, \dots, y_n) = P_x(y).$
+ The rate of transmission of information for a continuous channel is defined in
+ a way analogous to that for a discrete channel, namely
+   $R = H(x) - H_y(x)$
+ where H(x) is the entropy of the input and $H_y(x)$ the equivocation. The
+ channel capacity C is defined as the maximum of R when we vary the input over
+ all possible ensembles. This means that in a finite dimensional approximation
+ we must vary $P(x) = P(x_1, \dots, x_n)$ and maximize
+   $-\int P(x) \log P(x)\, dx
+      + \iint P(x, y) \log \frac{P(x, y)}{P(y)}\, dx\, dy.$
+ This can be written
+   $\iint P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\, dx\, dy$
+ using the fact that $\iint P(x, y) \log P(x)\, dx\, dy
+ = \int P(x) \log P(x)\, dx$. The channel capacity is thus expressed as
+ follows:
+   $C = \lim_{T \to \infty} \max_{P(x)} \frac{1}{T}
+      \iint P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\, dx\, dy.$
+ It is obvious in this form that R and C are independent of the coordinate
+ system since the numerator and denominator in
+ $\log \frac{P(x, y)}{P(x) P(y)}$ will be multiplied by the same factors when x
+ and y are transformed in any one-to-one way. This integral expression for C is
+ more general than $H(x) - H_y(x)$. Properly interpreted (see Appendix 7) it
+ will always exist while $H(x) - H_y(x)$ may assume an indeterminate form in
+ some cases. This occurs, for example, if x is limited to a surface of fewer
+ dimensions than n in its n dimensional approximation.
+ If the logarithmic base used in computing H(x) and $H_y(x)$ is two then C is
+ the maximum number of binary digits that can be sent per second over the
+ channel with arbitrarily small equivocation, just as in the discrete case.
+ This can be seen physically by dividing the space of signals into a large
+ number of small cells, sufficiently small so that the probability density
+ $P_x(y)$ of signal x being perturbed to point y is substantially constant over
+ a cell (either of x or y). If the cells are considered as distinct points the
+ situation is essentially the same as a discrete channel and the proofs used
+ there will apply. But it is clear physically that this quantizing of the
+ volume into individual points cannot in any practical situation alter the
+ final answer significantly, provided the regions are sufficiently small. Thus
+ the capacity will be the limit of the capacities for the discrete subdivisions
+ and this is just the continuous capacity defined above.
+ On the mathematical side it can be shown first (see Appendix 7) that if u is
+ the message, x is the signal, y is the received signal (perturbed by noise)
+ and v is the recovered message then
+   $H(x) - H_y(x) \ge H(u) - H_v(u)$
+ regardless of what operations are performed on u to obtain x or on y to obtain
+ v. Thus no matter how we encode the binary digits to obtain the signal, or how
+ we decode the received signal to recover the message, the discrete rate for
+ the binary digits does not exceed the channel capacity we have defined. On the
+ other hand, it is possible under very general conditions to find a coding
+ system for transmitting binary digits at the rate C with as small an
+ equivocation or frequency of errors as desired. This is true, for example, if,
+ when we take a finite dimensional approximating space for the signal
+ functions, P(x, y) is continuous in both x and y except at a set of points of
+ probability zero.
+ An important special case occurs when the noise is added to the signal and is
+ independent of it (in the probability sense). Then $P_x(y)$ is a function only
+ of the difference $n = (y - x)$,
+   $P_x(y) = Q(y - x)$ 42
+ ===============================================================================
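+ In the discretized picture described above, the capacity integrand is just the
+ mutual information sum over cells. The Ruby sketch below (an editorial
+ addition, not in the original) evaluates that sum for an assumed two-by-two
+ joint distribution, here a binary symmetric channel with equiprobable input.
+   # Rate sum_{x,y} P(x,y) log2( P(x,y) / (P(x)P(y)) ), in bits per symbol.
+   def rate(joint)
+     px = joint.map { |row| row.sum }
+     py = joint.transpose.map { |col| col.sum }
+     joint.each_with_index.sum do |row, i|
+       row.each_with_index.sum do |p, j|
+         p > 0 ? p * Math.log2(p / (px[i] * py[j])) : 0.0
+       end
+     end
+   end
+   joint = [[0.45, 0.05],
+            [0.05, 0.45]]
+   puts rate(joint)   # ~ 0.531 bits per symbol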
+ and we can assign a definite entropy to the noise (independent of the
+ statistics of the signal), namely the entropy of the distribution Q(n). This
+ entropy will be denoted by H(n).
+ Theorem 16: If the signal and noise are independent and the received signal is
+ the sum of the transmitted signal and the noise then the rate of transmission
+ is
+   $R = H(y) - H(n),$
+ i.e., the entropy of the received signal less the entropy of the noise. The
+ channel capacity is
+   $C = \max_{P(x)} \bigl[H(y) - H(n)\bigr].$
+ We have, since y = x + n:
+   $H(x, y) = H(x, n).$
+ Expanding the left side and using the fact that x and n are independent
+   $H(y) + H_y(x) = H(x) + H(n).$
+ Hence
+   $R = H(x) - H_y(x) = H(y) - H(n).$
+ Since H(n) is independent of P(x), maximizing R requires maximizing H(y), the
+ entropy of the received signal. If there are certain constraints on the
+ ensemble of transmitted signals, the entropy of the received signal must be
+ maximized subject to these constraints.
+
+ 25. CHANNEL CAPACITY WITH AN AVERAGE POWER LIMITATION
+
+ A simple application of Theorem 16 is the case when the noise is a white
+ thermal noise and the transmitted signals are limited to a certain average
+ power P. Then the received signals have an average power P + N where N is the
+ average noise power. The maximum entropy for the received signals occurs when
+ they also form a white noise ensemble since this is the greatest possible
+ entropy for a power P + N and can be obtained by a suitable choice of
+ transmitted signals, namely if they form a white noise ensemble of power P.
+ The entropy (per second) of the received ensemble is then
+   $H(y) = W \log 2\pi e (P + N),$
+ and the noise entropy is
+   $H(n) = W \log 2\pi e N.$
+ The channel capacity is
+   $C = H(y) - H(n) = W \log \frac{P + N}{N}.$
+ Summarizing we have the following:
+ Theorem 17: The capacity of a channel of band W perturbed by white thermal
+ noise power N when the average transmitter power is limited to P is given by
+   $C = W \log \frac{P + N}{N}.$
+ This means that by sufficiently involved encoding systems we can transmit
+ binary digits at the rate $W \log_2 \frac{P + N}{N}$ bits per second, with
+ arbitrarily small frequency of errors. It is not possible to transmit at a
+ higher rate by any encoding system without a definite positive frequency of
+ errors.
+ To approximate this limiting rate of transmission the transmitted signals must
+ approximate, in statistical properties, a white noise.6 A system which
+ approaches the ideal rate may be described as follows: Let
+ 6 This and other properties of the white noise case are discussed from the
+ geometrical point of view in "Communication in the Presence of Noise,"
+ loc. cit. 43
+ ===============================================================================
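+ Theorem 17 is the familiar capacity formula; as a quick numeric illustration
+ (an editorial addition, not in the original), the Ruby lines below evaluate it
+ for an arbitrary bandwidth and signal-to-noise ratio.
+   # C = W * log2((P + N) / N) bits per second.
+   def capacity_bits_per_second(w, p, n)
+     w * Math.log2((p + n) / n)
+   end
+   puts capacity_bits_per_second(3000.0, 10.0, 1.0)   # ~ 10,378 bits per second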
+ $M = 2^s$ samples of white noise be constructed each of duration T. These are
+ assigned binary numbers from 0 to $M - 1$. At the transmitter the message
+ sequences are broken up into groups of s and for each group the corresponding
+ noise sample is transmitted as the signal. At the receiver the M samples are
+ known and the actual received signal (perturbed by noise) is compared with
+ each of them. The sample which has the least R.M.S. discrepancy from the
+ received signal is chosen as the transmitted signal and the corresponding
+ binary number reconstructed. This process amounts to choosing the most
+ probable (a posteriori) signal. The number M of noise samples used will depend
+ on the tolerable frequency of errors, but for almost all selections of samples
+ we have
+   $\lim_{\varepsilon \to 0} \lim_{T \to \infty}
+      \frac{\log M(\varepsilon, T)}{T} = W \log \frac{P + N}{N},$
+ so that no matter how small $\varepsilon$ is chosen, we can, by taking T
+ sufficiently large, transmit as near as we wish to
+ $TW \log \frac{P + N}{N}$ binary digits in the time T.
+ Formulas similar to $C = W \log \frac{P + N}{N}$ for the white noise case have
+ been developed independently by several other writers, although with somewhat
+ different interpretations. We may mention the work of N. Wiener,7
+ W. G. Tuller,8 and H. Sullivan in this connection.
+ In the case of an arbitrary perturbing noise (not necessarily white thermal
+ noise) it does not appear that the maximizing problem involved in determining
+ the channel capacity C can be solved explicitly. However, upper and lower
+ bounds can be set for C in terms of the average noise power N and the noise
+ entropy power $N_1$. These bounds are sufficiently close together in most
+ practical cases to furnish a satisfactory solution to the problem.
+ Theorem 18: The capacity of a channel of band W perturbed by an arbitrary
+ noise is bounded by the inequalities
+   $W \log \frac{P + N_1}{N_1} \le C \le W \log \frac{P + N}{N_1}$
+ where
+   P = average transmitter power
+   N = average noise power
+   $N_1$ = entropy power of the noise.
+ Here again the average power of the perturbed signals will be P + N. The
+ maximum entropy for this power would occur if the received signal were white
+ noise and would be $W \log 2\pi e (P + N)$. It may not be possible to achieve
+ this; i.e., there may not be any ensemble of transmitted signals which, added
+ to the perturbing noise, produce a white thermal noise at the receiver, but at
+ least this sets an upper bound to H(y). We have, therefore
+   $C = \max \bigl[H(y) - H(n)\bigr]
+      \le W \log 2\pi e (P + N) - W \log 2\pi e N_1.$
+ This is the upper limit given in the theorem. The lower limit can be obtained
+ by considering the rate if we make the transmitted signal a white noise, of
+ power P. In this case the entropy power of the received signal must be at
+ least as great as that of a white noise of power $P + N_1$ since we have shown
+ in a previous theorem that the entropy power of the sum of two ensembles is
+ greater than or equal to the sum of the individual entropy powers. Hence
+   $\max H(y) \ge W \log 2\pi e (P + N_1)$
+ 7 Cybernetics, loc. cit.
+ 8 "Theoretical Limitations on the Rate of Transmission of Information,"
+ Proceedings of the Institute of Radio Engineers, v. 37, No. 5, May, 1949,
+ pp. 468-78. 44
+ ===============================================================================
+ and
+   $C \ge W \log 2\pi e (P + N_1) - W \log 2\pi e N_1
+      = W \log \frac{P + N_1}{N_1}.$
+ As P increases, the upper and lower bounds approach each other, so we have as
+ an asymptotic rate
+   $W \log \frac{P + N}{N_1}.$
+ If the noise is itself white, $N = N_1$ and the result reduces to the formula
+ proved previously:
+   $C = W \log \Bigl(1 + \frac{P}{N}\Bigr).$
+ If the noise is Gaussian but with a spectrum which is not necessarily flat,
+ $N_1$ is the geometric mean of the noise power over the various frequencies in
+ the band W. Thus
+   $N_1 = \exp \frac{1}{W} \int_W \log N(f)\, df$
+ where N(f) is the noise power at frequency f.
+ Theorem 19: If we set the capacity for a given transmitter power P equal to
+   $C = W \log \frac{P + N - \eta}{N_1}$
+ then $\eta$ is monotonic decreasing as P increases and approaches 0 as a
+ limit.
+ Suppose that for a given power $P_1$ the channel capacity is
+   $W \log \frac{P_1 + N - \eta_1}{N_1}.$
+ This means that the best signal distribution, say p(x), when added to the
+ noise distribution q(x), gives a received distribution r(y) whose entropy
+ power is $P_1 + N - \eta_1$. Let us increase the power to $P_1 + \Delta P$ by
+ adding a white noise of power $\Delta P$ to the signal. The entropy of the
+ received signal is now at least
+   $H(y) = W \log 2\pi e (P_1 + N - \eta_1 + \Delta P)$
+ by application of the theorem on the minimum entropy power of a sum. Hence,
+ since we can attain the H indicated, the entropy of the maximizing
+ distribution must be at least as great and $\eta$ must be monotonic
+ decreasing. To show that $\eta \to 0$ as $P \to \infty$ consider a signal
+ which is white noise with a large P. Whatever the perturbing noise, the
+ received signal will be approximately a white noise, if P is sufficiently
+ large, in the sense of having an entropy power approaching P + N.
+
+ 26. THE CHANNEL CAPACITY WITH A PEAK POWER LIMITATION
+
+ In some applications the transmitter is limited not by the average power
+ output but by the peak instantaneous power. The problem of calculating the
+ channel capacity is then that of maximizing (by variation of the ensemble of
+ transmitted symbols)
+   $H(y) - H(n)$
+ subject to the constraint that all the functions f(t) in the ensemble be less
+ than or equal to $\sqrt{S}$, say, for all t. A constraint of this type does
+ not work out as well mathematically as the average power limitation. The most
+ we have obtained for this case is a lower bound valid for all S/N, an
+ "asymptotic" upper bound (valid for large S/N) and an asymptotic value of C
+ for small S/N. 45
+ ===============================================================================
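+ The geometric-mean formula for the entropy power makes the Theorem 18 bounds
+ straightforward to compute. The Ruby sketch below is an editorial addition,
+ not part of the original; the rising noise spectrum, bandwidth, and powers are
+ arbitrary example values.
+   # N1 = exp( (1/W) * Integral_0^W log N(f) df ), then the two bounds on C.
+   def entropy_power(noise_spectrum, w, steps = 10_000)
+     df = w.to_f / steps
+     Math.exp((0...steps).sum { |i| Math.log(noise_spectrum.call((i + 0.5) * df)) * df } / w)
+   end
+   w  = 3000.0
+   nf = ->(f) { 1.0 + f / w }          # Gaussian noise whose power rises across the band
+   n  = 1.5                            # average noise power of this spectrum
+   n1 = entropy_power(nf, w)           # ~ 1.47, slightly below N
+   p  = 10.0
+   puts w * Math.log2((p + n1) / n1)   # lower bound on C, bits per second
+   puts w * Math.log2((p + n) / n1)    # upper bound on C, bits per second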
2155
+ Theorem 20:The channel capacity Cfor a band Wperturbed by white thermal noise
2156
+ of power Nis bounded by 2 S C Wlog ;
2157
+
2158
+ e3 N S where Sis the peak allowed transmitter power. For sufficiently large N 2
2159
+ S N + C Wlog e 1 + N S where is arbitrarily small. As 0 (and provided the band
2160
+ Wstarts at 0) ! N . S C Wlog 1 1 + ! : N S We wish to maximize the entropy of
2161
+ the received signal. If is large this will occur very nearly when N we maximize
2162
+ the entropy of the transmitted ensemble. The asymptotic upper bound is obtained
2163
+ by relaxing the conditions on the ensemble. Let us suppose that the power is
2164
+ limited to Snot at every instant of time, but only at the sample points. The
2165
+ maximum entropy ofthe transmitted ensemble under these weakened conditions is
2166
+ certainly greater than or equal to that under theoriginal conditions. This
2167
+ altered problem can be solved easily. The maximum entropy occurs if the
2168
+ different p p samples are independent and have a distribution function which is
2169
+ constant from Sto S. The entropy , + can be calculated as Wlog 4S: The received
2170
+ signal will then have an entropy less than Wlog 4S 2 eN1 + + S with 0 as and
2171
+ the channel capacity is obtained by subtracting the entropy of the white noise,
2172
+ ! ! N Wlog 2 eN: 2 S N + Wlog 4S 2 eN1 Wlog 2 eN Wlog e 1 + + , = + : N This is
2173
+ the desired upper bound to the channel capacity. To obtain a lower bound
2174
+ consider the same ensemble of functions. Let these functions be passed through
2175
+ an ideal filter with a triangular transfer characteristic. The gain is to be
2176
+ unity at frequency 0 and declinelinearly down to gain 0 at frequency W. We
2177
+ first show that the output functions of the filter have a peak sin 2 W t power
2178
+ limitation Sat all times (not just the sample points). First we note that a
2179
+ pulse going into 2 W t the filter produces 1 sin2 W t 2 W t2 in the output.
2180
+ This function is never negative. The input function (in the general case) can
2181
+ be thought of asthe sum of a series of shifted functions sin 2 W t a 2 W t p
2182
+ where a, the amplitude of the sample, is not greater than S. Hence the output
2183
+ is the sum of shifted functions of the non-negative form above with the same
2184
+ coefficients. These functions being non-negative, the greatest p positive value
2185
+ for any tis obtained when all the coefficients ahave their maximum positive
2186
+ values, i.e., S. p In this case the input function was a constant of amplitude
2187
+ Sand since the filter has unit gain for D.C., the output is the same. Hence the
2188
+ output ensemble has a peak power S. 46
2189
+ ===============================================================================
2190
+ The entropy of the output ensemble can be calculated from that of the input
2191
+ ensemble by using the theorem dealing with such a situation. The output entropy
2192
+ is equal to the input entropy plus the geometricalmean gain of the filter: Z W
2193
+ Z W W f2 log G2 d f log , d f 2W = = , : 0 0 W Hence the output entropy is 4S
2194
+ Wlog 4S 2W Wlog , = e2 and the channel capacity is greater than 2 S Wlog : e3 N
2195
+ S We now wish to show that, for small (peak signal power over average white
2196
+ noise power), the channel N capacity is approximately S C Wlog 1 = + : N . S S
2197
+ More precisely C Wlog 1 1 as 0. Since the average signal power Pis less than or
2198
+ equal + ! ! N N S to the peak S, it follows that for all N P S C Wlog 1 Wlog 1
2199
+ + + : N N S Therefore, if we can find an ensemble of functions such that they
2200
+ correspond to a rate nearly Wlog 1 + Nand are limited to band Wand peak Sthe
2201
+ result will be proved. Consider the ensemble of functions of the p p following
2202
+ type. A series of tsamples have the same value, either Sor S, then the next
2203
+ tsamples have + , p p the same value, etc. The value for a series is chosen at
2204
+ random, probability 1 for Sand 1 for S. If 2 + 2 , this ensemble be passed
2205
+ through a filter with triangular gain characteristic (unit gain at D.C.), the
2206
+ output ispeak limited to S. Furthermore the average power is nearly Sand can be
2207
+ made to approach this by taking t sufficiently large. The entropy of the sum of
2208
+ this and the thermal noise can be found by applying the theoremon the sum of a
2209
+ noise and a small signal. This theorem will apply if S p t N S is sufficiently
2210
+ small. This can be ensured by taking small enough (after tis chosen). The
2211
+ entropy power N will be S Nto as close an approximation as desired, and hence
2212
+ the rate of transmission as near as we wish + to S N Wlog + : N PART V: THE
2213
+ RATE FOR A CONTINUOUS SOURCE 27. FIDELITY EVALUATION FUNCTIONS In the case of a
2214
+ discrete source of information we were able to determine a definite rate of
2215
+ generatinginformation, namely the entropy of the underlying stochastic process.
2216
+ With a continuous source the situationis considerably more involved. In the
2217
+ first place a continuously variable quantity can assume an infinitenumber of
2218
+ values and requires, therefore, an infinite number of binary digits for exact
2219
+ specification. Thismeans that to transmit the output of a continuous source
2220
+ with exact recoveryat the receiving point requires, 47
2221
+ ===============================================================================
2222
+ in general, a channel of infinite capacity (in bits per second). Since,
2223
+ ordinarily, channels have a certainamount of noise, and therefore a finite
2224
+ capacity, exact transmission is impossible. This, however, evades the real
2225
+ issue. Practically, we are not interested in exact transmission when we have a
2226
+ continuous source, but only in transmission to within a certain tolerance. The
2227
+ question is, can weassign a definite rate to a continuous source when we
2228
+ require only a certain fidelity of recovery, measured ina suitable way. Of
2229
+ course, as the fidelity requirements are increased the rate will increase. It
2230
+ will be shownthat we can, in very general cases, define such a rate, having the
2231
+ property that it is possible, by properlyencoding the information, to transmit
2232
+ it over a channel whose capacity is equal to the rate in question, andsatisfy
2233
+ the fidelity requirements. A channel of smaller capacity is insufficient. It is
2234
+ first necessary to give a general mathematical formulation of the idea of
2235
+ fidelity of transmission. Consider the set of messages of a long duration, say
2236
+ Tseconds. The source is described by giving theprobability density, in the
2237
+ associated space, that the source will select the message in question P x. A
2238
+ given communication system is described (from the external point of view) by
2239
+ giving the conditional probabilityPx ythat if message xis produced by the
2240
+ source the recovered message at the receiving point will be y. The system as a
2241
+ whole (including source and transmission system) is described by the
2242
+ probability function P x y ;
2243
+
2244
+ of having message xand final output y. If this function is known, the complete
2245
+ characteristics of the systemfrom the point of view of fidelity are known. Any
2246
+ evaluation of fidelity must correspond mathematicallyto an operation applied to
2247
+ P x y. This operation must at least have the properties of a simple ordering of
2248
+ ;
2249
+
2250
+ systems;
2251
+
2252
+ i.e., it must be possible to say of two systems represented by P1 x yand P2 x
2253
+ ythat, according to ;
2254
+
2255
+ ;
2256
+
2257
+ our fidelity criterion, either (1) the first has higher fidelity, (2) the
2258
+ second has higher fidelity, or (3) they haveequal fidelity. This means that a
2259
+ criterion of fidelity can be represented by a numerically valued function: , v
2260
+ P x y ;
2261
+
2262
+ whose argument ranges over possible probability functions P x y. ;
2263
+
2264
+ , We will now show that under very general and reasonable assumptions the
2265
+ function v P x y can be ;
2266
+
2267
+ written in a seemingly much more specialized form, namely as an average of a
2268
+ function x yover the set ;
2269
+
2270
+ of possible values of xand y: Z Z , v P x y P x y x y dx dy ;
2271
+
2272
+ = ;
2273
+
2274
+ ;
2275
+
2276
+ : To obtain this we need only assume (1) that the source and system are ergodic
2277
+ so that a very long samplewill be, with probability nearly 1, typical of the
2278
+ ensemble, and (2) that the evaluation is "reasonable" in thesense that it is
2279
+ possible, by observing a typical input and output x1 and y1, to form a
2280
+ tentative evaluationon the basis of these samples;
2281
+
2282
+ and if these samples are increased in duration the tentative evaluation
2283
+ will,with probability 1, approach the exact evaluation based on a full
2284
+ knowledge of P x y. Let the tentative ;
2285
+
2286
+ evaluation be x y. Then the function x yapproaches (as T ) a constant for
2287
+ almost all x ywhich ;
2288
+
2289
+ ;
2290
+
2291
+ ! ;
2292
+
2293
+ are in the high probability region corresponding to the system: , x y v P x y ;
2294
+
2295
+ ! ;
2296
+
2297
+ and we may also write Z Z x y P x y x y dx dy ;
2298
+
2299
+ ! ;
2300
+
2301
+ ;
2302
+
2303
+ since Z Z P x y dx dy 1 ;
2304
+
2305
+ = : This establishes the desired result. The function x yhas the general nature
2306
+ of a "distance" between xand y.9 It measures how undesirable ;
2307
+
2308
+ it is (according to our fidelity criterion) to receive ywhen xis transmitted.
2309
+ The general result given abovecan be restated as follows: Any reasonable
2310
+ evaluation can be represented as an average of a distance functionover the set
2311
+ of messages and recovered messages xand yweighted according to the probability
2312
+ P x yof ;
2313
+
2314
+ getting the pair in question, provided the duration Tof the messages be taken
2315
+ sufficiently large. The following are simple examples of evaluation functions:
2316
+ 9It is not a "metric" in the strict sense, however, since in general it does
2317
+ not satisfy either x y y xor x y y z x z. ;
2318
+
2319
+ = ;
2320
+
2321
+ ;
2322
+
2323
+ + ;
2324
+
2325
+ ;
2326
+
2327
+ 48
2328
+ ===============================================================================
2329
+ 1. R.M.S. criterion. , 2 v x t y t = , : In this very commonly used measure of
2330
+ fidelity the distance function x yis (apart from a constant ;
2331
+
2332
+ factor) the square of the ordinary Euclidean distance between the points xand
2333
+ yin the associatedfunction space. 1 Z T 2 x y x t y t dt ;
2334
+
2335
+ = , : T0 2. Frequency weighted R.M.S. criterion. More generally one can apply
2336
+ different weights to the different frequency components before using an R.M.S.
2337
+ measure of fidelity. This is equivalent to passing thedifference x t y tthrough
2338
+ a shaping filter and then determining the average power in the output. , Thus
2339
+ let e t x t y t = , and Z f t e k t d = , , then 1 Z T x y f t2 dt ;
2340
+
2341
+ = : T0 3. Absolute error criterion. 1 Z T x y x t y t dt ;
2342
+
2343
+ = , : T0 4. The structure of the ear and brain determine implicitly an
2344
+ evaluation, or rather a number of evaluations, appropriate in the case of
2345
+ speech or music transmission. There is, for example, an
2346
+ "intelligibility"criterion in which x yis equal to the relative frequency of
2347
+ incorrectly interpreted words when ;
2348
+
2349
+ message x tis received as y t. Although we cannot give an explicit
2350
+ representation of x yin these ;
2351
+
2352
+ cases it could, in principle, be determined by sufficient experimentation. Some
2353
+ of its properties followfrom well-known experimental results in hearing, e.g.,
2354
+ the ear is relatively insensitive to phase and thesensitivity to amplitude and
2355
+ frequency is roughly logarithmic. 5. The discrete case can be considered as a
2356
+ specialization in which we have tacitly assumed an evaluation based on the
2357
+ frequency of errors. The function x yis then defined as the number of symbols
2358
+ in the ;
2359
+
2360
+ sequence ydiffering from the corresponding symbols in xdivided by the total
2361
+ number of symbols inx. 28. THE RATE FOR A SOURCE RELATIVE TO A FIDELITY
2362
+ EVALUATION We are now in a position to define a rate of generating information
2363
+ for a continuous source. We are givenP xfor the source and an evaluation
2364
+ vdetermined by a distance function x ywhich will be assumed ;
2365
+
2366
+ continuous in both xand y. With a particular system P x ythe quality is
2367
+ measured by ;
2368
+
2369
+ Z Z v x y P x y dx dy = ;
2370
+
2371
+ ;
2372
+
2373
+ : Furthermore the rate of flow of binary digits corresponding to P x yis ;
2374
+
2375
+ Z Z P x y R P x ylog ;
2376
+
2377
+ dx dy = ;
2378
+
2379
+ : P x P y We define the rate R1 of generating information for a given quality
2380
+ v1 of reproduction to be the minimum ofRwhen we keep vfixed at v1 and vary Px
2381
+ y. That is: Z Z P x y R ;
2382
+
2383
+ 1 Min P x ylog dx dy = ;
2384
+
2385
+ Px y P x P y 49
2386
+ ===============================================================================
2387
+ subject to the constraint: Z Z v1 P x y x y dx dy = ;
2388
+
2389
+ ;
2390
+
2391
+ : This means that we consider, in effect, all the communication systems that
2392
+ might be used and that transmit with the required fidelity. The rate of
2393
+ transmission in bits per second is calculated for each oneand we choose that
2394
+ having the least rate. This latter rate is the rate we assign the source for
2395
+ the fidelity inquestion. The justification of this definition lies in the
2396
+ following result: Theorem 21:If a source has a rate R1 for a valuation v1 it is
2397
+ possible to encode the output of the source and transmit it over a channel of
2398
+ capacity Cwith fidelity as near v1 as desired provided R1 C. This is not
2399
+ possible if R1 C. The last statement in the theorem follows immediately from
2400
+ the definition of R1 and previous results. If it were not true we could
2401
+ transmit more than Cbits per second over a channel of capacity C. The first
2402
+ partof the theorem is proved by a method analogous to that used for Theorem 11.
2403
+ We may, in the first place,divide the x yspace into a large number of small
2404
+ cells and represent the situation as a discrete case. This ;
2405
+
2406
+ will not change the evaluation function by more than an arbitrarily small
2407
+ amount (when the cells are verysmall) because of the continuity assumed for x
2408
+ y. Suppose that P1 x yis the particular system which ;
2409
+
2410
+ ;
2411
+
2412
+ minimizes the rate and gives R1. We choose from the high probability y's a set
2413
+ at random containing 2 R T 1+ members where 0 as T . With large Teach chosen
2414
+ point will be connected by a high probability ! ! line (as in Fig. 10) to a set
2415
+ of x's. A calculation similar to that used in proving Theorem 11 shows that
2416
+ withlarge Talmost all x's are covered by the fans from the chosen ypoints for
2417
+ almost all choices of the y's. Thecommunication system to be used operates as
2418
+ follows: The selected points are assigned binary numbers.When a message xis
2419
+ originated it will (with probability approaching 1 as T ) lie within at least
2420
+ one ! of the fans. The corresponding binary number is transmitted (or one of
2421
+ them chosen arbitrarily if there areseveral) over the channel by suitable
2422
+ coding means to give a small probability of error. Since R1 Cthis is possible.
2423
+ At the receiving point the corresponding yis reconstructed and used as the
2424
+ recovered message. The evaluation v0 for this system can be made arbitrarily
2425
+ close to v 1 1 by taking Tsufficiently large. This is due to the fact that for
2426
+ each long sample of message x tand recovered message y tthe evaluation
2427
+ approaches v1 (with probability 1). It is interesting to note that, in this
2428
+ system, the noise in the recovered message is actually produced by a kind of
2429
+ general quantizing at the transmitter and not produced by the noise in the
2430
+ channel. It is more or lessanalogous to the quantizing noise in PCM. 29. THE
2431
+ CALCULATION OF RATES The definition of the rate is similar in many respects to
2432
+ the definition of channel capacity. In the former Z Z P x y R Min P x ylog ;
2433
+
2434
+ dx dy = ;
2435
+
2436
+ Px y P x P y Z Z with P xand v1 P x y x y dx dyfixed. In the latter = ;
2437
+
2438
+ ;
2439
+
2440
+ Z Z P x y ;
2441
+
2442
+ C Max P x ylog dx dy = ;
2443
+
2444
+ P x P x P y with Px yfixed and possibly one or more other constraints (e.g., an
2445
+ average power limitation) of the form R R K P x y x y dx dy. = ;
2446
+
2447
+ ;
2448
+
2449
+ A partial solution of the general maximizing problem for determining the rate
2450
+ of a source can be given. Using Lagrange's method we consider Z Z P x y ;
2451
+
2452
+ P x ylog P x y x y x P x y dx dy ;
2453
+
2454
+ + ;
2455
+
2456
+ ;
2457
+
2458
+ + ;
2459
+
2460
+ : P x P y 50
2461
+ ===============================================================================
2462
+ The variational equation (when we take the first variation on P x y) leads to ;
2463
+
2464
+ P x y ;
2465
+
2466
+ y x B x e, = where is determined to give the required fidelity and B xis chosen
2467
+ to satisfy Z B x e x y , ;
2468
+
2469
+ dx 1 = : This shows that, with best encoding, the conditional probability of a
2470
+ certain cause for various received y, Py xwill decline exponentially with the
2471
+ distance function x ybetween the xand yin question. ;
2472
+
2473
+ In the special case where the distance function x ydepends only on the (vector)
2474
+ difference between x ;
2475
+
2476
+ and y, x y x y ;
2477
+
2478
+ = , we have Z B x e x y , , dx 1 = : Hence B xis constant, say , and P x y , y
2479
+ x e, = : Unfortunately these formal solutions are difficult to evaluate in
2480
+ particular cases and seem to be of little value.In fact, the actual calculation
2481
+ of rates has been carried out in only a few very simple cases. If the distance
+ of rates has been carried out in only a few very simple cases.
+ If the distance function $\rho(x, y)$ is the mean square discrepancy between x
+ and y and the message ensemble is white noise, the rate can be determined. In
+ that case we have
+   $R = \min \bigl[H(x) - H_y(x)\bigr] = H(x) - \max H_y(x)$
+ with $N = \overline{(x - y)^2}$. But the $\max H_y(x)$ occurs when $y - x$ is
+ a white noise, and is equal to $W_1 \log 2\pi e N$ where $W_1$ is the
+ bandwidth of the message ensemble. Therefore
+   $R = W_1 \log 2\pi e Q - W_1 \log 2\pi e N = W_1 \log \frac{Q}{N}$
+ where Q is the average message power. This proves the following:
+ Theorem 22: The rate for a white noise source of power Q and band $W_1$
+ relative to an R.M.S. measure of fidelity is
+   $R = W_1 \log \frac{Q}{N}$
+ where N is the allowed mean square error between original and recovered
+ messages.
+ More generally with any message source we can obtain inequalities bounding the
+ rate relative to a mean square error criterion.
+ Theorem 23: The rate for any source of band $W_1$ is bounded by
+   $W_1 \log \frac{Q_1}{N} \le R \le W_1 \log \frac{Q}{N}$
+ where Q is the average power of the source, $Q_1$ its entropy power and N the
+ allowed mean square error.
+ The lower bound follows from the fact that the $\max H_y(x)$ for a given
+ $\overline{(x - y)^2} = N$ occurs in the white noise case. The upper bound
+ results if we place points (used in the proof of Theorem 21) not in the best
+ way but at random in a sphere of radius $\sqrt{Q - N}$. 51
+ ===============================================================================
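+ As a closing numeric illustration (an editorial addition, not in Shannon's
+ text), the Ruby lines below evaluate the Theorem 22 rate and the Theorem 23
+ bounds in bits per second; the bandwidth, powers, and allowed error are
+ arbitrary, and the entropy power of the hypothetical non-white source is
+ simply assumed.
+   # R = W1 * log2(Q / N) for a white noise source; Theorem 23 brackets any source.
+   def rate_white_noise_source(w1, q, n)
+     w1 * Math.log2(q / n)
+   end
+   w1 = 3000.0     # source bandwidth, cycles per second
+   q  = 4.0        # average source power
+   n  = 0.25       # allowed mean square error
+   puts rate_white_noise_source(w1, q, n)   # 12,000 bits per second
+   q1 = 3.2        # assumed entropy power of a non-white source, q1 <= q
+   puts w1 * Math.log2(q1 / n)              # Theorem 23 lower bound
+   puts w1 * Math.log2(q / n)               # Theorem 23 upper bound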
+ ACKNOWLEDGMENTS
+
+ The writer is indebted to his colleagues at the Laboratories, particularly to
+ Dr. H. W. Bode, Dr. J. R. Pierce, Dr. B. McMillan, and Dr. B. M. Oliver for
+ many helpful suggestions and criticisms during the course of this work. Credit
+ should also be given to Professor N. Wiener, whose elegant solution of the
+ problems of filtering and prediction of stationary ensembles has considerably
+ influenced the writer's thinking in this field.
+
+ APPENDIX 5
+
+ Let S1
+ be any measurable subset of the gensemble, and S2 the subset of the fensemble
2508
+ which gives S1under the operation T. Then S1 T S2 = : Let H be the operator
2509
+ which shifts all functions in a set by the time . Then HS1 HT S2 T HS2 = =
2510
+ since Tis invariant and therefore commutes with H. Hence if m Sis the
2511
+ probability measure of the set S m HS1 m T HS2 m HS2 = = m S2 m S1 = = where
2512
+ the second equality is by definition of measure in the gspace, the third since
2513
+ the fensemble isstationary, and the last by definition of gmeasure again. To
2514
+ prove that the ergodic property is preserved under invariant operations, let S1
2515
+ be a subset of the g ensemble which is invariant under H, and let S2 be the set
2516
+ of all functions fwhich transform into S1. Then HS1 HT S2 T HS2 S1 = = = so
2517
+ that HS2 is included in S2 for all . Now, since m HS2 m S1 = this implies HS2
2518
+ S2 = for all with m S2 0 1. This contradiction shows that S1 does not exist. 6=
2519
+ ;
2520
+
2521
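+ For instance (the kernel g and the sign convention for the shift are chosen
+ here only for illustration), let T be a fixed linear filter with an absolutely
+ integrable kernel g, and write (H^\lambda f)(t) = f(t + \lambda). Then
+
+     (T f)(t) = \int f(t - \tau) \, g(\tau) \, d\tau
+
+ and shifting the input merely shifts the output,
+
+     (T H^\lambda f)(t) = \int f(t + \lambda - \tau) \, g(\tau) \, d\tau
+                        = (H^\lambda T f)(t),
+
+ which is the commutation with H^\lambda used above.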
+ APPENDIX 6
+
+ The upper bound, \bar{N}_3 \le N_1 + N_2, is due to the fact that the maximum
+ possible entropy for a power N_1 + N_2 occurs when we have a white noise of
+ this power. In this case the entropy power is N_1 + N_2.
+
+ To obtain the lower bound, suppose we have two distributions in n dimensions
+ p(x_i) and q(x_i) with entropy powers \bar{N}_1 and \bar{N}_2. What form
+ should p and q have to minimize the entropy power \bar{N}_3 of their
+ convolution r(x_i):
+
+     r(x_i) = \int p(y_i) \, q(x_i - y_i) \, dy_i.
+
+ The entropy H_3 of r is given by
+
+     H_3 = - \int r(x_i) \log r(x_i) \, dx_i.
+
+ We wish to minimize this subject to the constraints
+
+     H_1 = - \int p(x_i) \log p(x_i) \, dx_i
+     H_2 = - \int q(x_i) \log q(x_i) \, dx_i.
+
+ We consider then
+
+     U = - \int [ r(x) \log r(x) + \lambda p(x) \log p(x)
+                  + \mu q(x) \log q(x) ] \, dx
+
+     \delta U = - \int [ (1 + \log r(x)) \, \delta r(x)
+                  + \lambda (1 + \log p(x)) \, \delta p(x)
+                  + \mu (1 + \log q(x)) \, \delta q(x) ] \, dx.
+
+ If p(x) is varied at a particular argument x_i = s_i, the variation in r(x) is
+
+     \delta r(x) = q(x_i - s_i)
+
+ and
+
+     \delta U = - \int q(x_i - s_i) \log r(x_i) \, dx_i - \lambda \log p(s_i) = 0
+
+ and similarly when q is varied. Hence the conditions for a minimum are
+
+     \int q(x_i - s_i) \log r(x_i) \, dx_i = - \lambda \log p(s_i)
+     \int p(x_i - s_i) \log r(x_i) \, dx_i = - \mu \log q(s_i).
+
+ If we multiply the first by p(s_i) and the second by q(s_i) and integrate with
+ respect to s_i we obtain
+
+     H_3 = - \lambda H_1
+     H_3 = - \mu H_2
+
+ or solving for \lambda and \mu and replacing in the equations
+
+     H_1 \int q(x_i - s_i) \log r(x_i) \, dx_i = H_3 \log p(s_i)
+     H_2 \int p(x_i - s_i) \log r(x_i) \, dx_i = H_3 \log q(s_i).
+
+ Now suppose p(x_i) and q(x_i) are normal
+
+     p(x_i) = \frac{|A_{ij}|^{1/2}}{(2\pi)^{n/2}}
+              \exp\Big( -\tfrac{1}{2} \sum A_{ij} x_i x_j \Big)
+     q(x_i) = \frac{|B_{ij}|^{1/2}}{(2\pi)^{n/2}}
+              \exp\Big( -\tfrac{1}{2} \sum B_{ij} x_i x_j \Big).
+
+ Then r(x_i) will also be normal with quadratic form C_{ij}. If the inverses of
+ these forms are a_{ij}, b_{ij}, c_{ij} then
+
+     c_{ij} = a_{ij} + b_{ij}.
+
+ We wish to show that these functions satisfy the minimizing conditions if and
+ only if a_{ij} = K b_{ij} and thus give the minimum H_3 under the constraints.
+ First we have
+
+     \log r(x_i) = \tfrac{1}{2} \log \frac{|C_{ij}|}{(2\pi)^n}
+                   - \tfrac{1}{2} \sum C_{ij} x_i x_j
+
+     \int q(x_i - s_i) \log r(x_i) \, dx_i
+         = \tfrac{1}{2} \log \frac{|C_{ij}|}{(2\pi)^n}
+           - \tfrac{1}{2} \sum C_{ij} s_i s_j - \tfrac{1}{2} \sum C_{ij} b_{ij}.
+
+ This should equal
+
+     \frac{H_3}{H_1} \Big[ \tfrac{1}{2} \log \frac{|A_{ij}|}{(2\pi)^n}
+                           - \tfrac{1}{2} \sum A_{ij} s_i s_j \Big]
+
+ which requires A_{ij} = \frac{H_1}{H_3} C_{ij}. In this case
+ A_{ij} = \frac{H_1}{H_2} B_{ij} and both equations reduce to identities.
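+ A concrete check of the two bounds (the distributions here are chosen only for
+ illustration): let p and q both be uniform over an interval of length 1
+ centred at the origin. Each has average power N_1 = N_2 = 1/12 and, since its
+ entropy is \log 1 = 0, entropy power \bar{N}_1 = \bar{N}_2 = 1/(2\pi e)
+ \approx 0.059. The convolution r is the triangular distribution on (-1, 1),
+ whose entropy is 1/2 nat, so
+
+     \bar{N}_3 = \frac{e^{2 \cdot \frac{1}{2}}}{2\pi e} = \frac{1}{2\pi}
+               \approx 0.159,
+
+ which lies, as it must, between \bar{N}_1 + \bar{N}_2 \approx 0.117 and
+ N_1 + N_2 = 1/6 \approx 0.167.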
+ ===============================================================================
+ APPENDIX 7
+
+ The following will indicate a more general and more rigorous approach to the
+ central definitions of communication theory. Consider a probability measure
+ space whose elements are ordered pairs (x, y). The variables x, y are to be
+ identified as the possible transmitted and received signals of some long
+ duration T. Let us call the set of all points whose x belongs to a subset S_1
+ of x points the strip over S_1, and similarly the set whose y belong to S_2
+ the strip over S_2. We divide x and y into a collection of non-overlapping
+ measurable subsets X_i and Y_i and approximate the rate of transmission R by
+
+     R_1 = \frac{1}{T} \sum_i P(X_i, Y_i) \log \frac{P(X_i, Y_i)}{P(X_i) P(Y_i)}
+
+ where
+     P(X_i) is the probability measure of the strip over X_i
+     P(Y_i) is the probability measure of the strip over Y_i
+     P(X_i, Y_i) is the probability measure of the intersection of the strips.
+
+ A further subdivision can never decrease R_1. For let X_1 be divided into
+ X_1 = X_1' + X_1'' and let
+
+     P(Y_1) = a           P(X_1) = b + c
+     P(X_1') = b          P(X_1', Y_1) = d
+     P(X_1'') = c         P(X_1'', Y_1) = e
+     P(X_1, Y_1) = d + e.
+
+ Then in the sum we have replaced (for the X_1, Y_1 intersection)
+
+     (d + e) \log \frac{d + e}{a(b + c)}
+     by
+     d \log \frac{d}{ab} + e \log \frac{e}{ac}.
+
+ It is easily shown that with the limitation we have on b, c, d, e,
+
+     \Big( \frac{d + e}{b + c} \Big)^{d + e} \le \frac{d^d e^e}{b^d c^e}
+
+ and consequently the sum is increased. Thus the various possible subdivisions
+ form a directed set, with R monotonic increasing with refinement of the
+ subdivision. We may define R unambiguously as the least upper bound for R_1
+ and write it
+
+     R = \frac{1}{T} \iint P(x,y) \log \frac{P(x,y)}{P(x) P(y)} \, dx \, dy.
+
+ This integral, understood in the above sense, includes both the continuous and
+ discrete cases and of course many others which cannot be represented in either
+ form. It is trivial in this formulation that if x and u are in one-to-one
+ correspondence, the rate from u to y is equal to that from x to y. If v is any
+ function of y (not necessarily with an inverse) then the rate from x to y is
+ greater than or equal to that from x to v since, in the calculation of the
+ approximations, the subdivisions of y are essentially a finer subdivision of
+ those for v. More generally if y and v are related not functionally but
+ statistically, i.e., we have a probability measure space (y, v), then
+ R(x, v) \le R(x, y). This means that any operation applied to the received
+ signal, even though it involves statistical elements, does not increase R.
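+ A small Ruby sketch of the approximation just described (the 4 x 2 joint
+ measure below and the helper name rate_sum are chosen only for illustration;
+ T is taken as 1 second): merging cells of the x partition, i.e. passing to a
+ coarser subdivision, can only lower the discretized sum R_1.
+
+   # R1 = (1/T) * sum_ij P(Xi,Yj) * log( P(Xi,Yj) / (P(Xi)*P(Yj)) )
+   # joint is a matrix: rows are the cells Xi, columns the cells Yj.
+   def rate_sum(joint, t = 1.0)
+     px = joint.map { |row| row.sum }                                   # strip measures P(Xi)
+     py = joint.first.each_index.map { |j| joint.sum { |row| row[j] } } # strip measures P(Yj)
+     r = 0.0
+     joint.each_with_index do |row, i|
+       row.each_with_index do |pxy, j|
+         r += pxy * Math.log(pxy / (px[i] * py[j])) if pxy > 0
+       end
+     end
+     r / t
+   end
+
+   # A fine partition of x (four cells) and the coarser partition obtained by
+   # merging cells 1+2 and 3+4 (adding the corresponding rows).
+   fine   = [[0.15, 0.05], [0.05, 0.15], [0.20, 0.10], [0.10, 0.20]]
+   coarse = [[0.20, 0.20], [0.30, 0.30]]
+
+   puts "R1 (fine partition)   = %.4f nats" % rate_sum(fine)    # ~0.0863
+   puts "R1 (coarse partition) = %.4f nats" % rate_sum(coarse)  # 0.0000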
+ Another notion which should be defined precisely in an abstract formulation of
+ the theory is that of "dimension rate," that is the average number of
+ dimensions required per second to specify a member of an ensemble. In the band
+ limited case 2W numbers per second are sufficient. A general definition can be
+ framed as follows. Let f_\alpha(t) be an ensemble of functions and let
+ \rho_T[f_\alpha(t), f_\beta(t)] be a metric measuring the "distance" from
+ f_\alpha to f_\beta over the time T (for example the R.M.S. discrepancy over
+ this interval). Let N(\varepsilon, \delta, T) be the least number of elements
+ f which can be chosen such that all elements of the ensemble apart from a set
+ of measure \delta are within the distance \varepsilon of at least one of those
+ chosen. Thus we are covering the space to within \varepsilon apart from a set
+ of small measure \delta. We define the dimension rate \lambda for the ensemble
+ by the triple limit
+
+     \lambda = \lim_{\delta \to 0} \lim_{\varepsilon \to 0} \lim_{T \to \infty}
+               \frac{\log N(\varepsilon, \delta, T)}{T \log \varepsilon}.
+
+ This is a generalization of the measure type definitions of dimension in
+ topology, and agrees with the intuitive dimension rate for simple ensembles
+ where the desired result is obvious.
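+ That this recovers the band limited figure quoted above can be seen by a rough
+ count (a heuristic only, with constants suppressed): functions of band W
+ observed for a time T are fixed by about 2WT sample values, so the covering
+ number behaves like N(\varepsilon, \delta, T) \approx (k/\varepsilon)^{2WT}
+ for some constant k, giving
+
+     \log N(\varepsilon, \delta, T) \approx 2WT \log (k/\varepsilon)
+
+ and the ratio in the definition therefore tends, in magnitude, to 2W
+ dimensions per second.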
+ ===============================================================================
+ ************ Document Outline ************
+
+ * A Mathematical Theory of Communication
+ * Introduction
+ * Part I: Discrete Noiseless Systems
+     o The Discrete Noiseless Channel
+     o The Discrete Source of Information
+     o The Series of Approximations to English
+     o Graphical Representations of a Markoff Process
+     o Ergodic and Mixed Sources
+     o Choice, Uncertainty and Entropy
+     o Representation of the Encoding and Decoding Operation
+     o The Fundamental Theorem of a Noiseless Channel
+     o Discussion and Examples
+ * Part II: The Discrete Channel with Noise
+     o Representation of a Noisy Discrete Channel
+     o The Fundamental Theorem for a Discrete Channel with Noise
+     o Discussion
+     o Example of a Discrete Channel and its Capacity
+     o The Channel Capacity in Certain Special Cases
+     o An Example of Efficient Coding
+     o A1. The Growth of the Number of Blocks of Symbols with a Finite State Condition
+     o A2. The Derivation of Entropy
+     o A3. Theorems on Ergodic Sources
+     o A4. Maximizing the Rate for a System of Constraints
+ * Part III: Mathematical Preliminaries
+     o Sets and Ensembles of Functions
+     o Band Limited Ensembles of Functions
+     o Entropy of a Continuous Distribution
+     o Entropy of an Ensemble of Functions
+     o Entropy Loss in Linear Filters
+     o Entropy of a Sum of Two Ensembles
+ * Part IV: The Continuous Channel
+     o The Capacity of a Continuous Channel
+     o Channel Capacity with an Average Power Limitation
+     o The Channel Capacity with a Peak Power Limitation
+ * Part V: The Rate for a Continuous Source
+     o Fidelity Evaluation Functions
+     o The Rate for a Source Relative to a Fidelity Evaluation
+     o The Calculation of Rates
+     o A5
+     o A6
+     o A7
+ ===============================================================================