rwdgutenberg 0.09 → 0.12

Files changed (131)
  1. data/Readme.txt +20 -7
  2. data/code/01rwdcore/01rwdcore.rb +3 -0
  3. data/code/01rwdcore/openhelpwindow.rb +1 -1
  4. data/code/01rwdcore/runopentinkerdocument.rb +1 -1
  5. data/code/01rwdcore/rwdtinkerversion.rb +1 -1
  6. data/code/superant.com.gutenberg/0uninstallapplet.rb +17 -11
  7. data/code/superant.com.gutenberg/changegutenbergname.rb +2 -2
  8. data/code/superant.com.gutenberg/clearbookscreendisplay.rb +5 -5
  9. data/code/superant.com.gutenberg/cleargutenbergfiles.rb +0 -0
  10. data/code/superant.com.gutenberg/cleargutrecordfiles.rb +0 -0
  11. data/code/superant.com.gutenberg/copyfilename.rb +2 -2
  12. data/code/superant.com.gutenberg/createnewnote.rb +13 -12
  13. data/code/superant.com.gutenberg/deletegutenbergrecord.rb +9 -8
  14. data/code/superant.com.gutenberg/gutenbergcreatefile.rb +22 -10
  15. data/code/superant.com.gutenberg/helptexthashload.rb +21 -0
  16. data/code/superant.com.gutenberg/launchurl.rb +13 -0
  17. data/code/superant.com.gutenberg/listdirectories.rb +32 -0
  18. data/code/superant.com.gutenberg/listnamerecord.rb +10 -10
  19. data/code/superant.com.gutenberg/listnotedirshtml3.rb +57 -0
  20. data/code/superant.com.gutenberg/listtextfilesgutenberg.rb +72 -47
  21. data/code/superant.com.gutenberg/loadbookrecord.rb +55 -17
  22. data/code/superant.com.gutenberg/loadconfigurationrecord.rb +4 -4
  23. data/code/superant.com.gutenberg/loadconfigurationvariables.rb +19 -9
  24. data/code/superant.com.gutenberg/loadhtmlnoterecord.rb +31 -0
  25. data/code/superant.com.gutenberg/openhelpwindowgutenberg.rb +8 -2
  26. data/code/superant.com.gutenberg/resetdir.rb +7 -0
  27. data/code/superant.com.gutenberg/runbackwindow.rb +16 -10
  28. data/code/superant.com.gutenberg/rungutenbergwindow.rb +89 -71
  29. data/code/superant.com.gutenberg/rwdgutenbergbackward.rb +27 -27
  30. data/code/superant.com.gutenberg/rwdtinkerversion.rb +10 -10
  31. data/code/superant.com.gutenberg/saveconfigurationrecord.rb +4 -4
  32. data/code/superant.com.gutenberg/savegutenbergrecord.rb +13 -11
  33. data/code/superant.com.gutenberg/updir.rb +7 -0
  34. data/code/superant.com.rwdtinkerbackwindow/initiateapplets.rb +110 -108
  35. data/code/superant.com.rwdtinkerbackwindow/installgemapplet.rb +10 -8
  36. data/code/superant.com.rwdtinkerbackwindow/listzips.rb +8 -2
  37. data/code/superant.com.rwdtinkerbackwindow/removeappletvariables.rb +6 -6
  38. data/code/superant.com.rwdtinkerbackwindow/viewappletcontents.rb +1 -1
  39. data/code/superant.com.rwdtinkerbackwindow/viewgemappletcontents.rb +1 -1
  40. data/code/superant.com.rwdtinkerbackwindow/viewlogfile.rb +13 -0
  41. data/configuration/rwdtinker.dist +4 -8
  42. data/configuration/rwdwgutenberg.dist +23 -0
  43. data/configuration/tinkerwin2variables.dist +17 -7
  44. data/gui/00coreguibegin/applicationguitop.rwd +1 -1
  45. data/gui/frontwindow0/{viewlogo/cc0openphoto.rwd → cc0openphoto.rwd} +0 -0
  46. data/gui/{frontwindowselectionbegin/selectiontabbegin → frontwindowselections}/00selectiontabbegin.rwd +0 -0
  47. data/gui/frontwindowselections/jumplinkcommands.rwd +15 -0
  48. data/gui/{frontwindowselectionzend/viewselectionzend → frontwindowselections}/wwselectionend.rwd +0 -0
  49. data/gui/{frontwindowselectionzend/viewselectionzend/zzdocumentbegin.rwd → frontwindowtdocuments/00documentbegin.rwd} +0 -0
  50. data/gui/frontwindowtdocuments/{superant.com.documents/tinkerdocuments.rwd → tinkerdocuments.rwd} +0 -0
  51. data/gui/{helpaboutbegin/superant.com.helpaboutbegin → frontwindowtdocuments}/zzdocumentend.rwd +0 -0
  52. data/gui/helpaboutbegin/{superant.com.helpaboutbegin/zzzrwdlasttab.rwd → zzzrwdlasttab.rwd} +0 -0
  53. data/gui/helpaboutbegin/{superant.com.helpaboutbegin/zzzzhelpscreenstart.rwd → zzzzhelpscreenstart.rwd} +0 -0
  54. data/gui/{helpaboutinstalled/superant.com.tinkerhelpabout/helpabouttab.rwd → helpaboutbegin/zzzzzzhelpabouttab.rwd} +0 -0
  55. data/gui/helpaboutzend/{superant.com.helpaboutend/helpscreenend.rwd → helpscreenend.rwd} +0 -0
  56. data/gui/helpaboutzend/{superant.com.helpaboutend/zhelpscreenstart2.rwd → zhelpscreenstart2.rwd} +0 -0
  57. data/gui/helpaboutzend/{superant.com.helpaboutend/zzzzhelpabout2.rwd → zzzzhelpabout2.rwd} +0 -0
  58. data/gui/helpaboutzend/{superant.com.helpaboutend/zzzzhelpscreen2end.rwd → zzzzhelpscreen2end.rwd} +0 -0
  59. data/gui/tinkerbackwindows/superant.com.backgutenberg/10appletbegin.rwd +4 -0
  60. data/gui/tinkerbackwindows/{superant.com.gutenberg → superant.com.backgutenberg}/1tabfirst.rwd +0 -0
  61. data/gui/tinkerbackwindows/{superant.com.gutenberg → superant.com.backgutenberg}/20listfiles.rwd +5 -4
  62. data/gui/tinkerbackwindows/{superant.com.gutenberg → superant.com.backgutenberg}/30booklistutilities.rwd +0 -0
  63. data/gui/tinkerbackwindows/superant.com.backgutenberg/35displaytab.rwd +26 -0
  64. data/gui/tinkerbackwindows/{superant.com.gutenberg → superant.com.backgutenberg}/67viewconfiguration.rwd +0 -0
  65. data/gui/{frontwindowselections/superant.com.rwdtinkerwin2selectiontab/jumplinkcommands.rwd → tinkerbackwindows/superant.com.backgutenberg/81jumplinkcommands.rwd} +2 -0
  66. data/gui/tinkerbackwindows/superant.com.backgutenberg/9end.rwd +6 -0
  67. data/gui/tinkerbackwindows/superant.com.gutenberg/10htmlnote.rwd +46 -0
  68. data/gui/tinkerbackwindows/superant.com.gutenberg/12tabfirst.rwd +39 -0
  69. data/gui/tinkerbackwindows/superant.com.gutenberg/35displaytab.rwd +4 -1
  70. data/gui/tinkerbackwindows/superant.com.gutenberg/50listfiles.rwd +37 -0
  71. data/gui/tinkerbackwindows/superant.com.gutenberg/81jumplinkcommands.rwd +1 -1
  72. data/gui/tinkerbackwindows/superant.com.tinkerbackwindow/75rwdlogfile.rwd +20 -0
  73. data/gui/tinkerbackwindows/superant.com.tinkerbackwindow/81jumplinkcommands.rwd +1 -1
  74. data/gui/zzcoreguiend/{tinkerapplicationguiend/yy9rwdend.rwd → yy9rwdend.rwd} +0 -0
  75. data/init.rb +15 -10
  76. data/installed/gutenbergdata02.inf +2 -2
  77. data/installed/{rwdwgutenberg-0.09.inf → rwdwgutenberg.inf} +3 -2
  78. data/lang/en/rwdcore/languagefile.rb +4 -3
  79. data/lang/es/rwdcore/languagefile-es.rb +1 -0
  80. data/lang/fr/rwdcore/languagefile.rb +1 -0
  81. data/lang/jp/rwdcore/languagefile.rb +1 -0
  82. data/lang/nl/rwdcore/languagefile.rb +1 -0
  83. data/{extras → lib}/rconftool.rb +13 -6
  84. data/{ev → lib/rwd}/browser.rb +2 -2
  85. data/{ev → lib/rwd}/ftools.rb +0 -0
  86. data/{ev → lib/rwd}/mime.rb +0 -0
  87. data/{ev → lib/rwd}/net.rb +18 -7
  88. data/{ev → lib/rwd}/ruby.rb +1 -1
  89. data/{ev → lib/rwd}/rwd.rb +108 -625
  90. data/{ev → lib/rwd}/sgml.rb +1 -1
  91. data/{ev → lib/rwd}/thread.rb +1 -1
  92. data/{ev → lib/rwd}/tree.rb +2 -2
  93. data/{ev → lib/rwd}/xml.rb +1 -1
  94. data/lib/rwdthemes/default.rwd +317 -0
  95. data/lib/rwdthemes/pda.rwd +72 -0
  96. data/lib/rwdthemes/windowslike.rwd +171 -0
  97. data/lib/rwdtinker/rwdtinkertools.rb +24 -0
  98. data/{extras → lib}/zip/ioextras.rb +0 -0
  99. data/{extras → lib}/zip/stdrubyext.rb +0 -0
  100. data/{extras → lib}/zip/tempfile_bugfixed.rb +0 -0
  101. data/{extras → lib}/zip/zip.rb +2 -2
  102. data/{extras → lib}/zip/zipfilesystem.rb +0 -0
  103. data/{extras → lib}/zip/ziprequire.rb +0 -0
  104. data/rwd_files/Books/marip10.lnk +6 -0
  105. data/{Books → rwd_files/Books}/marip10.txt +0 -0
  106. data/{Books → rwd_files/Books}/shannon1948.html +0 -0
  107. data/{Books/Shannon.gut → rwd_files/Books/shannon1948.lnk} +1 -1
  108. data/rwd_files/Books/shannon1948.txt +2667 -0
  109. data/rwd_files/HowTo_Gutenberg.txt +21 -1
  110. data/rwd_files/HowTo_Tinker.txt +58 -1
  111. data/rwd_files/log/rwdtinker.log +2082 -0
  112. data/{code/superant.com.gutenberg/helptexthashrwdgutenberg.rb → rwd_files/rwdgutenberghelpfiles.txt} +26 -19
  113. data/rwdconfig.dist +14 -13
  114. data/tests/makedist-rwdwgutenberg.rb +9 -7
  115. data/tests/makedist.rb +2 -2
  116. data/zips/rwdwcalc-0.63.zip +0 -0
  117. data/zips/rwdwfoldeditor-0.05.zip +0 -0
  118. data/zips/rwdwgutenberg-0.12.zip +0 -0
  119. data/zips/rwdwruby-1.08.zip +0 -0
  120. data/zips/wrubyslippers-1.07.zip +0 -0
  121. metadata +74 -59
  122. data/Books/Mariposa.gut +0 -6
  123. data/code/superant.com.gutenberg/rwdhypernotehelpabout.rb +0 -14
  124. data/code/superant.com.rwdtinkerbackwindow/installapplet.rb +0 -27
  125. data/configuration/language.dist +0 -8
  126. data/configuration/rwdapplicationidentity.dist +0 -3
  127. data/configuration/rwdwgutenberg-0.09.dist +0 -20
  128. data/gui/tinkerbackwindows/superant.com.gutenberg/36displaytab.rwd +0 -15
  129. data/gui/tinkerbackwindows/superant.com.gutenberg/40rwdgutenberg.rwd +0 -16
  130. data/gui/tinkerbackwindows/superant.com.gutenberg/40rwdgutenberghtml.rwd +0 -16
  131. data/lib/temp.rb +0 -1
@@ -0,0 +1,24 @@
+
+
+ module RwdtinkerTools
+
+ # tools to use in rwdtinker
+
+ def RwdtinkerTools.tail(filename, lines=12)
+
+
+ begin
+ tmpFile = File.open(filename, 'r')
+
+ return tmpFile.readlines.reverse!.slice(0,lines)
+
+ tmpFile.close
+ rescue
+ return "error in opening log"
+ $rwdtinkerlog.error "RwdtinkerTools.tail: file open error"
+ end
+ end
+
+ end
+
+
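As committed, this helper has two unreachable statements: tmpFile.close sits after the return, and the $rwdtinkerlog.error call sits after the return in the rescue branch. A minimal corrected sketch with the same interface (it assumes $rwdtinkerlog is the logger rwdtinker sets up elsewhere):

    module RwdtinkerTools
      # Return the last `lines` lines of the file, newest first.
      # Sketch only -- same names as the helper above, with the file closed
      # via the block form of File.open and the error actually logged.
      def RwdtinkerTools.tail(filename, lines = 12)
        File.open(filename, 'r') do |f|
          return f.readlines.reverse!.slice(0, lines)
        end
      rescue StandardError
        # assumes rwdtinker's global logger; guard in case it is not set up
        $rwdtinkerlog.error "RwdtinkerTools.tail: file open error" if $rwdtinkerlog
        "error in opening log"
      end
    end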
@@ -4,8 +4,8 @@ require 'singleton'
  require 'tempfile'
  require 'ftools'
  require 'zlib'
- require 'extras/zip/stdrubyext'
- require 'extras/zip/ioextras'
+ require 'lib/zip/stdrubyext'
+ require 'lib/zip/ioextras'
 
  if Tempfile.superclass == SimpleDelegator
  require 'zip/tempfile_bugfixed'
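The only change in this hunk is the require prefix, following the extras/ → lib/ move shown in the file list. These are still plain load-path requires, so they resolve only if the directory containing lib/ is on $LOAD_PATH; a sketch of that assumption (the path handling here is illustrative, not code from the gem):

    # Illustrative only: put the directory that holds lib/ on the load path
    # so the renamed requires resolve; the gem's own init.rb may do this differently.
    data_dir = File.expand_path(File.dirname(__FILE__))
    $LOAD_PATH.unshift(data_dir) unless $LOAD_PATH.include?(data_dir)

    require 'lib/zip/stdrubyext'
    require 'lib/zip/ioextras'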
@@ -0,0 +1,6 @@
+ rwd_files/Books/marip10.txt
+ #Their Mariposa Legend
+
+
+
+
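This .lnk record, like the shannon1948.lnk record further down, is a short pointer file: line 1 is the path of the book file under rwd_files/Books/, line 2 is a '#'-prefixed display title, and the remaining lines are blank. The diff does not show how the applet reads these, so the reader below is only a hypothetical sketch of that two-line layout:

    # Hypothetical reader for the .lnk records shown in these hunks:
    #   line 1 -> path of the book text/HTML file
    #   line 2 -> "#<display title>"
    # Method name and return shape are illustrative, not taken from the applet code.
    def read_book_link(lnk_path)
      lines = File.readlines(lnk_path).map { |l| l.strip }
      { 'file' => lines[0], 'title' => lines[1].to_s.sub(/\A#/, '') }
    end

    record = read_book_link('rwd_files/Books/marip10.lnk')
    # e.g. { "file" => "rwd_files/Books/marip10.txt", "title" => "Their Mariposa Legend" }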
@@ -1,4 +1,4 @@
- Books/shannon1948.html
+ rwd_files/Books/shannon1948.txt
  #Theory of Communications
 
 
@@ -0,0 +1,2667 @@
1
+ Reprinted with corrections from The Bell System Technical Journal,
2
+ Vol. 27, pp. 379�423, 623�656, July, October, 1948.
3
+ A Mathematical Theory of Communication
4
+ By C. E. SHANNON
5
+ INTRODUCTION The recent development of various methods of modulation such as
6
+ PCM and PPM which exchange bandwidth for signal-to-noise ratio has intensified
7
+ the interest in a general theory of communication. A T basis for such a theory
8
+ is contained in the important papers of Nyquist1 and Hartley2 on this subject.
9
+ In thepresent paper we will extend the theory to include a number of new
10
+ factors, in particular the effect of noisein the channel, and the savings
11
+ possible due to the statistical structure of the original message and due to
12
+ thenature of the final destination of the information. The fundamental problem
13
+ of communication is that of reproducing at one point either exactly or ap-
14
+ proximately a message selected at another point. Frequently the messages have
15
+ meaning; that is they referto or are correlated according to some system with
16
+ certain physical or conceptual entities. These semanticaspects of communication
17
+ are irrelevant to the engineering problem. The significant aspect is that the
18
+ actualmessage is one selected from a setof possible messages. The system must
19
+ be designed to operate for eachpossible selection, not just the one which will
20
+ actually be chosen since this is unknown at the time of design. If the number
21
+ of messages in the set is finite then this number or any monotonic function of
22
+ this number can be regarded as a measure of the information produced when one
23
+ message is chosen from the set, allchoices being equally likely. As was pointed
24
+ out by Hartley the most natural choice is the logarithmicfunction. Although
25
+ this definition must be generalized considerably when we consider the influence
26
+ of thestatistics of the message and when we have a continuous range of
27
+ messages, we will in all cases use anessentially logarithmic measure. The
28
+ logarithmic measure is more convenient for various reasons: 1. It is
29
+ practically more useful. Parameters of engineering importance such as time,
30
+ bandwidth, number of relays, etc., tend to vary linearly with the logarithm of
31
+ the number of possibilities. For example,adding one relay to a group doubles
32
+ the number of possible states of the relays. It adds 1 to the base 2logarithm
33
+ of this number. Doubling the time roughly squares the number of possible
34
+ messages, ordoubles the logarithm, etc. 2. It is nearer to our intuitive
35
+ feeling as to the proper measure. This is closely related to (1) since we in-
36
+ tuitively measures entities by linear comparison with common standards. One
37
+ feels, for example, thattwo punched cards should have twice the capacity of one
38
+ for information storage, and two identicalchannels twice the capacity of one
39
+ for transmitting information. 3. It is mathematically more suitable. Many of
40
+ the limiting operations are simple in terms of the loga- rithm but would
41
+ require clumsy restatement in terms of the number of possibilities. The choice
42
+ of a logarithmic base corresponds to the choice of a unit for measuring
43
+ information. If the base 2 is used the resulting units may be called binary
44
+ digits, or more briefly bits,a word suggested byJ. W. Tukey. A device with two
45
+ stable positions, such as a relay or a flip-flop circuit, can store one bit
46
+ ofinformation. Nsuch devices can store Nbits, since the total number of
47
+ possible states is 2Nand log2 2N N. = If the base 10 is used the units may be
48
+ called decimal digits. Since log2 M log log = 10 M= 10 2 3 32 log = : 10 M;
49
+
50
+ 1Nyquist, H., "Certain Factors Affecting Telegraph Speed," Bell System
51
+ Technical Journal,April 1924, p. 324;
52
+
53
+ "Certain Topics in Telegraph Transmission Theory," A.I.E.E. Trans.,v. 47, April
54
+ 1928, p. 617. 2Hartley, R. V. L., "Transmission of Information," Bell System
55
+ Technical Journal,July 1928, p. 535. 1
56
+ ===============================================================================
57
+ INFORMATION SOURCE TRANSMITTER RECEIVER DESTINATION SIGNAL RECEIVED SIGNAL
58
+ MESSAGE MESSAGE NOISE SOURCE Fig. 1 -- Schematic diagram of a general
59
+ communication system. a decimal digit is about 3 1 bits. A digit wheel on a
60
+ desk computing machine has ten stable positions and 3 therefore has a storage
61
+ capacity of one decimal digit. In analytical work where integration and
62
+ differentiationare involved the base eis sometimes useful. The resulting units
63
+ of information will be called natural units.Change from the base ato base
64
+ bmerely requires multiplication by logb a. By a communication system we will
65
+ mean a system of the type indicated schematically in Fig. 1. It consists of
66
+ essentially five parts: 1. An information sourcewhich produces a message or
67
+ sequence of messages to be communicated to the receiving terminal. The message
68
+ may be of various types: (a) A sequence of letters as in a telegraphof teletype
69
+ system;
70
+
71
+ (b) A single function of time f tas in radio or telephony;
72
+
73
+ (c) A function of time and other variables as in black and white television -
74
+ - here the message may be thought of as afunction f x y tof two space
75
+ coordinates and time, the light intensity at point x yand time ton a ;
76
+
77
+ ;
78
+
79
+ ;
80
+
81
+ pickup tube plate;
82
+
83
+ (d) Two or more functions of time, say f t, g t, h t-- this is the case in
84
+ "three- dimensional" sound transmission or if the system is intended to service
85
+ several individual channels inmultiplex;
86
+
87
+ (e) Several functions of several variables -- in color television the message
88
+ consists of threefunctions f x y t, g x y t, h x y tdefined in a three-
89
+ dimensional continuum -- we may also think ;
90
+
91
+ ;
92
+
93
+ ;
94
+
95
+ ;
96
+
97
+ ;
98
+
99
+ ;
100
+
101
+ of these three functions as components of a vector field defined in the region
102
+ -- similarly, severalblack and white television sources would produce
103
+ "messages" consisting of a number of functionsof three variables;
104
+
105
+ (f) Various combinations also occur, for example in television with an
106
+ associatedaudio channel. 2. A transmitterwhich operates on the message in some
107
+ way to produce a signal suitable for trans- mission over the channel. In
108
+ telephony this operation consists merely of changing sound pressureinto a
109
+ proportional electrical current. In telegraphy we have an encoding operation
110
+ which producesa sequence of dots, dashes and spaces on the channel
111
+ corresponding to the message. In a multiplexPCM system the different speech
112
+ functions must be sampled, compressed, quantized and encoded,and finally
113
+ interleaved properly to construct the signal. Vocoder systems, television and
114
+ frequencymodulation are other examples of complex operations applied to the
115
+ message to obtain the signal. 3. The channelis merely the medium used to
116
+ transmit the signal from transmitter to receiver. It may be a pair of wires, a
117
+ coaxial cable, a band of radio frequencies, a beam of light, etc. 4. The
118
+ receiverordinarily performs the inverse operation of that done by the
119
+ transmitter, reconstructing the message from the signal. 5. The destinationis
120
+ the person (or thing) for whom the message is intended. We wish to consider
121
+ certain general problems involving communication systems. To do this it is
122
+ first necessary to represent the various elements involved as mathematical
123
+ entities, suitably idealized from their 2
124
+ ===============================================================================
125
+ physical counterparts. We may roughly classify communication systems into three
126
+ main categories: discrete,continuous and mixed. By a discrete system we will
127
+ mean one in which both the message and the signalare a sequence of discrete
128
+ symbols. A typical case is telegraphy where the message is a sequence of
129
+ lettersand the signal a sequence of dots, dashes and spaces. A continuous
130
+ system is one in which the message andsignal are both treated as continuous
131
+ functions, e.g., radio or television. A mixed system is one in whichboth
132
+ discrete and continuous variables appear, e.g., PCM transmission of speech. We
133
+ first consider the discrete case. This case has applications not only in
134
+ communication theory, but also in the theory of computing machines, the design
135
+ of telephone exchanges and other fields. In additionthe discrete case forms a
136
+ foundation for the continuous and mixed cases which will be treated in the
137
+ secondhalf of the paper. PART I: DISCRETE NOISELESS SYSTEMS 1. THE DISCRETE
138
+ NOISELESS CHANNEL Teletype and telegraphy are two simple examples of a discrete
139
+ channel for transmitting information. Gen-erally, a discrete channel will mean
140
+ a system whereby a sequence of choices from a finite set of elementarysymbols
141
+ S1 Sncan be transmitted from one point to another. Each of the symbols Siis
142
+ assumed to have ;
143
+
144
+ : : : ;
145
+
146
+ a certain duration in time tiseconds (not necessarily the same for different
147
+ Si, for example the dots anddashes in telegraphy). It is not required that all
148
+ possible sequences of the Sibe capable of transmission onthe system;
149
+
150
+ certain sequences only may be allowed. These will be possible signals for the
151
+ channel. Thusin telegraphy suppose the symbols are: (1) A dot, consisting of
152
+ line closure for a unit of time and then lineopen for a unit of time;
153
+
154
+ (2) A dash, consisting of three time units of closure and one unit open;
155
+
156
+ (3) A letterspace consisting of, say, three units of line open;
157
+
158
+ (4) A word space of six units of line open. We might placethe restriction on
159
+ allowable sequences that no spaces follow each other (for if two letter spaces
160
+ are adjacent,it is identical with a word space). The question we now consider
161
+ is how one can measure the capacity ofsuch a channel to transmit information.
162
+ In the teletype case where all symbols are of the same duration, and any
163
+ sequence of the 32 symbols is allowed the answer is easy. Each symbol
164
+ represents five bits of information. If the system transmits nsymbols per
165
+ second it is natural to say that the channel has a capacity of 5nbits per
166
+ second. This does notmean that the teletype channel will always be transmitting
167
+ information at this rate -- this is the maximumpossible rate and whether or not
168
+ the actual rate reaches this maximum depends on the source of informationwhich
169
+ feeds the channel, as will appear later. In the more general case with
170
+ different lengths of symbols and constraints on the allowed sequences, we make
171
+ the following definition:Definition: The capacity Cof a discrete channel is
172
+ given by log N T C Lim = T T ! where N Tis the number of allowed signals of
173
+ duration T. It is easily seen that in the teletype case this reduces to the
174
+ previous result. It can be shown that the limit in question will exist as a
175
+ finite number in most cases of interest. Suppose all sequences of the symbolsS1
176
+ Snare allowed and these symbols have durations t1 tn. What is the channel
177
+ capacity? If N t ;
178
+
179
+ : : : ;
180
+
181
+ ;
182
+
183
+ : : : ;
184
+
185
+ represents the number of sequences of duration twe have N t N t t1 N t t2 N t
186
+ tn = , + , + + , : The total number is equal to the sum of the numbers of
187
+ sequences ending in S1 S2 Snand these are ;
188
+
189
+ ;
190
+
191
+ : : : ;
192
+
193
+ N t t1 N t t2 N t tn, respectively. According to a well-known result in finite
194
+ differences, N t , ;
195
+
196
+ , ;
197
+
198
+ : : : ;
199
+
200
+ , is then asymptotic for large tto Xtwhere X 0 0 is the largest real solution
201
+ of the characteristic equation: X t t tn , 1 X, 2 X, 1 + + + = 3
202
+ ===============================================================================
203
+ and therefore C log X0 = : In case there are restrictions on allowed sequences
204
+ we may still often obtain a difference equation of this type and find Cfrom the
205
+ characteristic equation. In the telegraphy case mentioned above N t N t 2 N t 4
206
+ N t 5 N t 7 N t 8 N t 10 = , + , + , + , + , + , as we see by counting
207
+ sequences of symbols according to the last or next to the last symbol
208
+ occurring.Hence Cis
209
+ log
210
+ 2
211
+ 4
212
+ 5
213
+ 7
214
+ 8
215
+ 10
216
+ 0 where 0 is the positive root of 1 . Solving this we find , = + + + + + C 0
217
+ 539. = : A very general type of restriction which may be placed on allowed
218
+ sequences is the following: We imagine a number of possible states a1 a2 am.
219
+ For each state only certain symbols from the set S1 Sn ;
220
+
221
+ ;
222
+
223
+ : : : ;
224
+
225
+ ;
226
+
227
+ : : : ;
228
+
229
+ can be transmitted (different subsets for the different states). When one of
230
+ these has been transmitted thestate changes to a new state depending both on
231
+ the old state and the particular symbol transmitted. Thetelegraph case is a
232
+ simple example of this. There are two states depending on whether or not a
233
+ space wasthe last symbol transmitted. If so, then only a dot or a dash can be
234
+ sent next and the state always changes.If not, any symbol can be transmitted
235
+ and the state changes if a space is sent, otherwise it remains the same.The
236
+ conditions can be indicated in a linear graph as shown in Fig. 2. The junction
237
+ points correspond to the DASH DOT DOT LETTER SPACE DASH WORD SPACE Fig. 2 -
238
+ - Graphical representation of the constraints on telegraph symbols. states and
239
+ the lines indicate the symbols possible in a state and the resulting state. In
240
+ Appendix 1 it is shownthat if the conditions on allowed sequences can be
241
+ described in this form Cwill exist and can be calculatedin accordance with the
242
+ following result: s Theorem 1:Let b be the duration of the sth symbol which is
243
+ allowable in state iand leads to state j. i j Then the channel capacity Cis
244
+ equal to logWwhere Wis the largest real root of the determinant equation: s W b
245
+ , i j i j 0 , = s where i j 1 if i jand is zero otherwise. = = For example, in
246
+ the telegraph case (Fig. 2) the determinant is: 1 W2 4 , W, , + 0 W3 6 2 4 = :
247
+ , W, W, W, 1 + + , On expansion this leads to the equation given above for this
248
+ case. 2. THE DISCRETE SOURCE OF INFORMATION We have seen that under very
249
+ general conditions the logarithm of the number of possible signals in a
250
+ discretechannel increases linearly with time. The capacity to transmit
251
+ information can be specified by giving thisrate of increase, the number of bits
252
+ per second required to specify the particular signal used. We now consider the
253
+ information source. How is an information source to be described
254
+ mathematically, and how much information in bits per second is produced in a
255
+ given source? The main point at issue is theeffect of statistical knowledge
256
+ about the source in reducing the required capacity of the channel, by the use 4
257
+ ===============================================================================
258
+ of proper encoding of the information. In telegraphy, for example, the messages
259
+ to be transmitted consist ofsequences of letters. These sequences, however, are
260
+ not completely random. In general, they form sentencesand have the statistical
261
+ structure of, say, English. The letter E occurs more frequently than Q, the
262
+ sequenceTH more frequently than XP, etc. The existence of this structure allows
263
+ one to make a saving in time (orchannel capacity) by properly encoding the
264
+ message sequences into signal sequences. This is already doneto a limited
265
+ extent in telegraphy by using the shortest channel symbol, a dot, for the most
266
+ common Englishletter E;
267
+
268
+ while the infrequent letters, Q, X, Z are represented by longer sequences of
269
+ dots and dashes. Thisidea is carried still further in certain commercial codes
270
+ where common words and phrases are representedby four- or five-letter code
271
+ groups with a considerable saving in average time. The standardized greetingand
272
+ anniversary telegrams now in use extend this to the point of encoding a
273
+ sentence or two into a relativelyshort sequence of numbers. We can think of a
274
+ discrete source as generating the message, symbol by symbol. It will choose
275
+ succes- sive symbols according to certain probabilities depending, in general,
276
+ on preceding choices as well as theparticular symbols in question. A physical
277
+ system, or a mathematical model of a system which producessuch a sequence of
278
+ symbols governed by a set of probabilities, is known as a stochastic process.3
279
+ We mayconsider a discrete source, therefore, to be represented by a stochastic
280
+ process. Conversely, any stochasticprocess which produces a discrete sequence
281
+ of symbols chosen from a finite set may be considered a discretesource. This
282
+ will include such cases as: 1. Natural written languages such as English,
283
+ German, Chinese. 2. Continuous information sources that have been rendered
284
+ discrete by some quantizing process. For example, the quantized speech from a
285
+ PCM transmitter, or a quantized television signal. 3. Mathematical cases where
286
+ we merely define abstractly a stochastic process which generates a se- quence
287
+ of symbols. The following are examples of this last type of source. (A) Suppose
288
+ we have five letters A, B, C, D, E which are chosen each with probability .2,
289
+ successive choices being independent. This would lead to a sequence of which
290
+ the following is a typicalexample. B D C B C E C C C A D C B D D A A E C E E AA
291
+ B B D A E E C A C E E B A E E C B C E A D. This was constructed with the use of
292
+ a table of random numbers.4 (B) Using the same five letters let the
293
+ probabilities be .4, .1, .2, .2, .1, respectively, with successive choices
294
+ independent. A typical message from this source is then: A A A C D C B D C E A
295
+ A D A D A C E D AE A D C A B E D A D D C E C A A A A A D. (C) A more
296
+ complicated structure is obtained if successive symbols are not chosen
297
+ independently but their probabilities depend on preceding letters. In the
298
+ simplest case of this type a choicedepends only on the preceding letter and not
299
+ on ones before that. The statistical structure canthen be described by a set of
300
+ transition probabilities pi j, the probability that letter iis followed by
301
+ letter j. The indices iand jrange over all the possible symbols. A second
302
+ equivalent way ofspecifying the structure is to give the "digram" probabilities
303
+ p i j, i.e., the relative frequency of ;
304
+
305
+ the digram i j. The letter frequencies p i, (the probability of letter i), the
306
+ transition probabilities 3See, for example, S. Chandrasekhar, "Stochastic
307
+ Problems in Physics and Astronomy," Reviews of Modern Physics, v. 15, No. 1,
308
+ January 1943, p. 1. 4Kendall and Smith, Tables of Random Sampling
309
+ Numbers,Cambridge, 1939. 5
310
+ ===============================================================================
311
+ pi jand the digram probabilities p i jare related by the following formulas: ;
312
+
313
+ p i p i jp j ip j pj i = ;
314
+
315
+ = ;
316
+
317
+ = j j j p i j p i pi j ;
318
+
319
+ = pi jp ip i j1 = = ;
320
+
321
+ = : j i i j ;
322
+
323
+ As a specific example suppose there are three letters A, B, C with the
324
+ probability tables: pi j j i p i p i j j ;
325
+
326
+ A B C A B C A 0 4 1 A 9 A 0 4 1 5 5 27 15 15 i B 1 1 0 B 16 i B 8 8 0 2 2 27 27
327
+ 27 C 1 2 1 C 2 C 1 4 1 2 5 10 27 27 135 135 A typical message from this source
328
+ is the following: A B B A B A B A B A B A B A B B B A B B B B B A B A B A B A B
329
+ A B B B A C A C A BB A B B B B A B B A B A C B B B A B A. The next increase in
330
+ complexity would involve trigram frequencies but no more. The choice ofa letter
331
+ would depend on the preceding two letters but not on the message before that
332
+ point. Aset of trigram frequencies p i j kor equivalently a set of transition
333
+ probabilities pi j kwould ;
334
+
335
+ ;
336
+
337
+ be required. Continuing in this way one obtains successively more complicated
338
+ stochastic pro-cesses. In the general n-gram case a set of n-gram probabilities
339
+ p i1 i2 inor of transition ;
340
+
341
+ ;
342
+
343
+ : : : ;
344
+
345
+ probabilities pi i is required to specify the statistical structure. 1 i i n ;
346
+
347
+ 2;
348
+
349
+ :::;
350
+
351
+ n1 , (D) Stochastic processes can also be defined which produce a text
352
+ consisting of a sequence of "words." Suppose there are five letters A, B, C, D,
353
+ E and 16 "words" in the language withassociated probabilities: .10 A .16 BEBE
354
+ .11 CABED .04 DEB .04 ADEB .04 BED .05 CEED .15 DEED .05 ADEE .02 BEED .08 DAB
355
+ .01 EAB .01 BADD .05 CA .04 DAD .05 EE Suppose successive "words" are chosen
356
+ independently and are separated by a space. A typicalmessage might be: DAB EE A
357
+ BEBE DEED DEB ADEE ADEE EE DEB BEBE BEBE BEBE ADEE BED DEEDDEED CEED ADEE A
358
+ DEED DEED BEBE CABED BEBE BED DAB DEED ADEB. If all the words are of finite
359
+ length this process is equivalent to one of the preceding type, butthe
360
+ description may be simpler in terms of the word structure and probabilities. We
361
+ may alsogeneralize here and introduce transition probabilities between words,
362
+ etc. These artificial languages are useful in constructing simple problems and
363
+ examples to illustrate vari- ous possibilities. We can also approximate to a
364
+ natural language by means of a series of simple artificiallanguages. The zero-
365
+ order approximation is obtained by choosing all letters with the same
366
+ probability andindependently. The first-order approximation is obtained by
367
+ choosing successive letters independently buteach letter having the same
368
+ probability that it has in the natural language.5 Thus, in the first-order ap-
369
+ proximation to English, E is chosen with probability .12 (its frequency in
370
+ normal English) and W withprobability .02, but there is no influence between
371
+ adjacent letters and no tendency to form the preferred 5Letter, digram and
372
+ trigram frequencies are given in Secret and Urgentby Fletcher Pratt, Blue
373
+ Ribbon Books, 1939. Word frequen- cies are tabulated in Relative Frequency of
374
+ English Speech Sounds,G. Dewey, Harvard University Press, 1923. 6
375
+ ===============================================================================
376
+ digrams such as TH, ED, etc. In the second-order approximation, digram
377
+ structure is introduced. After aletter is chosen, the next one is chosen in
378
+ accordance with the frequencies with which the various lettersfollow the first
379
+ one. This requires a table of digram frequencies pi j. In the third-order
380
+ approximation, trigram structure is introduced. Each letter is chosen with
381
+ probabilities which depend on the preceding twoletters. 3. THE SERIES OF
382
+ APPROXIMATIONS TO ENGLISH To give a visual idea of how this series of processes
383
+ approaches a language, typical sequences in the approx-imations to English have
384
+ been constructed and are given below. In all cases we have assumed a 27-
385
+ symbol"alphabet," the 26 letters and a space. 1. Zero-order approximation
386
+ (symbols independent and equiprobable). XFOML RXKHRJFFJUJ ZLPWCFWKCYJ
387
+ FFJEYVKCQSGHYD QPAAMKBZAACIBZL-HJQD. 2. First-order approximation (symbols
388
+ independent but with frequencies of English text). OCRO HLI RGWR NMIELWIS EU LL
389
+ NBNESEBYA TH EEI ALHENHTTPA OOBTTVANAH BRL. 3. Second-order approximation
390
+ (digram structure as in English). ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY
391
+ ACHIN D ILONASIVE TU-COOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE. 4.
392
+ Third-order approximation (trigram structure as in English). IN NO IST LAT WHEY
393
+ CRATICT FROURE BIRS GROCID PONDENOME OF DEMONS-TURES OF THE REPTAGIN IS
394
+ REGOACTIONA OF CRE. 5. First-order word approximation. Rather than continue
395
+ with tetragram, , n-gram structure it is easier : : : and better to jump at
396
+ this point to word units. Here words are chosen independently but with
397
+ theirappropriate frequencies. REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME
398
+ CAN DIFFERENT NAT-URAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO
399
+ FURNISHESTHE LINE MESSAGE HAD BE THESE. 6. Second-order word approximation. The
400
+ word transition probabilities are correct but no further struc- ture is
401
+ included. THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHAR-
402
+ ACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THATTHE TIME OF
403
+ WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED. The resemblance to ordinary
404
+ English text increases quite noticeably at each of the above steps. Note that
405
+ these samples have reasonably good structure out to about twice the range that
406
+ is taken into account in theirconstruction. Thus in (3) the statistical process
407
+ insures reasonable text for two-letter sequences, but four-letter sequences
408
+ from the sample can usually be fitted into good sentences. In (6) sequences of
409
+ four or morewords can easily be placed in sentences without unusual or strained
410
+ constructions. The particular sequenceof ten words "attack on an English writer
411
+ that the character of this" is not at all unreasonable. It appears thenthat a
412
+ sufficiently complex stochastic process will give a satisfactory representation
413
+ of a discrete source. The first two samples were constructed by the use of a
414
+ book of random numbers in conjunction with (for example 2) a table of letter
415
+ frequencies. This method might have been continued for (3), (4) and (5),since
416
+ digram, trigram and word frequency tables are available, but a simpler
417
+ equivalent method was used. 7
418
+ ===============================================================================
419
+ To construct (3) for example, one opens a book at random and selects a letter
420
+ at random on the page. Thisletter is recorded. The book is then opened to
421
+ another page and one reads until this letter is encountered.The succeeding
422
+ letter is then recorded. Turning to another page this second letter is searched
423
+ for and thesucceeding letter recorded, etc. A similar process was used for (4),
424
+ (5) and (6). It would be interesting iffurther approximations could be
425
+ constructed, but the labor involved becomes enormous at the next stage. 4.
426
+ GRAPHICAL REPRESENTATION OF A MARKOFF PROCESS Stochastic processes of the type
427
+ described above are known mathematically as discrete Markoff processesand have
428
+ been extensively studied in the literature.6 The general case can be described
429
+ as follows: Thereexist a finite number of possible "states" of a system;
430
+
431
+ S1 S2 Sn. In addition there is a set of transition ;
432
+
433
+ ;
434
+
435
+ : : : ;
436
+
437
+ probabilities;
438
+
439
+ pi jthe probability that if the system is in state Siit will next go to state S
440
+ j. To make this Markoff process into an information source we need only assume
441
+ that a letter is produced for each transitionfrom one state to another. The
442
+ states will correspond to the "residue of influence" from preceding letters.
443
+ The situation can be represented graphically as shown in Figs. 3, 4 and 5. The
444
+ "states" are the junction A .1 .4 B E .2 .1 C D .2 Fig. 3 -- A graph
445
+ corresponding to the source in example B. points in the graph and the
446
+ probabilities and letters produced for a transition are given beside the
447
+ correspond-ing line. Figure 3 is for the example B in Section 2, while Fig. 4
448
+ corresponds to the example C. In Fig. 3 B C A A .8 .2 .5 .5 B C B .4 .5 .1 Fig.
449
+ 4 -- A graph corresponding to the source in example C. there is only one state
450
+ since successive letters are independent. In Fig. 4 there are as many states as
451
+ letters.If a trigram example were constructed there would be at most n2 states
452
+ corresponding to the possible pairsof letters preceding the one being chosen.
453
+ Figure 5 is a graph for the case of word structure in example D.Here S
454
+ corresponds to the "space" symbol. 5. ERGODIC AND MIXED SOURCES As we have
455
+ indicated above a discrete source for our purposes can be considered to be
456
+ represented by aMarkoff process. Among the possible discrete Markoff processes
457
+ there is a group with special propertiesof significance in communication
458
+ theory. This special class consists of the "ergodic" processes and weshall call
459
+ the corresponding sources ergodic sources. Although a rigorous definition of an
460
+ ergodic process issomewhat involved, the general idea is simple. In an ergodic
461
+ process every sequence produced by the process 6For a detailed treatment see M.
462
+ Fr�echet, M�ethode des fonctions arbitraires. Th�eorie des �ev�enements en
463
+ cha^ine dans le cas d'un nombre fini d'�etats possibles. Paris, Gauthier-
464
+ Villars, 1938. 8
465
+ ===============================================================================
466
+ is the same in statistical properties. Thus the letter frequencies, digram
467
+ frequencies, etc., obtained fromparticular sequences, will, as the lengths of
468
+ the sequences increase, approach definite limits independentof the particular
469
+ sequence. Actually this is not true of every sequence but the set for which it
470
+ is false hasprobability zero. Roughly the ergodic property means statistical
471
+ homogeneity. All the examples of artificial languages given above are ergodic.
472
+ This property is related to the structure of the corresponding graph. If the
473
+ graph has the following two properties7 the corresponding process willbe
474
+ ergodic: 1. The graph does not consist of two isolated parts A and B such that
475
+ it is impossible to go from junction points in part A to junction points in
476
+ part B along lines of the graph in the direction of arrows and alsoimpossible
477
+ to go from junctions in part B to junctions in part A. 2. A closed series of
478
+ lines in the graph with all arrows on the lines pointing in the same
479
+ orientation will be called a "circuit." The "length" of a circuit is the number
480
+ of lines in it. Thus in Fig. 5 series BEBESis a circuit of length 5. The second
481
+ property required is that the greatest common divisor of the lengthsof all
482
+ circuits in the graph be one. D E B E S A B E E D A B D E S B D E C A E E B B D
483
+ E A D B E E A S Fig. 5 -- A graph corresponding to the source in example D. If
484
+ the first condition is satisfied but the second one violated by having the
485
+ greatest common divisor equal to d 1, the sequences have a certain type of
486
+ periodic structure. The various sequences fall into ddifferent classes which
487
+ are statistically the same apart from a shift of the origin (i.e., which letter
488
+ in the sequence iscalled letter 1). By a shift of from 0 up to d 1 any sequence
489
+ can be made statistically equivalent to any , other. A simple example with d 2
490
+ is the following: There are three possible letters a b c. Letter ais = ;
491
+
492
+ ;
493
+
494
+ followed with either bor cwith probabilities 1 and 2 respectively. Either bor
495
+ cis always followed by letter 3 3 a. Thus a typical sequence is a b a c a c a c
496
+ a b a c a b a b a c a c: This type of situation is not of much importance for
497
+ our work. If the first condition is violated the graph may be separated into a
498
+ set of subgraphs each of which satisfies the first condition. We will assume
499
+ that the second condition is also satisfied for each subgraph. We have inthis
500
+ case what may be called a "mixed" source made up of a number of pure
501
+ components. The componentscorrespond to the various subgraphs. If L1, L2, L3
502
+ are the component sources we may write ;
503
+
504
+ : : : L p1L1 p2L2 p3L3 = + + + 7These are restatements in terms of the graph of
505
+ conditions given in Fr�echet. 9
506
+ ===============================================================================
507
+ where piis the probability of the component source Li. Physically the situation
508
+ represented is this: There are several different sources L1, L2, L3 which are ;
509
+
510
+ : : : each of homogeneous statistical structure (i.e., they are ergodic). We do
511
+ not know a prioriwhich is to beused, but once the sequence starts in a given
512
+ pure component Li, it continues indefinitely according to thestatistical
513
+ structure of that component. As an example one may take two of the processes
514
+ defined above and assume p1 2 and p2 8. A = : = : sequence from the mixed
515
+ source L 2L1 8L2 = : + : would be obtained by choosing first L1 or L2 with
516
+ probabilities .2 and .8 and after this choice generating asequence from
517
+ whichever was chosen. Except when the contrary is stated we shall assume a
518
+ source to be ergodic. This assumption enables one to identify averages along a
519
+ sequence with averages over the ensemble of possible sequences (the
520
+ probabilityof a discrepancy being zero). For example the relative frequency of
521
+ the letter A in a particular infinitesequence will be, with probability one,
522
+ equal to its relative frequency in the ensemble of sequences. If Piis the
523
+ probability of state iand pi jthe transition probability to state j, then for
524
+ the process to be stationary it is clear that the Pimust satisfy equilibrium
525
+ conditions: Pj Pipi j = : i In the ergodic case it can be shown that with any
526
+ starting conditions the probabilities Pj Nof being in state jafter Nsymbols,
527
+ approach the equilibrium values as N . ! 6. CHOICE, UNCERTAINTY AND ENTROPY We
528
+ have represented a discrete information source as a Markoff process. Can we
529
+ define a quantity whichwill measure, in some sense, how much information is
530
+ "produced" by such a process, or better, at what rateinformation is produced?
531
+ Suppose we have a set of possible events whose probabilities of occurrence are
532
+ p1 p2 pn. These ;
533
+
534
+ ;
535
+
536
+ : : : ;
537
+
538
+ probabilities are known but that is all we know concerning which event will
539
+ occur. Can we find a measureof how much "choice" is involved in the selection
540
+ of the event or of how uncertain we are of the outcome? If there is such a
541
+ measure, say H p1 p2 pn, it is reasonable to require of it the following
542
+ properties: ;
543
+
544
+ ;
545
+
546
+ : : : ;
547
+
548
+ 1. Hshould be continuous in the pi. 2. If all the p 1 iare equal, pi , then
549
+ Hshould be a monotonic increasing function of n. With equally = n likely events
550
+ there is more choice, or uncertainty, when there are more possible events. 3.
551
+ If a choice be broken down into two successive choices, the original Hshould be
552
+ the weighted sum of the individual values of H. The meaning of this is
553
+ illustrated in Fig. 6. At the left we have three 1 2 1 2 1 2 1 3 2 3 1 3 1 2 1
554
+ 6 1 3 1 6 Fig. 6 -- Decomposition of a choice from three possibilities.
555
+ possibilities p 1 1 1 1 , p2 , p3 . On the right we first choose between two
556
+ possibilities each with = 2 = 3 = 6 probability 1 , and if the second occurs
557
+ make another choice with probabilities 2 , 1 . The final results 2 3 3 have the
558
+ same probabilities as before. We require, in this special case, that H1 1 1 H1
559
+ 1 1 H2 1 2 ;
560
+
561
+ 3 ;
562
+
563
+ 6 = 2 ;
564
+
565
+ 2 + 2 3 ;
566
+
567
+ 3 : The coefficient 1 is because this second choice only occurs half the time.
568
+ 2 10
569
+ ===============================================================================
570
+ In Appendix 2, the following result is established: Theorem 2:The only
571
+ Hsatisfying the three above assumptions is of the form: n H K pilog pi = , i1 =
572
+ where Kis a positive constant. This theorem, and the assumptions required for
573
+ its proof, are in no way necessary for the present theory. It is given chiefly
574
+ to lend a certain plausibility to some of our later definitions. The real
575
+ justification of thesedefinitions, however, will reside in their implications.
576
+ Quantities of the form H pilog pi(the constant Kmerely amounts to a choice of a
577
+ unit of measure) = , play a central role in information theory as measures of
578
+ information, choice and uncertainty. The form of Hwill be recognized as that of
579
+ entropy as defined in certain formulations of statistical mechanics8 where
580
+ piisthe probability of a system being in cell iof its phase space. His then,
581
+ for example, the Hin Boltzmann'sfamous Htheorem. We shall call H pilog pithe
582
+ entropy of the set of probabilities p1 pn. If xis a = , ;
583
+
584
+ : : : ;
585
+
586
+ chance variable we will write H xfor its entropy;
587
+
588
+ thus xis not an argument of a function but a label for a number, to
589
+ differentiate it from H ysay, the entropy of the chance variable y. The entropy
590
+ in the case of two possibilities with probabilities pand q 1 p, namely = , H
591
+ plog p qlogq = , + is plotted in Fig. 7 as a function of p. 1.0 .9 .8 .7 H BITS
592
+ .6 .5 .4 .3 .2 .1 0 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1.0 p Fig. 7 -- Entropy in the
593
+ case of two possibilities with probabilities pand 1 p. , The quantity Hhas a
594
+ number of interesting properties which further substantiate it as a reasonable
595
+ measure of choice or information. 1. H 0 if and only if all the pibut one are
596
+ zero, this one having the value unity. Thus only when we = are certain of the
597
+ outcome does Hvanish. Otherwise His positive. 2. For a given n, His a maximum
598
+ and equal to log nwhen all the piare equal (i.e., 1 ). This is also n
599
+ intuitively the most uncertain situation. 8See, for example, R. C. Tolman,
600
+ Principles of Statistical Mechanics,Oxford, Clarendon, 1938. 11
601
+ ===============================================================================
602
+ 3. Suppose there are two events, xand y, in question with mpossibilities for
603
+ the first and nfor the second. Let p i jbe the probability of the joint
604
+ occurrence of ifor the first and jfor the second. The entropy of the ;
605
+
606
+ joint event is H x y p i jlog p i j ;
607
+
608
+ = , ;
609
+
610
+ ;
611
+
612
+ i j ;
613
+
614
+ while H x p i jlogp i j = , ;
615
+
616
+ ;
617
+
618
+ i j j ;
619
+
620
+ H y p i jlogp i j = , ;
621
+
622
+ ;
623
+
624
+ : i j i ;
625
+
626
+ It is easily shown that H x y H x H y ;
627
+
628
+ + with equality only if the events are independent (i.e., p i j p i p j). The
629
+ uncertainty of a joint event is ;
630
+
631
+ = less than or equal to the sum of the individual uncertainties. 4. Any change
632
+ toward equalization of the probabilities p1 p2 pnincreases H. Thus if p1 p2 and
633
+ ;
634
+
635
+ ;
636
+
637
+ : : : ;
638
+
639
+ we increase p1, decreasing p2 an equal amount so that p1 and p2 are more nearly
640
+ equal, then Hincreases.More generally, if we perform any "averaging" operation
641
+ on the piof the form p0 i ai j p j = j where i ai j 1, and all ai j 0, then
642
+ Hincreases (except in the special case where this transfor- = j ai j= mation
643
+ amounts to no more than a permutation of the p jwith Hof course remaining the
644
+ same). 5. Suppose there are two chance events xand yas in 3, not necessarily
645
+ independent. For any particular value ithat xcan assume there is a conditional
646
+ probability pi jthat yhas the value j. This is given by p i j ;
647
+
648
+ pi j = : j p i j ;
649
+
650
+ We define the conditional entropyof y, Hx yas the average of the entropy of
651
+ yfor each value of x, weighted according to the probability of getting that
652
+ particular x. That is Hx y p i jlogpi j = , ;
653
+
654
+ : i j ;
655
+
656
+ This quantity measures how uncertain we are of yon the average when we know x.
657
+ Substituting the value of pi jwe obtain Hx y p i jlog p i jp i jlogp i j = , ;
658
+
659
+ ;
660
+
661
+ + ;
662
+
663
+ ;
664
+
665
+ i j i j j ;
666
+
667
+ ;
668
+
669
+ H x y H x = ;
670
+
671
+ , or H x y H x Hx y ;
672
+
673
+ = + : The uncertainty (or entropy) of the joint event x yis the uncertainty of
674
+ xplus the uncertainty of ywhen xis ;
675
+
676
+ known. 6. From 3 and 5 we have H x H y H x y H x Hx y + ;
677
+
678
+ = + : Hence H y Hx y : The uncertainty of yis never increased by knowledge of
679
+ x. It will be decreased unless xand yare independentevents, in which case it is
680
+ not changed. 12
681
+ ===============================================================================
682
+ 7. THE ENTROPY OF AN INFORMATION SOURCE Consider a discrete source of the
683
+ finite state type considered above. For each possible state ithere will be aset
684
+ of probabilities pi jof producing the various possible symbols j. Thus there is
685
+ an entropy Hifor each state. The entropy of the source will be defined as the
686
+ average of these Hiweighted in accordance with theprobability of occurrence of
687
+ the states in question: H PiHi = iPipi jlogpi j = , : i j ;
688
+
689
+ This is the entropy of the source per symbol of text. If the Markoff process is
690
+ proceeding at a definite timerate there is also an entropy per second H0 fiHi =
691
+ i where fiis the average frequency (occurrences per second) of state i. Clearly
692
+ H0 mH = where mis the average number of symbols produced per second. Hor H0
693
+ measures the amount of informa-tion generated by the source per symbol or per
694
+ second. If the logarithmic base is 2, they will represent bitsper symbol or per
695
+ second. If successive symbols are independent then His simply pilog piwhere
696
+ piis the probability of sym- , bol i. Suppose in this case we consider a long
697
+ message of Nsymbols. It will contain with high probabilityabout p1Noccurrences
698
+ of the first symbol, p2Noccurrences of the second, etc. Hence the probability
699
+ of thisparticular message will be roughly p pp1N pp2N ppnN = 1 2 n or log p: N
700
+ pilog pi = i log p: NH = , log 1 p H = : = : N His thus approximately the
701
+ logarithm of the reciprocal probability of a typical long sequence divided by
702
+ thenumber of symbols in the sequence. The same result holds for any source.
703
+ Stated more precisely we have(see Appendix 3): Theorem 3:Given any 0 and 0, we
704
+ can find an N0 such that the sequences of any length N N0 fall into two
705
+ classes: 1. A set whose total probability is less than . 2. The remainder, all
706
+ of whose members have probabilities satisfying the inequality log p1 , H , : N
707
+ log p1 , In other words we are almost certain to have very close to Hwhen Nis
708
+ large. N A closely related result deals with the number of sequences of various
709
+ probabilities. Consider again the sequences of length Nand let them be arranged
710
+ in order of decreasing probability. We define n qto be the number we must take
711
+ from this set starting with the most probable one in order to accumulate a
712
+ totalprobability qfor those taken. 13
713
+ ===============================================================================
714
+ Theorem 4: log n q Lim H = N N ! when qdoes not equal 0 or 1. We may interpret
715
+ log n qas the number of bits required to specify the sequence when we consider
716
+ only log n q the most probable sequences with a total probability q. Then is
717
+ the number of bits per symbol for N the specification. The theorem says that
718
+ for large Nthis will be independent of qand equal to H. The rateof growth of
719
+ the logarithm of the number of reasonably probable sequences is given by H,
720
+ regardless of ourinterpretation of "reasonably probable." Due to these results,
721
+ which are proved in Appendix 3, it is possiblefor most purposes to treat the
722
+ long sequences as though there were just 2HNof them, each with a probability2
723
+ HN , . The next two theorems show that Hand H0 can be determined by limiting
724
+ operations directly from the statistics of the message sequences, without
725
+ reference to the states and transition probabilities betweenstates. Theorem 5:
726
+ Let p Bibe the probability of a sequence Biof symbols from the source. Let 1 GN
727
+ p Bilogp Bi = , N i where the sum is over all sequences Bicontaining Nsymbols.
728
+ Then GNis a monotonic decreasing functionof Nand Lim GN H = : N ! Theorem 6:Let
729
+ p Bi S jbe the probability of sequence Bifollowed by symbol S jand pB S j ;
730
+
731
+ i = p Bi S j p Bibe the conditional probability of S jafter Bi. Let ;
732
+
733
+ = FN p Bi SjlogpB Sj = , ;
734
+
735
+ i i j ;
736
+
737
+ where the sum is over all blocks Biof N 1 symbols and over all symbols S j.
738
+ Then FNis a monotonic , decreasing function of N, FN NGN N 1 GN1 = , , ;
739
+
740
+ , 1 N GN Fn = ;
741
+
742
+ N n1 = FN GN ;
743
+
744
+ and LimN FN H. = ! These results are derived in Appendix 3. They show that a
745
+ series of approximations to Hcan be obtained by considering only the
746
+ statistical structure of the sequences extending over 1 2 Nsymbols. FNis the ;
747
+
748
+ ;
749
+
750
+ : : : ;
751
+
752
+ better approximation. In fact FNis the entropy of the Nth order approximation
753
+ to the source of the typediscussed above. If there are no statistical
754
+ influences extending over more than Nsymbols, that is if theconditional
755
+ probability of the next symbol knowing the preceding N 1 is not changed by a
756
+ knowledge of , any before that, then FN H. FNof course is the conditional
757
+ entropy of the next symbol when the N 1 = , preceding ones are known, while
758
+ GNis the entropy per symbol of blocks of Nsymbols. The ratio of the entropy of
759
The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols will be called its relative entropy. This is the maximum compression possible when we encode into the same alphabet. One minus the relative entropy is the redundancy. The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%. This means that when we write English half of what we write is determined by the structure of the language and half is chosen freely. The figure 50% was found by several independent methods which all gave results in this neighborhood. One is by calculation of the entropy of the approximations to English. A second method is to delete a certain fraction of the letters from a sample of English text and then let someone attempt to restore them. If they can be restored when 50% are deleted the redundancy must be greater than 50%.
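A crude version of the first of these methods can be sketched in Ruby. It uses only single-letter frequencies (the first-order approximation), so it will report a much smaller redundancy than the 50% figure, which depends on structure extending over many letters; the sample string is an arbitrary stand-in for a corpus:

  # First-order entropy of letter frequencies, relative entropy and redundancy.
  sample = "this is only a short illustrative sample of ordinary english text"
  letters = sample.downcase.delete("^a-z")
  counts = Hash.new(0)
  letters.each_char { |c| counts[c] += 1 }
  total = letters.length.to_f
  h1 = -counts.values.sum { |c| (c / total) * Math.log2(c / total) }
  max_h = Math.log2(26)                      # all 26 letters equally likely
  puts "H1 = #{h1.round(3)} bits/letter"
  puts "relative entropy ~ #{(h1 / max_h).round(2)}, redundancy ~ #{(1 - h1 / max_h).round(2)}"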
A third method depends on certain known results in cryptography.

Two extremes of redundancy in English prose are represented by Basic English and by James Joyce's book "Finnegans Wake". The Basic English vocabulary is limited to 850 words and the redundancy is very high. This is reflected in the expansion that occurs when a passage is translated into Basic English. Joyce on the other hand enlarges the vocabulary and is alleged to achieve a compression of semantic content.

The redundancy of a language is related to the existence of crossword puzzles. If the redundancy is zero any sequence of letters is a reasonable text in the language and any two-dimensional array of letters forms a crossword puzzle. If the redundancy is too high the language imposes too many constraints for large crossword puzzles to be possible. A more detailed analysis shows that if we assume the constraints imposed by the language are of a rather chaotic and random nature, large crossword puzzles are just possible when the redundancy is 50%. If the redundancy is 33%, three-dimensional crossword puzzles should be possible, etc.

8. REPRESENTATION OF THE ENCODING AND DECODING OPERATIONS

We have yet to represent mathematically the operations performed by the transmitter and receiver in encoding and decoding the information. Either of these will be called a discrete transducer. The input to the transducer is a sequence of input symbols and its output a sequence of output symbols. The transducer may have an internal memory so that its output depends not only on the present input symbol but also on the past history. We assume that the internal memory is finite, i.e., there exist a finite number $m$ of possible states of the transducer and that its output is a function of the present state and the present input symbol. The next state will be a second function of these two quantities. Thus a transducer can be described by two functions:

$$y_n = f(x_n, \alpha_n)$$
$$\alpha_{n+1} = g(x_n, \alpha_n)$$

where $x_n$ is the $n$th input symbol, $\alpha_n$ is the state of the transducer when the $n$th input symbol is introduced, and $y_n$ is the output symbol (or sequence of output symbols) produced when $x_n$ is introduced if the state is $\alpha_n$.
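A minimal Ruby sketch of such a transducer follows; the class name and the particular f and g are invented for illustration, not taken from the paper. The example instance XORs each input bit with the previous one, and it is non-singular in the sense discussed next, since the original input can be recovered by a running XOR of the output:

  # A finite-state transducer: y_n = f(x_n, state), state_{n+1} = g(x_n, state).
  class DiscreteTransducer
    def initialize(output_fn, next_state_fn, initial_state)
      @f = output_fn
      @g = next_state_fn
      @state = initial_state
    end

    def transduce(inputs)
      inputs.map do |x|
        y = @f.call(x, @state)       # output from present input and state
        @state = @g.call(x, @state)  # update the internal state
        y
      end
    end
  end

  f = ->(x, s) { x ^ s }             # output is input XOR previous input
  g = ->(x, _s) { x }                # state remembers the last input bit
  t = DiscreteTransducer.new(f, g, 0)
  p t.transduce([1, 1, 0, 1, 0, 0])  # => [1, 0, 1, 1, 1, 0]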
If the output symbols of one transducer can be identified with the input symbols of a second, they
+ can be connected in tandem and the result is also a transducer. If there exists
806
+ a second transducer which operateson the output of the first and recovers the
807
+ original input, the first transducer will be called non-singular andthe second
808
+ will be called its inverse. Theorem 7:The output of a finite state transducer
809
+ driven by a finite state statistical source is a finite state statistical
810
+ source, with entropy (per unit time) less than or equal to that of the input.
811
+ If the transduceris non-singular they are equal. Let represent the state of the
812
+ source, which produces a sequence of symbols xi;
813
+
814
+ and let be the state of the transducer, which produces, in its output, blocks
815
+ of symbols y j. The combined system can be representedby the "product state
816
+ space" of pairs . Two points in the space 1 1 and 2 2 , are connected by ;
817
+
818
+ ;
819
+
820
+ ;
821
+
822
+ a line if 1 can produce an xwhich changes 1 to 2, and this line is given the
823
+ probability of that xin this case. The line is labeled with the block of y
824
+ jsymbols produced by the transducer. The entropy of the outputcan be calculated
825
+ as the weighted sum over the states. If we sum first on each resulting term is
826
+ less than or equal to the corresponding term for , hence the entropy is not
827
+ increased. If the transducer is non-singularlet its output be connected to the
828
+ inverse transducer. If H0 , H0 and H0 are the output entropies of the source, 1
829
+ 2 3 the first and second transducers respectively, then H0 H0 H0 H0 and
830
+ therefore H0 H0 . 1 2 3 = 1 1 = 2 15
831
+ ===============================================================================
832
Suppose we have a system of constraints on possible sequences of the type which can be represented by a linear graph as in Fig. 2. If probabilities $p_{ij}^{(s)}$ were assigned to the various lines connecting state $i$ to state $j$ this would become a source. There is one particular assignment which maximizes the resulting entropy (see Appendix 4).

Theorem 8: Let the system of constraints considered as a channel have a capacity $C = \log W$. If we assign

$$p_{ij}^{(s)} = \frac{B_j}{B_i} W^{-\ell_{ij}^{(s)}}$$

where $\ell_{ij}^{(s)}$ is the duration of the $s$th symbol leading from state $i$ to state $j$ and the $B_i$ satisfy

$$B_i = \sum_{s,j} B_j W^{-\ell_{ij}^{(s)}},$$

then $H$ is maximized and equal to $C$.

By proper assignment of the transition probabilities the entropy of symbols on a channel can be maximized at the channel capacity.
9. THE FUNDAMENTAL THEOREM FOR A NOISELESS CHANNEL

We will now justify our interpretation of $H$ as the rate of generating information by proving that $H$ determines the channel capacity required with most efficient coding.

Theorem 9: Let a source have entropy $H$ (bits per symbol) and a channel have a capacity $C$ (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate $\frac{C}{H} - \epsilon$ symbols per second over the channel where $\epsilon$ is arbitrarily small. It is not possible to transmit at an average rate greater than $\frac{C}{H}$.

The converse part of the theorem, that $\frac{C}{H}$ cannot be exceeded, may be proved by noting that the entropy of the channel input per second is equal to that of the source, since the transmitter must be non-singular, and also this entropy cannot exceed the channel capacity. Hence $H' \le C$ and the number of symbols per second $= H'/H \le C/H$.

The first part of the theorem will be proved in two different ways. The first method is to consider the set of all sequences of $N$ symbols produced by the source. For $N$ large we can divide these into two groups, one containing less than $2^{(H+\eta)N}$ members and the second containing less than $2^{RN}$ members (where $R$ is the logarithm of the number of different symbols) and having a total probability less than $\mu$. As $N$ increases $\eta$ and $\mu$ approach zero. The number of signals of duration $T$ in the channel is greater than $2^{(C-\theta)T}$ with $\theta$ small when $T$ is large. If we choose

$$T = \left(\frac{H}{C} + \lambda\right) N$$

then there will be a sufficient number of sequences of channel symbols for the high probability group when $N$ and $T$ are sufficiently large (however small $\lambda$) and also some additional ones. The high probability group is coded in an arbitrary one-to-one way into this set. The remaining sequences are represented by larger sequences, starting and ending with one of the sequences not used for the high probability group. This special sequence acts as a start and stop signal for a different code. In between a sufficient time is allowed to give enough different sequences for all the low probability messages. This will require

$$T_1 = \left(\frac{R}{C} + \varphi\right) N$$

where $\varphi$ is small. The mean rate of transmission in message symbols per second will then be greater than

$$\left[(1-\delta)\frac{T}{N} + \delta\frac{T_1}{N}\right]^{-1} = \left[(1-\delta)\left(\frac{H}{C}+\lambda\right) + \delta\left(\frac{R}{C}+\varphi\right)\right]^{-1}.$$

As $N$ increases $\delta$, $\lambda$ and $\varphi$ approach zero and the rate approaches $\frac{C}{H}$.

Another method of performing this coding and thereby proving the theorem can be described as follows: Arrange the messages of length $N$ in order of decreasing probability and suppose their probabilities are $p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_n$. Let $P_s = \sum_{i=1}^{s-1} p_i$; that is $P_s$ is the cumulative probability up to, but not including, $p_s$. We first encode into a binary system. The binary code for message $s$ is obtained by expanding $P_s$ as a binary number. The expansion is carried out to $m_s$ places, where $m_s$ is the integer satisfying:

$$\log_2 \frac{1}{p_s} \le m_s < 1 + \log_2 \frac{1}{p_s}.$$

Thus the messages of high probability are represented by short codes and those of low probability by long codes. From these inequalities we have

$$\frac{1}{2^{m_s}} \le p_s < \frac{1}{2^{m_s - 1}}.$$

The code for $P_s$ will differ from all succeeding ones in one or more of its $m_s$ places, since all the remaining $P_i$ are at least $\frac{1}{2^{m_s}}$ larger and their binary expansions therefore differ in the first $m_s$ places. Consequently all the codes are different and it is possible to recover the message from its code. If the channel sequences are not already sequences of binary digits, they can be ascribed binary numbers in an arbitrary fashion and the binary code thus translated into signals suitable for the channel.

The average number $H'$ of binary digits used per symbol of original message is easily estimated. We have

$$H' = \frac{1}{N} \sum m_s p_s.$$

But,

$$\frac{1}{N} \sum \left(\log_2 \frac{1}{p_s}\right) p_s \le \frac{1}{N} \sum m_s p_s < \frac{1}{N} \sum \left(1 + \log_2 \frac{1}{p_s}\right) p_s$$

and therefore,

$$G_N \le H' < G_N + \frac{1}{N}.$$

As $N$ increases $G_N$ approaches $H$, the entropy of the source, and $H'$ approaches $H$.

We see from this that the inefficiency in coding, when only a finite delay of $N$ symbols is used, need not be greater than $\frac{1}{N}$ plus the difference between the true entropy $H$ and the entropy $G_N$ calculated for sequences of length $N$. The per cent excess time needed over the ideal is therefore less than

$$\frac{G_N}{H} + \frac{1}{HN} - 1.$$
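The encoding rule of this second proof can be written out directly. The Ruby sketch below uses an illustrative message set: it sorts the messages by decreasing probability, forms the cumulative probability P_s, and expands it to m_s = ceil(log2(1/p_s)) binary places. For the probabilities 1/2, 1/4, 1/8, 1/8 it reproduces the code A -> 0, B -> 10, C -> 110, D -> 111 used in a later example of the text:

  # Binary expansion of cumulative probabilities, carried to m_s places.
  def shannon_code(probabilities)
    sorted = probabilities.sort_by { |_, p| -p }
    cumulative = 0.0
    sorted.map do |msg, p|
      m = Math.log2(1.0 / p).ceil            # number of binary places for this message
      bits = ""
      frac = cumulative
      m.times { frac *= 2; bit = frac.floor; bits << bit.to_s; frac -= bit }
      cumulative += p
      [msg, bits]
    end.to_h
  end

  p shannon_code("A" => 0.5, "B" => 0.25, "C" => 0.125, "D" => 0.125)
  # => {"A"=>"0", "B"=>"10", "C"=>"110", "D"=>"111"}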
This method of encoding is substantially the same as one found independently by R. M. Fano.⁹ His method is to arrange the messages of length $N$ in order of decreasing probability. Divide this series into two groups of as nearly equal probability as possible. If the message is in the first group its first binary digit will be 0, otherwise 1. The groups are similarly divided into subsets of nearly equal probability and the particular subset determines the second binary digit. This process is continued until each subset contains only one message. It is easily seen that apart from minor differences (generally in the last digit) this amounts to the same thing as the arithmetic process described above.

10. DISCUSSION AND EXAMPLES

In order to obtain the maximum
+ power transfer from a generator to a load, a transformer must in general
910
+ beintroduced so that the generator as seen from the load has the load
911
+ resistance. The situation here is roughlyanalogous. The transducer which does
912
+ the encoding should match the source to the channel in a statisticalsense. The
913
+ source as seen from the channel through the transducer should have the same
914
+ statistical structure 9Technical Report No. 65, The Research Laboratory of
915
+ Electronics, M.I.T., March 17, 1949. 17
916
+ ===============================================================================
917
+ as the source which maximizes the entropy in the channel. The content of
918
+ Theorem 9 is that, although anexact match is not in general possible, we can
919
+ approximate it as closely as desired. The ratio of the actualrate of
920
+ transmission to the capacity Cmay be called the efficiency of the coding
921
+ system. This is of courseequal to the ratio of the actual entropy of the
922
+ channel symbols to the maximum possible entropy. In general, ideal or nearly
923
+ ideal encoding requires a long delay in the transmitter and receiver. In the
924
+ noiseless case which we have been considering, the main function of this delay
925
+ is to allow reasonably goodmatching of probabilities to corresponding lengths
926
of sequences. With a good code the logarithm of the reciprocal probability of a long message must be proportional to the duration of the corresponding signal; in fact

$$\left| \frac{\log p^{-1}}{T} - C \right|$$

must be small for all but a small fraction of the long
+ messages. If a source can produce only one particular message its entropy is
930
+ zero, and no channel is required. For example, a computing machine set up to
931
+ calculate the successive digits of produces a definite sequence with no chance
932
+ element. No channel is required to "transmit" this to another point. One could
933
+ construct asecond machine to compute the same sequence at the point. However,
934
+ this may be impractical. In such a casewe can choose to ignore some or all of
935
+ the statistical knowledge we have of the source. We might considerthe digits of
936
+ to be a random sequence in that we construct a system capable of sending any
937
+ sequence of digits. In a similar way we may choose to use some of our
938
+ statistical knowledge of English in constructinga code, but not all of it. In
939
+ such a case we consider the source with the maximum entropy subject to
940
+ thestatistical conditions we wish to retain. The entropy of this source
941
+ determines the channel capacity whichis necessary and sufficient. In the
942
+ example the only information retained is that all the digits are chosen from
943
+ the set 0 1 9. In the case of English one might wish to use the statistical
944
+ saving possible due to ;
945
+
946
+ ;
947
+
948
+ : : : ;
949
+
950
+ letter frequencies, but nothing else. The maximum entropy source is then the
951
+ first approximation to Englishand its entropy determines the required channel
952
capacity.

As a simple example of some of these results consider a source which produces a sequence of letters chosen from among A, B, C, D with probabilities $\frac12, \frac14, \frac18, \frac18$, successive symbols being chosen independently. We have

$$H = -\left(\tfrac12 \log \tfrac12 + \tfrac14 \log \tfrac14 + \tfrac28 \log \tfrac18\right) = \tfrac74 \text{ bits per symbol.}$$

Thus we can approximate a coding system to encode messages from this source into binary digits with an average of $\frac74$ binary digits per symbol. In this case we can actually achieve the limiting value by the following code (obtained by the method of the second proof of Theorem 9):

A  0
B  10
C  110
D  111

The average number of binary digits used in encoding a sequence of $N$ symbols will be

$$N\left(\tfrac12 \times 1 + \tfrac14 \times 2 + \tfrac28 \times 3\right) = \tfrac74 N.$$

It is easily seen that the binary digits 0, 1 have probabilities $\frac12$, $\frac12$ so the $H$ for the coded sequences is one bit per symbol. Since, on the average, we have $\frac74$ binary symbols per original letter, the entropies on a time basis are the same. The maximum possible entropy for the original set is $\log 4 = 2$, occurring when A, B, C, D have probabilities $\frac14, \frac14, \frac14, \frac14$. Hence the relative entropy is $\frac78$. We can translate the binary sequences into the original set of symbols on a two-to-one basis by the following table:

00  A'
01  B'
10  C'
11  D'

This double process then encodes the original message into the same symbols but with an average compression ratio $\frac78$.
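A quick numerical check of this example in Ruby, using the same source and the same code as above (nothing new is assumed):

  probs = { "A" => 0.5, "B" => 0.25, "C" => 0.125, "D" => 0.125 }
  code  = { "A" => "0", "B" => "10", "C" => "110", "D" => "111" }

  h = -probs.values.sum { |p| p * Math.log2(p) }          # 7/4 bits per symbol
  avg_len = probs.sum { |sym, p| p * code[sym].length }   # 7/4 binary digits per symbol
  relative_entropy = h / Math.log2(4)                     # maximum entropy is log 4 = 2
  puts "H = #{h}  average code length = #{avg_len}  relative entropy = #{relative_entropy}"
  # H = 1.75  average code length = 1.75  relative entropy = 0.875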
As a second example consider a source which produces a sequence of A's and B's with probability $p$ for A and $q$ for B. If $p \ll q$ we have

$$H = -\log p^p (1-p)^{1-p} = -p \log p(1-p)^{(1-p)/p} \doteq p \log \frac{e}{p}.$$

In such a case one can construct a fairly good coding of the message on a 0, 1 channel by sending a special sequence, say 0000, for the infrequent symbol A and then a sequence indicating the number of B's following it. This could be indicated by the binary representation with all numbers containing the special sequence deleted. All numbers up to 16 are represented as usual; 16 is represented by the next binary number after 16 which does not contain four zeros, namely 17 = 10001, etc. It can be shown that as $p \to 0$ the coding approaches ideal provided the length of the special sequence is properly
+ adjusted. PART II: THE DISCRETE CHANNEL WITH NOISE 11. REPRESENTATION OF A
984
+ NOISY DISCRETE CHANNEL We now consider the case where the signal is perturbed
985
+ by noise during transmission or at one or the otherof the terminals. This means
986
+ that the received signal is not necessarily the same as that sent out by
987
+ thetransmitter. Two cases may be distinguished. If a particular transmitted
988
+ signal always produces the samereceived signal, i.e., the received signal is a
989
+ definite function of the transmitted signal, then the effect may becalled
990
+ distortion. If this function has an inverse -- no two transmitted signals
991
+ producing the same receivedsignal -- distortion may be corrected, at least in
992
+ principle, by merely performing the inverse functionaloperation on the received
993
+ signal. The case of interest here is that in which the signal does not always
994
+ undergo the same change in trans- mission. In this case we may assume the
995
+ received signal Eto be a function of the transmitted signal Sand asecond
996
+ variable, the noise N. E f S N = ;
997
+
998
+ The noise is considered to be a chance variable just as the message was above.
999
+ In general it may be repre-sented by a suitable stochastic process. The most
1000
+ general type of noisy discrete channel we shall consideris a generalization of
1001
+ the finite state noise-free channel described previously. We assume a finite
1002
+ number ofstates and a set of probabilities p i j ;
1003
+
1004
+ : ;
1005
+
1006
+ This is the probability, if the channel is in state and symbol iis transmitted,
1007
+ that symbol jwill be received and the channel left in state . Thus and range
1008
+ over the possible states, iover the possible transmitted signals and jover the
1009
+ possible received signals. In the case where successive symbols are
1010
+ independently per-turbed by the noise there is only one state, and the channel
1011
+ is described by the set of transition probabilities pi j, the probability of
1012
+ transmitted symbol ibeing received as j. If a noisy channel is fed by a source
1013
there are two statistical processes at work: the source and the noise. Thus there are a number of entropies that can be calculated. First there is the entropy $H(x)$ of the source or of the input to the channel (these will be equal if the transmitter is non-singular). The entropy of the output of the channel, i.e., the received signal, will be denoted by $H(y)$. In the noiseless case $H(y) = H(x)$. The joint entropy of input and output will be $H(x,y)$. Finally there are two conditional entropies $H_x(y)$ and $H_y(x)$, the entropy of the output when the input is known and conversely. Among these quantities we have the relations

$$H(x,y) = H(x) + H_x(y) = H(y) + H_y(x).$$

All of these entropies can be measured on a per-second or a per-
+ symbol basis. 19
1025
+ ===============================================================================
1026
12. EQUIVOCATION AND CHANNEL CAPACITY

If the channel is noisy it is not in general possible to reconstruct the original message or the transmitted signal with certainty by any operation on the received signal $E$. There are, however, ways of transmitting the information which are optimal in combating noise. This is the problem which we now consider.

Suppose there are two possible symbols 0 and 1, and we are transmitting at a rate of 1000 symbols per second with probabilities $p_0 = p_1 = \frac12$. Thus our source is producing information at the rate of 1000 bits per second. During transmission the noise introduces errors so that, on the average, 1 in 100 is received incorrectly (a 0 as 1, or 1 as 0). What is the rate of transmission of information? Certainly less than 1000 bits per second since about 1% of the received symbols are incorrect. Our first impulse might be to say the rate is 990 bits per second, merely subtracting the expected number of errors. This is not satisfactory since it fails to take into account the recipient's lack of knowledge of where the errors occur. We may carry it to an extreme case and suppose the noise so great that the received symbols are entirely independent of the transmitted symbols. The probability of receiving 1 is $\frac12$ whatever was transmitted and similarly for 0. Then about half of the received symbols are correct due to chance alone, and we would be giving the system credit for transmitting 500 bits per second while actually no information is being transmitted at all. Equally "good" transmission would be obtained by dispensing with the channel entirely and flipping a coin at the receiving point.

Evidently the proper correction to apply to the amount of information transmitted is the amount of this information which is missing in the received signal, or alternatively the uncertainty when we have received a signal of what was actually sent. From our previous discussion of entropy as a measure of uncertainty it seems reasonable to use the conditional entropy of the message, knowing the received signal, as a measure of this missing information. This is indeed the proper definition, as we shall see later. Following this idea the rate of actual transmission, $R$, would be obtained by subtracting from the rate of production (i.e., the entropy of the source) the average rate of conditional entropy.

$$R = H(x) - H_y(x)$$

The conditional entropy $H_y(x)$ will, for convenience, be called the equivocation. It measures the average ambiguity of the received signal.

In the example considered above, if a 0 is received the a posteriori probability that a 0 was transmitted is .99, and that a 1 was transmitted is .01. These figures are reversed if a 1 is received. Hence

$$H_y(x) = -[.99 \log .99 + .01 \log .01] = .081 \text{ bits/symbol}$$

or 81 bits per second. We may say that the system is transmitting at a rate $1000 - 81 = 919$ bits per second. In the extreme case where a 0 is equally likely to be received as a 0 or 1 and similarly for 1, the a posteriori probabilities are $\frac12$, $\frac12$ and

$$H_y(x) = -\left[\tfrac12 \log \tfrac12 + \tfrac12 \log \tfrac12\right] = 1 \text{ bit per symbol}$$

or 1000 bits per second. The rate of transmission is then 0 as it should be.
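The two calculations above are easy to reproduce. A small Ruby sketch (the helper name h_binary is ours, not the paper's); since H(x) here is 1 bit per symbol, the per-second rate is simply 1000 times (1 minus the equivocation):

  def h_binary(p)                      # entropy of a {p, 1-p} choice, in bits
    return 0.0 if p == 0.0 || p == 1.0
    -(p * Math.log2(p) + (1 - p) * Math.log2(1 - p))
  end

  symbols_per_second = 1000
  [0.01, 0.5].each do |error_prob|
    equivocation = h_binary(error_prob)                 # H_y(x), bits per symbol
    rate = symbols_per_second * (1 - equivocation)      # R = H(x) - H_y(x), per second
    printf("error prob %.2f: equivocation %.3f bits/symbol, rate %.0f bits/s\n",
           error_prob, equivocation, rate)
  end
  # error prob 0.01: equivocation 0.081 bits/symbol, rate 919 bits/s
  # error prob 0.50: equivocation 1.000 bits/symbol, rate 0 bits/s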
The following theorem gives a direct intuitive interpretation of the equivocation and also serves to justify
+ it as the unique appropriate measure. We consider a communication system and an
1069
+ observer (or auxiliarydevice) who can see both what is sent and what is
1070
+ recovered (with errors due to noise). This observer notesthe errors in the
1071
+ recovered message and transmits data to the receiving point over a "correction
1072
+ channel" toenable the receiver to correct the errors. The situation is
1073
+ indicated schematically in Fig. 8. Theorem 10:If the correction channel has a
1074
+ capacity equal to Hy xit is possible to so encode the correction data as to
1075
+ send it over this channel and correct all but an arbitrarily small fraction of
1076
+ the errors.This is not possible if the channel capacity is less than Hy x. 20
1077
+ ===============================================================================
1078
Fig. 8 -- Schematic diagram of a correction system (source, transmitter, receiver and correcting device; an observer comparing the original message M with the recovered message M' supplies correction data).

Roughly then, $H_y(x)$ is the
+ amount of additional information that must be supplied per second at the
1081
+ receiving point to correct the received message. To prove the first part,
1082
+ consider long sequences of received message M0 and corresponding original
1083
+ message M. There will be logarithmically T Hy xof the M's which could
1084
+ reasonably have produced each M0. Thus we have T Hy xbinary digits to send each
1085
+ Tseconds. This can be done with frequency of errors on a channel of capacity Hy
1086
+ x. The second part can be proved by noting, first, that for any discrete chance
1087
+ variables x, y, z Hy x z Hy x ;
1088
+
1089
+ : The left-hand side can be expanded to give Hy z Hyz x Hy x + Hyz x Hy x Hy z
1090
+ Hy x H z , , : If we identify xas the output of the source, yas the received
1091
+ signal and zas the signal sent over the correctionchannel, then the right-hand
1092
+ side is the equivocation less the rate of transmission over the correction
1093
+ channel.If the capacity of this channel is less than the equivocation the
1094
+ right-hand side will be greater than zero andHyz x 0. But this is the
1095
+ uncertainty of what was sent, knowing both the received signal and the
1096
+ correction signal. If this is greater than zero the frequency of errors cannot
1097
+ be arbitrarily small. Example: Suppose the errors occur at random in a sequence
1098
+ of binary digits: probability pthat a digit is wrongand q 1 pthat it is right.
1099
+ These errors can be corrected if their position is known. Thus the = ,
1100
+ correction channel need only send information as to these positions. This
1101
+ amounts to transmittingfrom a source which produces binary digits with
1102
+ probability pfor 1 (incorrect) and qfor 0 (correct).This requires a channel of
1103
+ capacity plog p qlogq , + which is the equivocation of the original system. The
1104
rate of transmission $R$ can be written in two other forms due to the identities noted above. We have

$$R = H(x) - H_y(x) = H(y) - H_x(y) = H(x) + H(y) - H(x,y).$$

The first defining expression has already been interpreted as the amount of information sent less the uncertainty of what was sent. The second measures the amount received less the part of this which is due to noise. The third is the sum of the two amounts less the joint entropy and therefore in a sense is the number of bits per second common to the two. Thus all three expressions have a certain intuitive significance.
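These identities can be verified numerically for any joint distribution. A small Ruby sketch (the joint distribution below is an arbitrary illustrative choice) computes the conditional entropies directly from the conditional distributions and checks that the three expressions for R agree:

  joint = { [0, 0] => 0.4, [0, 1] => 0.1, [1, 0] => 0.05, [1, 1] => 0.45 }

  ent = ->(probs) { -probs.sum { |p| p.zero? ? 0.0 : p * Math.log2(p) } }

  # marginal distributions of x and y
  px = joint.group_by { |(x, _), _| x }.transform_values { |pairs| pairs.sum { |_, p| p } }
  py = joint.group_by { |(_, y), _| y }.transform_values { |pairs| pairs.sum { |_, p| p } }

  h_x  = ent.call(px.values)
  h_y  = ent.call(py.values)
  h_xy = ent.call(joint.values)

  # H_y(x): average over y of the entropy of x given y (and symmetrically for H_x(y))
  hy_x = py.sum do |y, p_y|
    cond = joint.select { |(_, yy), _| yy == y }.values.map { |p| p / p_y }
    p_y * ent.call(cond)
  end
  hx_y = px.sum do |x, p_x|
    cond = joint.select { |(xx, _), _| xx == x }.values.map { |p| p / p_x }
    p_x * ent.call(cond)
  end

  r1 = h_x - hy_x
  r2 = h_y - hx_y
  r3 = h_x + h_y - h_xy
  puts [r1, r2, r3].map { |r| r.round(6) }.inspect   # all three agree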
+ the maximum possible rate of transmission, i.e., the rate when the source is
1116
+ properly matched to the channel. We therefore define the channel capacity by ,
1117
+ C Max H x Hy x = , where the maximum is with respect to all possible
1118
+ information sources used as input to the channel. If thechannel is noiseless,
1119
+ Hy x 0. The definition is then equivalent to that already given for a noiseless
1120
+ channel = since the maximum entropy for the channel is its capacity. 13. THE
1121
+ FUNDAMENTAL THEOREM FOR A DISCRETE CHANNEL WITH NOISE It may seem surprising
1122
+ that we should define a definite capacity Cfor a noisy channel since we can
1123
+ neversend certain information in such a case. It is clear, however, that by
1124
+ sending the information in a redundantform the probability of errors can be
1125
+ reduced. For example, by repeating the message many times and by astatistical
1126
+ study of the different received versions of the message the probability of
1127
+ errors could be made verysmall. One would expect, however, that to make this
1128
+ probability of errors approach zero, the redundancyof the encoding must
1129
+ increase indefinitely, and the rate of transmission therefore approach zero.
1130
+ This is byno means true. If it were, there would not be a very well defined
1131
+ capacity, but only a capacity for a givenfrequency of errors, or a given
1132
+ equivocation;
1133
+
1134
+ the capacity going down as the error requirements are mademore stringent.
1135
+ Actually the capacity Cdefined above has a very definite significance. It is
1136
+ possible to sendinformation at the rate Cthrough the channel with as small a
1137
+ frequency of errors or equivocation as desiredby proper encoding. This
1138
+ statement is not true for any rate greater than C. If an attempt is made to
1139
+ transmitat a higher rate than C, say C R1, then there will necessarily be an
1140
+ equivocation equal to or greater than the + excess R1. Nature takes payment by
1141
+ requiring just that much uncertainty, so that we are not actually gettingany
1142
+ more than Cthrough correctly. The situation is indicated in Fig. 9. The rate of
1143
+ information into the channel is plotted horizontally and the equivocation
1144
+ vertically. Any point above the heavy line in the shaded region can be attained
1145
+ and thosebelow cannot. The points on the line cannot in general be attained,
1146
+ but there will usually be two points onthe line that can. These results are the
1147
+ main justification for the definition of Cand will now be proved. Theorem 11:
1148
+ Let a discrete channel have the capacity Cand a discrete source the entropy per
1149
+ second H. If H Cthere exists a coding system such that the output of the source
1150
+ can be transmitted over the channel with an arbitrarily small frequency of
1151
+ errors (or an arbitrarily small equivocation). If H Cit is possible to encode
1152
+ the source so that the equivocation is less than H C where is arbitrarily
1153
+ small. There is no , + method of encoding which gives an equivocation less than
1154
+ H C. , The method of proving the first part of this theorem is not by
1155
+ exhibiting a coding method having the desired properties, but by showing that
1156
+ such a code must exist in a certain group of codes. In fact we will ATTAINABLE
1157
+ Hy x REGION 1.0 = OPE SL C H x Fig. 9 -- The equivocation possible for a given
1158
+ input entropy to a channel. 22
1159
+ ===============================================================================
1160
+ average the frequency of errors over this group and show that this average can
1161
+ be made less than . If theaverage of a set of numbers is less than there must
1162
+ exist at least one in the set which is less than . This will establish the
1163
+ desired result. The capacity Cof a noisy channel has been defined as , C Max H
1164
+ x Hy x = , where xis the input and ythe output. The maximization is over all
1165
+ sources which might be used as input tothe channel. Let S0 be a source which
1166
+ achieves the maximum capacity C. If this maximum is not actually achieved by
1167
+ any source let S0 be a source which approximates to giving the maximum rate.
1168
+ Suppose S0 is used asinput to the channel. We consider the possible transmitted
1169
+ and received sequences of a long duration T. Thefollowing will be true: 1. The
1170
+ transmitted sequences fall into two classes, a high probability group with
1171
+ about 2T H x members and the remaining sequences of small total probability. 2.
1172
+ Similarly the received sequences have a high probability set of about 2T H y
1173
+ members and a low probability set of remaining sequences. 3. Each high
1174
+ probability output could be produced by about 2THy x inputs. The probability of
1175
+ all other cases has a small total probability. All the 's and 's implied by the
1176
+ words "small" and "about" in these statements approach zero as we allow Tto
1177
+ increase and S0 to approach the maximizing source. The situation is summarized
1178
+ in Fig. 10 where the input sequences are points on the left and output
1179
+ sequences points on the right. The fan of cross lines represents the range of
1180
possible causes for a typical output.

Fig. 10 -- Schematic representation of the relations between inputs and outputs in a channel: $2^{H(x)T}$ high probability messages $M$, $2^{H(y)T}$ high probability received signals $E$, about $2^{H_y(x)T}$ reasonable causes for each $E$, and about $2^{H_x(y)T}$ reasonable effects for each $M$.

Now suppose we have
+ the relations between inputs and outputs in a channel. Now suppose we have
1184
+ another source producing information at rate Rwith R C. In the period Tthis
1185
+ source will have 2TRhigh probability messages. We wish to associate these with
1186
+ a selection of the possiblechannel inputs in such a way as to get a small
1187
+ frequency of errors. We will set up this association in all 23
1188
+ ===============================================================================
1189
+ possible ways (using, however, only the high probability group of inputs as
1190
+ determined by the source S0)and average the frequency of errors for this large
1191
+ class of possible coding systems. This is the same ascalculating the frequency
1192
+ of errors for a random association of the messages and channel inputs of
1193
+ durationT. Suppose a particular output y1 is observed. What is the probability
1194
+ of more than one message in the setof possible causes of y x 1? There are 2T
1195
+ Rmessages distributed at random in 2T H points. The probability of a particular
1196
+ point being a message is thus 2T R H x , : The probability that none of the
1197
+ points in the fan is a message (apart from the actual originating message) is x
1198
+ 2T Hy P 1 2T R H x , = , : Now R H x Hy xso R H x Hy x with positive.
1199
+ Consequently , , = , , x 2T Hy P 1 2 THy x T , , = , approaches (as T ) ! 1 2 T
1200
+ , , : Hence the probability of an error approaches zero and the first part of
1201
+ the theorem is proved. The second part of the theorem is easily shown by noting
1202
+ that we could merely send Cbits per second from the source, completely
1203
+ neglecting the remainder of the information generated. At the receiver
1204
+ theneglected part gives an equivocation H x Cand the part transmitted need only
1205
+ add . This limit can also , be attained in many other ways, as will be shown
1206
+ when we consider the continuous case. The last statement of the theorem is a
1207
+ simple consequence of our definition of C. Suppose we can encode a source with
1208
+ H x C ain such a way as to obtain an equivocation Hy x a with positive. Then =
1209
+ + = , R H x C aand = = + H x Hy x C , = + with positive. This contradicts the
1210
+ definition of Cas the maximum of H x Hy x. , Actually more has been proved than
1211
+ was stated in the theorem. If the average of a set of numbers is p p within of
1212
+ of their maximum, a fraction of at most can be more than below the maximum.
1213
+ Since is arbitrarily small we can say that almost all the systems are
1214
+ arbitrarily close to the ideal. 14. DISCUSSION The demonstration of Theorem 11,
1215
+ while not a pure existence proof, has some of the deficiencies of suchproofs.
1216
+ An attempt to obtain a good approximation to ideal coding by following the
1217
+ method of the proof isgenerally impractical. In fact, apart from some rather
1218
+ trivial cases and certain limiting situations, no explicitdescription of a
1219
+ series of approximation to the ideal has been found. Probably this is no
1220
+ accident but isrelated to the difficulty of giving an explicit construction for
1221
+ a good approximation to a random sequence. An approximation to the ideal would
1222
+ have the property that if the signal is altered in a reasonable way by the
1223
+ noise, the original can still be recovered. In other words the alteration will
1224
+ not in general bring itcloser to another reasonable signal than the original.
1225
+ This is accomplished at the cost of a certain amount ofredundancy in the
1226
+ coding. The redundancy must be introduced in the proper way to combat the
1227
+ particularnoise structure involved. However, any redundancy in the source will
1228
+ usually help if it is utilized at thereceiving point. In particular, if the
1229
+ source already has a certain redundancy and no attempt is made toeliminate it
1230
+ in matching to the channel, this redundancy will help combat noise. For
1231
+ example, in a noiselesstelegraph channel one could save about 50% in time by
1232
+ proper encoding of the messages. This is not doneand most of the redundancy of
1233
+ English remains in the channel symbols. This has the advantage, however,of
1234
+ allowing considerable noise in the channel. A sizable fraction of the letters
1235
+ can be received incorrectlyand still reconstructed by the context. In fact this
1236
+ is probably not a bad approximation to the ideal in manycases, since the
1237
+ statistical structure of English is rather involved and the reasonable English
1238
+ sequences arenot too far (in the sense required for the theorem) from a random
1239
+ selection. 24
1240
+ ===============================================================================
1241
+ As in the noiseless case a delay is generally required to approach the ideal
1242
+ encoding. It now has the additional function of allowing a large sample of
1243
+ noise to affect the signal before any judgment is madeat the receiving point as
1244
+ to the original message. Increasing the sample size always sharpens the
1245
+ possiblestatistical assertions. The content of Theorem 11 and its proof can be
1246
+ formulated in a somewhat different way which exhibits the connection with the
1247
+ noiseless case more clearly. Consider the possible signals of duration Tand
1248
+ supposea subset of them is selected to be used. Let those in the subset all be
1249
+ used with equal probability, and supposethe receiver is constructed to select,
1250
+ as the original signal, the most probable cause from the subset, when
1251
+ aperturbed signal is received. We define N T qto be the maximum number of
1252
+ signals we can choose for the ;
1253
+
1254
+ subset such that the probability of an incorrect interpretation is less than or
1255
+ equal to q. log N T q Theorem 12:Lim ;
1256
+
1257
+ C, where Cis the channel capacity, provided that qdoes not equal 0 or = T T !
1258
+ 1. In other words, no matter how we set out limits of reliability, we can
1259
+ distinguish reliably in time T enough messages to correspond to about CTbits,
1260
+ when Tis sufficiently large. Theorem 12 can be comparedwith the definition of
1261
+ the capacity of a noiseless channel given in Section 1. 15. EXAMPLE OF A
1262
+ DISCRETE CHANNEL AND ITS CAPACITY A simple example of a discrete channel is
1263
+ indicated in Fig. 11. There are three possible symbols. The first isnever
1264
+ affected by noise. The second and third each have probability pof coming
1265
+ through undisturbed, andqof being changed into the other of the pair. We have
1266
+ (letting plog p qlogqand Pand Qbe the = , + p q TRANSMITTED RECEIVED SYMBOLS
1267
+ SYMBOLS q p Fig. 11 -- Example of a discrete channel. probabilities of using
1268
+ the first and second symbols) H x Plog P 2QlogQ = , , Hy x 2Q = : We wish to
1269
+ choose Pand Qin such a way as to maximize H x Hy x, subject to the constraint P
1270
+ 2Q 1. , + = Hence we consider U Plog P 2QlogQ 2Q P 2Q = , , , + + U 1 logP 0 =
1271
+ , , + = P U 2 2 logQ 2 2 0 = , , , + = : Q Eliminating log P log Q = + P Qe Q =
1272
+ = 25
1273
+ ===============================================================================
1274
+ 1 P Q = = : 2 2 + + The channel capacity is then 2 C log + = : Note how this
1275
+ checks the obvious values in the cases p 1 and p 1 . In the first, 1 and C log
1276
+ 3, = = 2 = = which is correct since the channel is then noiseless with three
1277
+ possible symbols. If p 1 , 2 and = 2 = C log 2. Here the second and third
1278
+ symbols cannot be distinguished at all and act together like one = symbol. The
1279
+ first symbol is used with probability P 1 and the second and third together
1280
+ with probability = 2 1 . This may be distributed between them in any desired
1281
+ way and still achieve the maximum capacity. 2 For intermediate values of pthe
1282
+ channel capacity will lie between log 2 and log 3. The distinction between the
1283
+ second and third symbols conveys some information but not as much as in the
1284
+ noiseless case.The first symbol is used somewhat more frequently than the other
1285
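The capacity formula just derived can be evaluated numerically. The Ruby sketch below works in bits, so alpha is computed with log2 and beta = 2**alpha; the three calls check the two limiting cases mentioned in the text and one intermediate value:

  def capacity(p)
    q = 1 - p
    alpha = [p, q].sum { |x| x.zero? ? 0.0 : -x * Math.log2(x) }   # -[p log p + q log q]
    beta = 2**alpha
    Math.log2((beta + 2) / beta)
  end

  puts capacity(1.0)    # noiseless: log 3 ~ 1.585
  puts capacity(0.5)    # second and third symbols indistinguishable: log 2 = 1.0
  puts capacity(0.9)    # an intermediate value, between log 2 and log 3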
+ two because of its freedom from noise. 16. THE CHANNEL CAPACITY IN CERTAIN
1286
+ SPECIAL CASES If the noise affects successive channel symbols independently it
1287
+ can be described by a set of transitionprobabilities pi j. This is the
1288
+ probability, if symbol iis sent, that jwill be received. The maximum
1289
+ channelrate is then given by the maximum of PipijlogPipijPipijlogpij , + i j i
1290
+ i j ;
1291
+
1292
+ ;
1293
+
1294
+ where we vary the Pisubject to Pi 1. This leads by the method of Lagrange to
1295
+ the equations, = ps j ps jlog s 1 2 = = ;
1296
+
1297
+ ;
1298
+
1299
+ : : : : j i Pi pi j Multiplying by Psand summing on sshows that C. Let the
1300
+ inverse of ps j(if it exists) be hstso that = s hst psj t j. Then: =
1301
+ hstpsjlogpsjlogPipit Chst , = : s j i s ;
1302
+
1303
+ Hence: h i Pi pit exp Chsthst psjlog psj = , + i s s j ;
1304
+
1305
+ or, h i Pi hitexp Chsthstpsjlogpsj = , + : t s s j ;
1306
+
1307
+ This is the system of equations for determining the maximizing values of Pi,
1308
+ with Cto be determined so that Pi 1. When this is done Cwill be the channel
1309
+ capacity, and the Pithe proper probabilities for the = channel symbols to
1310
+ achieve this capacity. If each input symbol has the same set of probabilities
1311
+ on the lines emerging from it, and the same is true of each output symbol, the
1312
+ capacity can be easily calculated. Examples are shown in Fig. 12. In such a
1313
+ caseHx yis independent of the distribution of probabilities on the input
1314
+ symbols, and is given by pilog pi , where the piare the values of the
1315
+ transition probabilities from any input symbol. The channel capacity is Max H y
1316
+ Hx y Max H y pilogpi , = + : The maximum of H yis clearly log mwhere mis the
1317
+ number of output symbols, since it is possible to make them all equally
1318
+ probable by making the input symbols equally probable. The channel capacity is
1319
+ therefore C log m pilogpi = + : 26
1320
+ ===============================================================================
1321
+ 1 2 1 2 1 3 1 2 1 3 1 6 1 3 1 2 1 6 1 6 1 6 1 2 1 2 1 6 1 2 1 6 1 3 1 3 1 2 1 3
1322
+ 1 2 1 6 1 3 1 2 1 2 a b c Fig. 12 -- Examples of discrete channels with the
1323
+ same transition probabilities for each input and for each output. In Fig. 12a
1324
+ it would be C log 4 log2 log 2 = , = : This could be achieved by using only the
1325
+ 1st and 3d symbols. In Fig. 12b C log 4 2 log3 1 log6 = , 3 , 3 log 4 log3 1
1326
+ log2 = , , 3 5 log 1 2 3 = 3 : In Fig. 12c we have C log 3 1 log2 1 log3 1 log6
1327
+ = , 2 , 3 , 6 3 log = 1 1 1 : 2 2 3 3 6 6 Suppose the symbols fall into several
1328
+ groups such that the noise never causes a symbol in one group to be mistaken
1329
+ for a symbol in another group. Let the capacity for the nth group be Cn(in bits
1330
+ per second)when we use only the symbols in this group. Then it is easily shown
1331
+ that, for best use of the entire set, thetotal probability Pnof all symbols in
1332
+ the nth group should be 2Cn Pn= : 2Cn Within a group the probability is
1333
+ distributed just as it would be if these were the only symbols being used.The
1334
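Both formulas are easy to evaluate. In the Ruby sketch below the Fig. 12a and 12c transition probabilities follow directly from the capacities quoted above; for Fig. 12b a set of transition probabilities consistent with the quoted value (1/3, 1/3, 1/6, 1/6 over four output symbols) is assumed, since the figure itself is not reproduced here. The grouped-channel call shows that one noise-free symbol plus a noiseless binary pair gives log 3, as in the earlier three-symbol example with p = 1:

  # Uniform channel: C = log m + sum p_i log p_i (bits).
  def uniform_capacity(m, transition_probs)
    Math.log2(m) + transition_probs.sum { |p| p * Math.log2(p) }
  end

  puts uniform_capacity(4, [0.5, 0.5])                    # Fig. 12a: log 2 = 1
  puts uniform_capacity(4, [1/3.0, 1/3.0, 1/6.0, 1/6.0])  # Fig. 12b (assumed probs)
  puts uniform_capacity(3, [0.5, 1/3.0, 1/6.0])           # Fig. 12c

  # Non-interfering groups: C = log sum 2^{C_n}, with C_n in bits.
  def grouped_capacity(group_capacities)
    Math.log2(group_capacities.sum { |c| 2**c })
  end
  puts grouped_capacity([0.0, 1.0])   # one noise-free symbol + noiseless binary pair: log 3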
+ channel capacity is C log 2Cn = : 17. AN EXAMPLE OF EFFICIENT CODING The
1335
+ following example, although somewhat unrealistic, is a case in which exact
1336
+ matching to a noisy channelis possible. There are two channel symbols, 0 and 1,
1337
+ and the noise affects them in blocks of seven symbols.A block of seven is
1338
+ either transmitted without error, or exactly one symbol of the seven is
1339
+ incorrect. Theseeight possibilities are equally likely. We have C Max H y Hx y
1340
+ = , 1 7 8 log 1 = 7 + 8 8 4 bits/symbol = 7 : An efficient code, allowing
1341
+ complete correction of errors and transmitting at the rate C, is the following
1342
+ (found by a method due to R. Hamming): 27
1343
+ ===============================================================================
1344
+ Let a block of seven symbols be X1 X2 X7. Of these X3, X5, X6 and X7 are
1345
+ message symbols and ;
1346
+
1347
+ ;
1348
+
1349
+ : : : ;
1350
+
1351
+ chosen arbitrarily by the source. The other three are redundant and calculated
1352
+ as follows: X4 is chosen to make X4 X5 X6 X7 even = + + + X2 " " " " X2 X3 X6
1353
+ X7 " = + + + X1 " " " " X1 X3 X5 X7 " = + + + When a block of seven is received
1354
+ and are calculated and if even called zero, if odd called one. The ;
1355
+
1356
+ binary number then gives the subscript of the Xithat is incorrect (if 0 there
1357
+ was no error). APPENDIX 1 THE GROWTH OF THE NUMBER OF BLOCKS OF SYMBOLS WITH A
1358
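The code just described is the (7,4) Hamming code, and the parity rules translate directly into Ruby. The sketch below (function names are ours) encodes four message symbols, flips one symbol, and recovers the original block by computing alpha, beta and gamma as described:

  def hamming_encode(m3, m5, m6, m7)
    x = Array.new(8, 0)                     # index 0 unused; x[1]..x[7] are the block
    x[3], x[5], x[6], x[7] = m3, m5, m6, m7
    x[4] = (x[5] + x[6] + x[7]) % 2         # makes X4+X5+X6+X7 even
    x[2] = (x[3] + x[6] + x[7]) % 2         # makes X2+X3+X6+X7 even
    x[1] = (x[3] + x[5] + x[7]) % 2         # makes X1+X3+X5+X7 even
    x[1..7]
  end

  def hamming_correct(block)
    x = [0] + block                         # restore 1-based indexing as in the text
    alpha = (x[4] + x[5] + x[6] + x[7]) % 2
    beta  = (x[2] + x[3] + x[6] + x[7]) % 2
    gamma = (x[1] + x[3] + x[5] + x[7]) % 2
    pos = alpha * 4 + beta * 2 + gamma      # binary number alpha beta gamma
    x[pos] ^= 1 if pos > 0                  # flip the symbol named by the parities
    x[1..7]
  end

  sent = hamming_encode(1, 0, 1, 1)
  noisy = sent.dup
  noisy[4] ^= 1                             # corrupt X5 (array index 4)
  p hamming_correct(noisy) == sent          # => true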
+ FINITE STATE CONDITION Let Ni Lbe the number of blocks of symbols of length
1359
+ Lending in state i. Then we have , s N j L Ni L b = , i j i s ;
1360
+
1361
+ where b1 b2 bmare the length of the symbols which may be chosen in state iand
1362
+ lead to state j. These i j;
1363
+
1364
+ i j;
1365
+
1366
+ : : : ;
1367
+
1368
+ i j are linear difference equations and the behavior as L must be of the type !
1369
+ Nj A jW L = : Substituting in the difference equation s A bij jW L AiWL, = i s
1370
+ ;
1371
+
1372
+ or s A bij j AiW, = i s ;
1373
+
1374
+ s W b , i j i j Ai 0 , = : i s For this to be possible the determinant s D W a
1375
+ bij i j W, i j = j j = , s must vanish and this determines W, which is, of
1376
+ course, the largest real root of D 0. = The quantity Cis then given by log A jW
1377
+ L C Lim logW = L L = ! and we also note that the same growth properties result
1378
+ if we require that all blocks start in the same (arbi-trarily chosen) state.
1379
+ APPENDIX 2 DERIVATION OF H pilog pi = , 1 1 1 Let H A n. From condition (3) we
1380
+ can decompose a choice from smequally likely possi- ;
1381
+
1382
+ ;
1383
+
1384
+ : : : ;
1385
+
1386
+ = n n n bilities into a series of mchoices from sequally likely possibilities
1387
+ and obtain A sm mA s = : 28
1388
+ ===============================================================================
1389
+ Similarly A tn nA t = : We can choose narbitrarily large and find an mto
1390
+ satisfy sm tn s m1 + : Thus, taking logarithms and dividing by nlog s, m log t
1391
+ m 1 m log t or + , n log s n n n log s where is arbitrarily small. Now from the
1392
+ monotonic property of A n, A sm A tn A sm1 + mA s nA t m 1 A s + : Hence,
1393
+ dividing by nA s, m A t m 1 m A t or + , n A s n n n A s A t logt 2 A t Klogt ,
1394
+ = A s log s where Kmust be positive to satisfy (2). ni Now suppose we have a
1395
+ choice from npossibilities with commeasurable probabilities pi where = ni the
1396
+ niare integers. We can break down a choice from nipossibilities into a choice
1397
+ from npossibilitieswith probabilities p1 pnand then, if the ith was chosen, a
1398
+ choice from niwith equal probabilities. Using ;
1399
+
1400
+ : : : ;
1401
+
1402
+ condition (3) again, we equate the total choice from nias computed by two
1403
+ methods Klog ni H p1 pn K pilogni = ;
1404
+
1405
+ : : : ;
1406
+
1407
+ + : Hence h i H K pilogni pilogni = , ni K pilog K pilog pi = , = , : ni If the
1408
+ piare incommeasurable, they may be approximated by rationals and the same
1409
+ expression must holdby our continuity assumption. Thus the expression holds in
1410
+ general. The choice of coefficient Kis a matterof convenience and amounts to
1411
+ the choice of a unit of measure. APPENDIX 3 THEOREMS ON ERGODIC SOURCES If it
1412
+ is possible to go from any state with P 0 to any other along a path of
1413
+ probability p 0, the system is ergodic and the strong law of large numbers can
1414
+ be applied. Thus the number of times a given path pi jinthe network is
1415
+ traversed in a long sequence of length Nis about proportional to the
1416
+ probability of being ati, say Pi, and then choosing this path, Pi pi jN. If Nis
1417
+ large enough the probability of percentage error in this is less than so that
1418
+ for all but a set of small probability the actual numbers lie within the limits
1419
+ Pi pi j N : Hence nearly all sequences have a probability pgiven by P N p p
1420
+ ipij = i j 29
1421
+ ===============================================================================
1422
+ log p and is limited by N log p Pipij log pi j = N or log p Pipijlogpij , : N
1423
+ This proves Theorem 3. Theorem 4 follows immediately from this on calculating
1424
+ upper and lower bounds for n qbased on the possible range of values of pin
1425
+ Theorem 3. In the mixed (not ergodic) case if L piLi = and the entropies of the
1426
+ components are H1 H2 Hnwe have the Theorem:Lim logn q qis a decreasing step
1427
+ function, N N = ' ! s1 s , q Hs in the interval i q i ' = : 1 1 To prove
1428
+ Theorems 5 and 6 first note that FNis monotonic decreasing because increasing
1429
+ Nadds a subscript to a conditional entropy. A simple substitution for pB S in
1430
+ the definition of F i j Nshows that FN NGN N 1 GN1 = , , , 1 and summing this
1431
+ for all Ngives GN Fn. Hence GN FNand GNmonotonic decreasing. Also they = N must
1432
+ approach the same limit. By using Theorem 3 we see that Lim GN H. = N !
1433
+ APPENDIX 4 MAXIMIZING THE RATE FOR A SYSTEM OF CONSTRAINTS Suppose we have a
1434
+ set of constraints on sequences of symbols that is of the finite state type and
1435
+ can be s represented therefore by a linear graph. Let be the lengths of the
1436
+ various symbols that can occur in `i j s passing from state ito state j. What
1437
+ distribution of probabilities P ifor the different states and p for i j
1438
+ choosing symbol sin state iand going to state jmaximizes the rate of generating
1439
+ information under theseconstraints? The constraints define a discrete channel
1440
+ and the maximum rate must be less than or equal tothe capacity Cof this
1441
+ channel, since if all blocks of large length were equally likely, this rate
1442
+ would result,and if possible this would be best. We will show that this rate
1443
+ can be achieved by proper choice of the Piand s p . i j The rate in question is
1444
+ s s P i p log p N , i j i j = : s s P M i pij`i j s s s Let i j . Evidently for
1445
+ a maximum p kexp . The constraints on maximization are Pi ` = s`i j i j= `i j =
1446
+ 1, j pi j 1, Pi pi j i j 0. Hence we maximize = , = Pipijlog pij , U Pi ipij
1447
+ jPi pij ij = P + + + , i pi j i j ` i U MPi1 log pi j NPi i j + + ` i iPi 0 = ,
1448
+ + = : pi j M2 + + 30
1449
+ ===============================================================================
1450
+ Solving for pi j pi j AiB jD,`ij = : Since p 1 i j 1 A, BjD,`ij = ;
1451
+
1452
+ i = j j B jD,`ij pi j= : s BsD,`is The correct value of Dis the capacity Cand
1453
+ the B jare solutions of B i j i BjC,` = for then B j pi j C,`ij = Bi Bj Pi
1454
+ C,`ij Pj = Bi or Pi Pj C,`ij= : Bi B j So that if isatisfy iC,`ij j = Pi Bi i =
1455
+ : Both the sets of equations for Biand ican be satisfied since Cis such that
1456
+ C,`ij i j 0 j , j = : In this case the rate is B B P j j i pi jlog C,`ij P B i
1457
+ pi jlog i B C i , = , Pi pi j i j Pipij ij ` ` but
1458
+ PipijlogBjlogBiPjlogBjPilogBi0 , = , = j Hence the rate is Cand as this could
1459
+ never be exceeded this is the maximum, justifying the assumed solution. 31
1460
+ ===============================================================================
1461
+ PART III: MATHEMATICAL PRELIMINARIES In this final installment of the paper we
1462
+ consider the case where the signals or the messages or both arecontinuously
1463
+ variable, in contrast with the discrete nature assumed heretofore. To a
1464
+ considerable extent thecontinuous case can be obtained through a limiting
1465
+ process from the discrete case by dividing the continuumof messages and signals
1466
+ into a large but finite number of small regions and calculating the various
1467
+ parametersinvolved on a discrete basis. As the size of the regions is decreased
1468
+ these parameters in general approach aslimits the proper values for the
1469
+ continuous case. There are, however, a few new effects that appear and alsoa
1470
+ general change of emphasis in the direction of specialization of the general
1471
+ results to particular cases. We will not attempt, in the continuous case, to
1472
+ obtain our results with the greatest generality, or with the extreme rigor of
1473
+ pure mathematics, since this would involve a great deal of abstract measure
1474
+ theoryand would obscure the main thread of the analysis. A preliminary study,
1475
+ however, indicates that the theorycan be formulated in a completely axiomatic
1476
+ and rigorous manner which includes both the continuous anddiscrete cases and
1477
+ many others. The occasional liberties taken with limiting processes in the
1478
+ present analysiscan be justified in all cases of practical interest. 18. SETS
1479
+ AND ENSEMBLES OF FUNCTIONS We shall have to deal in the continuous case with
1480
+ sets of functions and ensembles of functions. A set offunctions, as the name
1481
+ implies, is merely a class or collection of functions, generally of one
1482
+ variable, time.It can be specified by giving an explicit representation of the
1483
+ various functions in the set, or implicitly bygiving a property which functions
1484
+ in the set possess and others do not. Some examples are: 1. The set of
1485
+ functions: f t sin t = + : Each particular value of determines a particular
1486
+ function in the set. 2. The set of all functions of time containing no
1487
+ frequencies over Wcycles per second. 3. The set of all functions limited in
1488
+ band to Wand in amplitude to A. 4. The set of all English speech signals as
1489
+ functions of time. An ensembleof functions is a set of functions together with
1490
+ a probability measure whereby we may determine the probability of a function in
1491
+ the set having certain properties.1 For example with the set, f t sin t = + ;
1492
+
1493
+ we may give a probability distribution for , P . The set then becomes an
1494
+ ensemble. Some further examples of ensembles of functions are: 1. A finite set
1495
+ of functions fk t(k 1 2 n) with the probability of fkbeing pk. = ;
1496
+
1497
+ ;
1498
+
1499
+ : : : ;
1500
+
1501
+ 2. A finite dimensional family of functions f 1 2 n;
1502
+
1503
+ t ;
1504
+
1505
+ ;
1506
+
1507
+ : : : ;
1508
+
1509
+ with a probability distribution on the parameters i: p 1 n ;
1510
+
1511
+ : : : ;
1512
+
1513
+ : For example we could consider the ensemble defined by n f a1 an1 n;
1514
+
1515
+ t aisini t i ;
1516
+
1517
+ : : : ;
1518
+
1519
+ ;
1520
+
1521
+ ;
1522
+
1523
+ : : : ;
1524
+
1525
+ = ! + i1 = with the amplitudes aidistributed normally and independently, and
1526
+ the phases idistributed uniformly (from 0 to 2 ) and independently. 1In
1527
+ mathematical terminology the functions belong to a measure space whose total
1528
+ measure is unity. 32
1529
+ ===============================================================================
1530
3. The ensemble

$$f(a_i; t) = \sum_{n=-\infty}^{+\infty} a_n \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)}$$

with the $a_i$ normal and independent all with the same standard deviation $\sqrt{N}$. This is a representation of "white" noise, band limited to the band from 0 to $W$ cycles per second and with average power $N$.²
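A rough Ruby sketch of drawing one member function of this ensemble; the truncation of the infinite sum to a finite range of n, the Box-Muller sampler, and the particular values of W and N are implementation choices for illustration only:

  def gaussian(std_dev)
    # Box-Muller transform for a normal sample with mean 0
    std_dev * Math.sqrt(-2 * Math.log(1 - rand)) * Math.cos(2 * Math::PI * rand)
  end

  def sinc(x)
    x.zero? ? 1.0 : Math.sin(Math::PI * x) / (Math::PI * x)
  end

  w = 100.0                                     # bandwidth W in cycles per second
  power = 1.0                                   # average power N
  coeffs = (-200..200).map { gaussian(Math.sqrt(power)) }

  noise = ->(t) do
    # truncated version of the sum over n, with n = index - 200
    coeffs.each_with_index.sum { |a, i| a * sinc(2 * w * t - (i - 200)) }
  end
  puts noise.call(0.013)                        # one sample of one member function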
4. Let points be distributed on the $t$ axis according to a Poisson distribution. At each selected
+ point the function f tis placed and the different functions added, giving the
1537
+ ensemble f t tk + k =, where the tkare the points of the Poisson distribution.
1538
+ This ensemble can be considered as a type ofimpulse or shot noise where all the
1539
+ impulses are identical. 5. The set of English speech functions with the
1540
+ probability measure given by the frequency of occurrence in ordinary use. An
1541
+ ensemble of functions f tis stationaryif the same ensemble results when all
1542
+ functions are shifted any fixed amount in time. The ensemble f t sin t = + is
1543
+ stationary if is distributed uniformly from 0 to 2 . If we shift each function
1544
+ by t1 we obtain f t t1 sin t t1 + = + + sin t = + ' with distributed uniformly
1545
+ from 0 to 2 . Each function has changed but the ensemble as a whole is '
1546
+ invariant under the translation. The other examples given above are also
1547
+ stationary. An ensemble is ergodicif it is stationary, and there is no subset
1548
+ of the functions in the set with a probability different from 0 and 1 which is
1549
+ stationary. The ensemble sin t + is ergodic. No subset of these functions of
1550
+ probability 0 1 is transformed into itself under all time trans- 6= ;
1551
+
1552
+ lations. On the other hand the ensemble asin t + with adistributed normally and
1553
+ uniform is stationary but not ergodic. The subset of these functions with
1554
+ abetween 0 and 1 for example is stationary. Of the examples given, 3 and 4 are
1555
+ ergodic, and 5 may perhaps be considered so. If an ensemble is ergodic we may
1556
+ say roughly that each function in the set is typical of the ensemble. More
1557
+ precisely it isknown that with an ergodic ensemble an average of any statistic
1558
+ over the ensemble is equal (with probability1) to an average over the time
1559
+ translations of a particular function of the set.3 Roughly speaking,
1560
+ eachfunction can be expected, as time progresses, to go through, with the
1561
+ proper frequency, all the convolutionsof any of the functions in the set. 2This
1562
+ representation can be used as a definition of band limited white noise. It has
1563
+ certain advantages in that it involves fewer limiting operations than do
1564
+ definitions that have been used in the past. The name "white noise," already
1565
+ firmly entrenched in theliterature, is perhaps somewhat unfortunate. In optics
1566
+ white light means either any continuous spectrum as contrasted with a
1567
+ pointspectrum, or a spectrum which is flat with wavelength(which is not the
1568
+ same as a spectrum flat with frequency). 3This is the famous ergodic theorem or
1569
+ rather one aspect of this theorem which was proved in somewhat different
1570
+ formulations by Birkoff, von Neumann, and Koopman, and subsequently generalized
1571
+ by Wiener, Hopf, Hurewicz and others. The literature onergodic theory is quite
1572
+ extensive and the reader is referred to the papers of these writers for precise
1573
+ and general formulations;
1574
+
1575
+ e.g.,E. Hopf, "Ergodentheorie," Ergebnisse der Mathematik und ihrer
1576
+ Grenzgebiete,v. 5;
1577
+
1578
+ "On Causality Statistics and Probability," Journalof Mathematics and Physics,v.
1579
+ XIII, No. 1, 1934;
1580
+
1581
+ N. Wiener, "The Ergodic Theorem," Duke Mathematical Journal,v. 5, 1939. 33
1582
+ ===============================================================================
+ Just as we may perform various operations on numbers or functions to obtain
+ new numbers or functions, we can perform operations on ensembles to obtain new
+ ensembles. Suppose, for example, we have an ensemble of functions
+ $f_\alpha(t)$ and an operator T which gives for each function $f_\alpha(t)$ a
+ resulting function $g_\alpha(t)$:
+   $g_\alpha(t) = T f_\alpha(t).$
+ Probability measure is defined for the set $g_\alpha(t)$ by means of that for
+ the set $f_\alpha(t)$. The probability of a certain subset of the
+ $g_\alpha(t)$ functions is equal to that of the subset of the $f_\alpha(t)$
+ functions which produce members of the given subset of g functions under the
+ operation T. Physically this corresponds to passing the ensemble through some
+ device, for example, a filter, a rectifier or a modulator. The output
+ functions of the device form the ensemble $g_\alpha(t)$.
+ A device or operator T will be called invariant if shifting the input merely
+ shifts the output, i.e., if
+   $g_\alpha(t) = T f_\alpha(t)$
+ implies
+   $g_\alpha(t + t_1) = T f_\alpha(t + t_1)$
+ for all $f_\alpha(t)$ and all $t_1$. It is easily shown (see Appendix 5) that
+ if T is invariant and the input ensemble is stationary then the output
+ ensemble is stationary. Likewise if the input is ergodic the output will also
+ be ergodic.
+ A filter or a rectifier is invariant under all time translations. The
+ operation of modulation is not since the carrier phase gives a certain time
+ structure. However, modulation is invariant under all translations which are
+ multiples of the period of the carrier.
+ Wiener has pointed out the intimate relation between the invariance of
+ physical devices under time translations and Fourier theory.4 He has shown, in
+ fact, that if a device is linear as well as invariant Fourier analysis is then
+ the appropriate mathematical tool for dealing with the problem.
+ An ensemble of functions is the appropriate mathematical representation of the
+ messages produced by a continuous source (for example, speech), of the signals
+ produced by a transmitter, and of the perturbing noise. Communication theory
+ is properly concerned, as has been emphasized by Wiener, not with operations
+ on particular functions, but with operations on ensembles of functions. A
+ communication system is designed not for a particular speech function and
+ still less for a sine wave, but for the ensemble of speech functions.
+
+ 19. BAND LIMITED ENSEMBLES OF FUNCTIONS
+
+ If a function of time f(t) is limited to the band from 0 to W cycles per
+ second it is completely determined by giving its ordinates at a series of
+ discrete points spaced 1/2W seconds apart in the manner indicated by the
+ following result.5
+ Theorem 13: Let f(t) contain no frequencies over W. Then
+   $f(t) = \sum_{n=-\infty}^{\infty} X_n
+      \frac{\sin \pi(2Wt - n)}{\pi(2Wt - n)}$
+ where
+   $X_n = f\!\left(\frac{n}{2W}\right).$
+ 4 Communication theory is heavily indebted to Wiener for much of its basic
+ philosophy and theory. His classic NDRC report, The Interpolation,
+ Extrapolation and Smoothing of Stationary Time Series (Wiley, 1949), contains
+ the first clear-cut formulation of communication theory as a statistical
+ problem, the study of operations on time series. This work, although chiefly
+ concerned with the linear prediction and filtering problem, is an important
+ collateral reference in connection with the present paper. We may also refer
+ here to Wiener's Cybernetics (Wiley, 1948), dealing with the general problems
+ of communication and control.
+ 5 For a proof of this theorem and further discussion see the author's paper
+ "Communication in the Presence of Noise" published in the Proceedings of the
+ Institute of Radio Engineers, v. 37, No. 1, Jan., 1949, pp. 10-21. 34
+ ===============================================================================
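+ As an illustration of Theorem 13 (an editorial addition, not part of Shannon's
+ text), the Ruby sketch below rebuilds a band-limited signal from samples taken
+ every 1/2W seconds; the test signal, W, and the evaluation point are arbitrary
+ choices for the example.
+   # f(t) = sum_n X_n * sin(pi(2Wt - n)) / (pi(2Wt - n)), with X_n = f(n/2W).
+   def sinc(x)
+     x.abs < 1e-12 ? 1.0 : Math.sin(Math::PI * x) / (Math::PI * x)
+   end
+   def reconstruct(samples, w, t)
+     samples.each_with_index.sum { |x_n, n| x_n * sinc(2 * w * t - n) }
+   end
+   w = 1.0                                          # band limit, cycles per second
+   f = ->(t) { Math.sin(2 * Math::PI * 0.4 * t) }   # a signal inside the band
+   samples = (0..200).map { |n| f.call(n / (2.0 * w)) }
+   puts reconstruct(samples, w, 10.3)               # close to the value below
+   puts f.call(10.3)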
+ In this expansion f(t) is represented as a sum of orthogonal functions. The
+ coefficients $X_n$ of the various terms can be considered as coordinates in an
+ infinite dimensional "function space." In this space each function corresponds
+ to precisely one point and each point to one function.
+ A function can be considered to be substantially limited to a time T if all
+ the ordinates $X_n$ outside this interval of time are zero. In this case all
+ but 2TW of the coordinates will be zero. Thus functions limited to a band W
+ and duration T correspond to points in a space of 2TW dimensions.
+ A subset of the functions of band W and duration T corresponds to a region in
+ this space. For example, the functions whose total energy is less than or
+ equal to E correspond to points in a 2TW dimensional sphere with radius
+ $r = \sqrt{2WE}$.
+ An ensemble of functions of limited duration and band will be represented by a
+ probability distribution $p(x_1, \dots, x_n)$ in the corresponding
+ n dimensional space. If the ensemble is not limited in time we can consider
+ the 2TW coordinates in a given interval T to represent substantially the part
+ of the function in the interval T and the probability distribution
+ $p(x_1, \dots, x_n)$ to give the statistical structure of the ensemble for
+ intervals of that duration.
+
+ 20. ENTROPY OF A CONTINUOUS DISTRIBUTION
+
+ The entropy of a discrete set of probabilities $p_1, \dots, p_n$ has been
+ defined as:
+   $H = -\sum p_i \log p_i.$
+ In an analogous manner we define the entropy of a continuous distribution with
+ the density distribution function p(x) by:
+   $H = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx.$
+ With an n dimensional distribution $p(x_1, \dots, x_n)$ we have
+   $H = -\int \cdots \int p(x_1, \dots, x_n)
+        \log p(x_1, \dots, x_n)\, dx_1 \cdots dx_n.$
+ If we have two arguments x and y (which may themselves be multidimensional)
+ the joint and conditional entropies of p(x, y) are given by
+   $H(x, y) = -\iint p(x, y) \log p(x, y)\, dx\, dy$
+ and
+   $H_x(y) = -\iint p(x, y) \log \frac{p(x, y)}{p(x)}\, dx\, dy$
+   $H_y(x) = -\iint p(x, y) \log \frac{p(x, y)}{p(y)}\, dx\, dy$
+ where
+   $p(x) = \int p(x, y)\, dy, \qquad p(y) = \int p(x, y)\, dx.$
+ The entropies of continuous distributions have most (but not all) of the
+ properties of the discrete case. In particular we have the following:
+ 1. If x is limited to a certain volume v in its space, then H(x) is a maximum
+ and equal to log v when p(x) is constant (1/v) in the volume. 35
+ ===============================================================================
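+ The definition above can be checked numerically. The Ruby sketch that follows
+ (an editorial addition, not in the original) estimates the continuous entropy
+ on a grid; for the constant density of property 1 it returns approximately
+ log v, in natural-log units. The density and interval are assumptions for the
+ example.
+   # Estimate H = -Integral p(x) log p(x) dx by a midpoint sum.
+   def differential_entropy(density, a, b, steps = 10_000)
+     dx = (b - a).to_f / steps
+     (0...steps).sum do |i|
+       p = density.call(a + (i + 0.5) * dx)
+       p > 0 ? -p * Math.log(p) * dx : 0.0
+     end
+   end
+   v = 4.0
+   uniform = ->(_x) { 1.0 / v }                 # constant density on [0, v]
+   puts differential_entropy(uniform, 0.0, v)   # ~ log v = 1.386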
+ 2. With any two variables x, y we have
+   $H(x, y) \le H(x) + H(y)$
+ with equality if (and only if) x and y are independent, i.e.,
+ $p(x, y) = p(x)\, p(y)$ (apart possibly from a set of points of probability
+ zero).
+ 3. Consider a generalized averaging operation of the following type:
+   $p'(y) = \int a(x, y)\, p(x)\, dx$
+ with
+   $\int a(x, y)\, dx = \int a(x, y)\, dy = 1, \qquad a(x, y) \ge 0.$
+ Then the entropy of the averaged distribution $p'(y)$ is equal to or greater
+ than that of the original distribution p(x).
+ 4. We have
+   $H(x, y) = H(x) + H_x(y) = H(y) + H_y(x)$
+ and
+   $H_x(y) \le H(y).$
+ 5. Let p(x) be a one-dimensional distribution. The form of p(x) giving a
+ maximum entropy subject to the condition that the standard deviation of x be
+ fixed at $\sigma$ is Gaussian. To show this we must maximize
+   $H(x) = -\int p(x) \log p(x)\, dx$
+ with
+   $\sigma^2 = \int p(x)\, x^2\, dx \quad \text{and} \quad 1 = \int p(x)\, dx$
+ as constraints. This requires, by the calculus of variations, maximizing
+   $\int \bigl[-p(x) \log p(x) + \lambda p(x) x^2 + \mu p(x)\bigr]\, dx.$
+ The condition for this is
+   $-1 - \log p(x) + \lambda x^2 + \mu = 0$
+ and consequently (adjusting the constants to satisfy the constraints)
+   $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}.$
+ Similarly in n dimensions, suppose the second order moments of
+ $p(x_1, \dots, x_n)$ are fixed at $A_{ij}$:
+   $A_{ij} = \int \cdots \int x_i x_j\, p(x_1, \dots, x_n)\, dx_1 \cdots dx_n.$
+ Then the maximum entropy occurs (by a similar calculation) when
+ $p(x_1, \dots, x_n)$ is the n dimensional Gaussian distribution with the
+ second order moments $A_{ij}$. 36
+ ===============================================================================
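+ A small numeric check of property 5 (added editorially, not in the original):
+ among distributions with the same standard deviation the Gaussian has the
+ larger entropy. The sketch uses the Gaussian entropy $\log \sqrt{2\pi e}\,
+ \sigma$ derived in property 6 below and compares it with a uniform
+ distribution of equal standard deviation; the value of sigma is arbitrary.
+   def gaussian_entropy(sigma)
+     Math.log(Math.sqrt(2 * Math::PI * Math::E) * sigma)
+   end
+   def uniform_entropy(sigma)
+     Math.log(Math.sqrt(12.0) * sigma)   # a uniform with std dev sigma has width sqrt(12)*sigma
+   end
+   sigma = 2.0
+   puts gaussian_entropy(sigma)   # ~ 2.112
+   puts uniform_entropy(sigma)    # ~ 1.936, smaller, as property 5 requires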
+ 6. The entropy of a one-dimensional Gaussian distribution whose standard
+ deviation is $\sigma$ is given by
+   $H(x) = \log \sqrt{2\pi e}\, \sigma.$
+ This is calculated as follows:
+   $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}$
+   $-\log p(x) = \log \sqrt{2\pi}\,\sigma + \frac{x^2}{2\sigma^2}$
+   $H(x) = -\int p(x) \log p(x)\, dx
+      = \int p(x) \log \sqrt{2\pi}\,\sigma\, dx
+      + \int p(x)\, \frac{x^2}{2\sigma^2}\, dx$
+   $= \log \sqrt{2\pi}\,\sigma + \frac{\sigma^2}{2\sigma^2}
+      = \log \sqrt{2\pi}\,\sigma + \log \sqrt{e}
+      = \log \sqrt{2\pi e}\, \sigma.$
+ Similarly the n dimensional Gaussian distribution with associated quadratic
+ form $a_{ij}$ is given by
+   $p(x_1, \dots, x_n) = \frac{|a_{ij}|^{1/2}}{(2\pi)^{n/2}}
+      \exp\Bigl(-\tfrac{1}{2} \sum a_{ij} x_i x_j\Bigr)$
+ and the entropy can be calculated as
+   $H = \log (2\pi e)^{n/2} |a_{ij}|^{-1/2}$
+ where $|a_{ij}|$ is the determinant whose elements are $a_{ij}$.
+ 7. If x is limited to a half line ($p(x) = 0$ for $x \le 0$) and the first
+ moment of x is fixed at a:
+   $a = \int_0^{\infty} p(x)\, x\, dx,$
+ then the maximum entropy occurs when
+   $p(x) = \frac{1}{a}\, e^{-x/a}$
+ and is equal to log ea.
+ 8. There is one important difference between the continuous and discrete
+ entropies. In the discrete case the entropy measures in an absolute way the
+ randomness of the chance variable. In the continuous case the measurement is
+ relative to the coordinate system. If we change coordinates the entropy will
+ in general change. In fact if we change to coordinates $y_1, \dots, y_n$ the
+ new entropy is given by
+   $H(y) = -\int \cdots \int p(x_1, \dots, x_n)\, J\!\left(\frac{x}{y}\right)
+      \log p(x_1, \dots, x_n)\, J\!\left(\frac{x}{y}\right)\, dy_1 \cdots dy_n$
+ where $J(\frac{x}{y})$ is the Jacobian of the coordinate transformation. On
+ expanding the logarithm and changing the variables to $x_1, \dots, x_n$, we
+ obtain:
+   $H(y) = H(x) - \int \cdots \int p(x_1, \dots, x_n)
+      \log J\!\left(\frac{x}{y}\right)\, dx_1 \cdots dx_n.$ 37
+ ===============================================================================
+ Thus the new entropy is the old entropy less the expected logarithm of the
+ Jacobian. In the continuous case the entropy can be considered a measure of
+ randomness relative to an assumed standard, namely the coordinate system
+ chosen with each small volume element $dx_1 \cdots dx_n$ given equal weight.
+ When we change the coordinate system the entropy in the new system measures
+ the randomness when equal volume elements $dy_1 \cdots dy_n$ in the new system
+ are given equal weight.
+ In spite of this dependence on the coordinate system the entropy concept is as
+ important in the continuous case as the discrete case. This is due to the fact
+ that the derived concepts of information rate and channel capacity depend on
+ the difference of two entropies and this difference does not depend on the
+ coordinate frame, each of the two terms being changed by the same amount.
+ The entropy of a continuous distribution can be negative. The scale of
+ measurements sets an arbitrary zero corresponding to a uniform distribution
+ over a unit volume. A distribution which is more confined than this has less
+ entropy and will be negative. The rates and capacities will, however, always
+ be non-negative.
+ 9. A particular case of changing coordinates is the linear transformation
+   $y_j = \sum_i a_{ij} x_i.$
+ In this case the Jacobian is simply the determinant $|a_{ij}|^{-1}$ and
+   $H(y) = H(x) + \log |a_{ij}|.$
+ In the case of a rotation of coordinates (or any measure preserving
+ transformation) $J = 1$ and $H(y) = H(x)$.
+
+ 21. ENTROPY OF AN ENSEMBLE OF FUNCTIONS
+
+ Consider an ergodic ensemble of functions limited to a certain band of width W
+ cycles per second. Let
+   $p(x_1, \dots, x_n)$
+ be the density distribution function for amplitudes $x_1, \dots, x_n$ at n
+ successive sample points. We define the entropy of the ensemble per degree of
+ freedom by
+   $H' = -\lim_{n \to \infty} \frac{1}{n} \int \cdots \int
+      p(x_1, \dots, x_n) \log p(x_1, \dots, x_n)\, dx_1 \cdots dx_n.$
+ We may also define an entropy H per second by dividing, not by n, but by the
+ time T in seconds for n samples. Since n = 2TW, H = 2W H'.
+ With white thermal noise p is Gaussian and we have
+   $H' = \log \sqrt{2\pi e N},$
+   $H = W \log 2\pi e N.$
+ For a given average power N, white noise has the maximum possible entropy.
+ This follows from the maximizing properties of the Gaussian distribution noted
+ above.
+ The entropy for a continuous stochastic process has many properties analogous
+ to that for discrete processes. In the discrete case the entropy was related
+ to the logarithm of the probability of long sequences, and to the number of
+ reasonably probable sequences of long length. In the continuous case it is
+ related in a similar fashion to the logarithm of the probability density for a
+ long series of samples, and the volume of reasonably high probability in the
+ function space.
+ More precisely, if we assume $p(x_1, \dots, x_n)$ continuous in all the $x_i$
+ for all n, then for sufficiently large n
+   $\left| \frac{\log 1/p}{n} - H' \right| < \varepsilon$ 38
+ ===============================================================================
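+ The white-noise formulas just given are easy to evaluate; the Ruby lines below
+ (an editorial addition, not in the original) compute the entropy per degree of
+ freedom and per second for an assumed noise power and bandwidth, in natural
+ log units.
+   # H' = log sqrt(2*pi*e*N) per sample, H = 2W * H' = W log(2*pi*e*N) per second.
+   def white_noise_entropy_per_dof(n_power)
+     Math.log(Math.sqrt(2 * Math::PI * Math::E * n_power))
+   end
+   def white_noise_entropy_per_second(w, n_power)
+     w * Math.log(2 * Math::PI * Math::E * n_power)
+   end
+   puts white_noise_entropy_per_dof(1.0)            # ~ 1.419
+   puts white_noise_entropy_per_second(3000, 1.0)   # = 2 * 3000 * 1.419 ~ 8512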
+ for all choices of $x_1, \dots, x_n$ apart from a set whose total probability
+ is less than $\delta$, with $\delta$ and $\varepsilon$ arbitrarily small. This
+ follows from the ergodic property if we divide the space into a large number
+ of small cells.
+ The relation of H to volume can be stated as follows: Under the same
+ assumptions consider the n dimensional space corresponding to
+ $p(x_1, \dots, x_n)$. Let $V_n(q)$ be the smallest volume in this space which
+ includes in its interior a total probability q. Then
+   $\lim_{n \to \infty} \frac{\log V_n(q)}{n} = H'$
+ provided q does not equal 0 or 1.
+ These results show that for large n there is a rather well-defined volume (at
+ least in the logarithmic sense) of high probability, and that within this
+ volume the probability density is relatively uniform (again in the logarithmic
+ sense).
+ In the white noise case the distribution function is given by
+   $p(x_1, \dots, x_n) = \frac{1}{(2\pi N)^{n/2}}
+      \exp\Bigl(-\frac{1}{2N} \sum x_i^2\Bigr).$
+ Since this depends only on $\sum x_i^2$ the surfaces of equal probability
+ density are spheres and the entire distribution has spherical symmetry. The
+ region of high probability is a sphere of radius $\sqrt{nN}$. As
+ $n \to \infty$ the probability of being outside a sphere of radius
+ $\sqrt{n(N + \varepsilon)}$ approaches zero and $\frac{1}{n}$ times the
+ logarithm of the volume of the sphere approaches $\log \sqrt{2\pi e N}$.
+ In the continuous case it is convenient to work not with the entropy H of an
+ ensemble but with a derived quantity which we will call the entropy power.
+ This is defined as the power in a white noise limited to the same band as the
+ original ensemble and having the same entropy. In other words if H' is the
+ entropy of an ensemble its entropy power is
+   $N_1 = \frac{1}{2\pi e} \exp 2H'.$
+ In the geometrical picture this amounts to measuring the high probability
+ volume by the squared radius of a sphere having the same volume. Since white
+ noise has the maximum entropy for a given power, the entropy power of any
+ noise is less than or equal to its actual power.
+
+ 22. ENTROPY LOSS IN LINEAR FILTERS
+
+ Theorem 14: If an ensemble having an entropy $H_1$ per degree of freedom in
+ band W is passed through a filter with characteristic Y(f) the output ensemble
+ has an entropy
+   $H_2 = H_1 + \frac{1}{W} \int_W \log |Y(f)|^2\, df.$
+ The operation of the filter is essentially a linear transformation of
+ coordinates. If we think of the different frequency components as the original
+ coordinate system, the new frequency components are merely the old ones
+ multiplied by factors. The coordinate transformation matrix is thus
+ essentially diagonalized in terms of these coordinates. The Jacobian of the
+ transformation is (for n sine and n cosine components)
+   $J = \prod_{i=1}^{n} |Y(f_i)|^2$
+ where the $f_i$ are equally spaced through the band W. This becomes in the
+ limit
+   $\exp \frac{1}{W} \int_W \log |Y(f)|^2\, df.$
+ Since J is constant its average value is the same quantity and applying the
+ theorem on the change of entropy with a change of coordinates, the result
+ follows. We may also phrase it in terms of the entropy power. Thus if the
+ entropy power of the first ensemble is $N_1$ that of the second is
+   $N_2 = N_1 \exp \frac{1}{W} \int_W \log |Y(f)|^2\, df.$ 39
+ ===============================================================================
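+ The output entropy power of Theorem 14 is easy to evaluate numerically. The
+ Ruby sketch below (an editorial addition, not in the original) integrates
+ $\log |Y(f)|^2$ over the band for a sample gain characteristic; the triangular
+ gain chosen here is only an example.
+   # N2 = N1 * exp( (1/W) * Integral_0^W log|Y(f)|^2 df ), by a midpoint sum.
+   def output_entropy_power(n1, w, gain, steps = 10_000)
+     df = w.to_f / steps
+     integral = (0...steps).sum { |i| Math.log(gain.call((i + 0.5) * df)**2) * df }
+     n1 * Math.exp(integral / w)
+   end
+   w = 1.0
+   triangular = ->(f) { 1.0 - f / w }            # gain falling linearly to zero at W
+   puts output_entropy_power(1.0, w, triangular) # ~ 1/e^2 = 0.135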
+ TABLE I
+ GAIN                 ENTROPY POWER    ENTROPY POWER      IMPULSE RESPONSE
+                      FACTOR           GAIN IN DECIBELS
+ $1-\omega$           $1/e^2$          $-8.69$            $\sin^2(t/2) \big/ (\pi t^2/2)$
+ $1-\omega^2$         $(2/e)^4$        $-5.33$            $2(\sin t - t\cos t) \big/ \pi t^3$
+ $1-\omega^3$         $0.411$          $-3.87$            $6\bigl[(\cos t - 1)/t^4 - \cos t/2t^2 + \sin t/t^3\bigr]$
+ $\sqrt{1-\omega^2}$  $(2/e)^2$        $-2.67$            $(\pi/2)\, J_1(t)/t$
+ ...                  $1/e^2$          $-8.69$            ...
+ (Each gain characteristic is given over $0 \le \omega \le 1$.)
+ The final entropy power is the initial entropy power multiplied by the
+ geometric mean gain of the filter. If the gain is measured in db, then the
+ output entropy power will be increased by the arithmetic mean db gain over W.
+ In Table I the entropy power loss has been calculated (and also expressed in
+ db) for a number of ideal gain characteristics. The impulsive responses of
+ these filters are also given for $W = 2\pi$, with phase assumed to be 0.
+ The entropy loss for many other cases can be obtained from these results. For
+ example the entropy power factor $1/e^2$ for the first case also applies to
+ any gain characteristic obtained from $1-\omega$ by a measure preserving
+ transformation of the $\omega$ axis. In particular a linearly increasing gain
+ $G(\omega) = \omega$, or a "saw tooth" characteristic between 0 and 1 have the
+ same entropy loss. The reciprocal gain has the reciprocal factor. Thus
+ $1/\omega$ has the factor $e^2$. Raising the gain to any power raises the
+ factor to this power.
+
+ 23. ENTROPY OF A SUM OF TWO ENSEMBLES
+
+ If we have two ensembles of functions $f_\alpha(t)$ and $g_\beta(t)$ we can
+ form a new ensemble by "addition." Suppose the first ensemble has the
+ probability density function $p(x_1, \dots, x_n)$ and the second
+ $q(x_1, \dots, x_n)$. Then the 40
+ ===============================================================================
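+ The entropy power factors in Table I can be recomputed directly from the
+ geometric-mean-gain formula; the Ruby check below is an editorial addition,
+ not part of the original, and uses a simple midpoint integration over the
+ unit band.
+   # factor = exp( 2 * Integral_0^1 log G(w) dw )
+   def entropy_power_factor(gain, steps = 100_000)
+     dw = 1.0 / steps
+     integral = (0...steps).sum { |i| Math.log(gain.call((i + 0.5) * dw)) * dw }
+     Math.exp(2 * integral)
+   end
+   puts entropy_power_factor(->(w) { 1 - w })                 # ~ 0.135 = 1/e^2
+   puts entropy_power_factor(->(w) { 1 - w**2 })              # ~ 0.293 = (2/e)^4
+   puts entropy_power_factor(->(w) { 1 - w**3 })              # ~ 0.411
+   puts entropy_power_factor(->(w) { Math.sqrt(1 - w**2) })   # ~ 0.541 = (2/e)^2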
+ density function for the sum is given by the convolution:
+   $r(x_1, \dots, x_n) = \int \cdots \int p(y_1, \dots, y_n)\,
+      q(x_1 - y_1, \dots, x_n - y_n)\, dy_1 \cdots dy_n.$
+ Physically this corresponds to adding the noises or signals represented by the
+ original ensembles of functions.
+ The following result is derived in Appendix 6.
+ Theorem 15: Let the average power of two ensembles be $N_1$ and $N_2$ and let
+ their entropy powers be $\bar{N}_1$ and $\bar{N}_2$. Then the entropy power of
+ the sum, $\bar{N}_3$, is bounded by
+   $\bar{N}_1 + \bar{N}_2 \le \bar{N}_3 \le N_1 + N_2.$
+ White Gaussian noise has the peculiar property that it can absorb any other
+ noise or signal ensemble which may be added to it with a resultant entropy
+ power approximately equal to the sum of the white noise power and the signal
+ power (measured from the average signal value, which is normally zero),
+ provided the signal power is small, in a certain sense, compared to the noise.
+ Consider the function space associated with these ensembles having n
+ dimensions. The white noise corresponds to the spherical Gaussian distribution
+ in this space. The signal ensemble corresponds to another probability
+ distribution, not necessarily Gaussian or spherical. Let the second moments of
+ this distribution about its center of gravity be $a_{ij}$. That is, if
+ $p(x_1, \dots, x_n)$ is the density distribution function
+   $a_{ij} = \int \cdots \int p\, (x_i - \alpha_i)(x_j - \alpha_j)\,
+      dx_1 \cdots dx_n$
+ where the $\alpha_i$ are the coordinates of the center of gravity. Now
+ $a_{ij}$ is a positive definite quadratic form, and we can rotate our
+ coordinate system to align it with the principal directions of this form.
+ $a_{ij}$ is then reduced to diagonal form $b_{ii}$. We require that each
+ $b_{ii}$ be small compared to N, the squared radius of the spherical
+ distribution.
+ In this case the convolution of the noise and signal produce approximately a
+ Gaussian distribution whose corresponding quadratic form is
+   $N + b_{ii}.$
+ The entropy power of this distribution is
+   $\Bigl[\prod (N + b_{ii})\Bigr]^{1/n}$
+ or approximately
+   $= \Bigl[(N)^n + \sum b_{ii} (N)^{n-1}\Bigr]^{1/n}
+      \doteq N + \frac{1}{n} \sum b_{ii}.$
+ The last term is the signal power, while the first is the noise power.
+
+ PART IV: THE CONTINUOUS CHANNEL
+
+ 24. THE CAPACITY OF A CONTINUOUS CHANNEL
+
+ In a continuous channel the input or transmitted signals will be continuous
+ functions of time f(t) belonging to a certain set, and the output or received
+ signals will be perturbed versions of these. We will consider only the case
+ where both transmitted and received signals are limited to a certain band W.
+ They can then be specified, for a time T, by 2TW numbers, and their
+ statistical structure by finite dimensional distribution functions. Thus the
+ statistics of the transmitted signal will be determined by
+   $P(x_1, \dots, x_n) = P(x)$ 41
+ ===============================================================================
+ and those of the noise by the conditional probability distribution
+   $P_{x_1, \dots, x_n}(y_1, \dots, y_n) = P_x(y).$
+ The rate of transmission of information for a continuous channel is defined in
+ a way analogous to that for a discrete channel, namely
+   $R = H(x) - H_y(x)$
+ where H(x) is the entropy of the input and $H_y(x)$ the equivocation. The
+ channel capacity C is defined as the maximum of R when we vary the input over
+ all possible ensembles. This means that in a finite dimensional approximation
+ we must vary $P(x) = P(x_1, \dots, x_n)$ and maximize
+   $-\int P(x) \log P(x)\, dx
+      + \iint P(x, y) \log \frac{P(x, y)}{P(y)}\, dx\, dy.$
+ This can be written
+   $\iint P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\, dx\, dy$
+ using the fact that $\iint P(x, y) \log P(x)\, dx\, dy
+ = \int P(x) \log P(x)\, dx$. The channel capacity is thus expressed as
+ follows:
+   $C = \lim_{T \to \infty} \max_{P(x)} \frac{1}{T}
+      \iint P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\, dx\, dy.$
+ It is obvious in this form that R and C are independent of the coordinate
+ system since the numerator and denominator in
+ $\log \frac{P(x, y)}{P(x) P(y)}$ will be multiplied by the same factors when x
+ and y are transformed in any one-to-one way. This integral expression for C is
+ more general than $H(x) - H_y(x)$. Properly interpreted (see Appendix 7) it
+ will always exist while $H(x) - H_y(x)$ may assume an indeterminate form in
+ some cases. This occurs, for example, if x is limited to a surface of fewer
+ dimensions than n in its n dimensional approximation.
+ If the logarithmic base used in computing H(x) and $H_y(x)$ is two then C is
+ the maximum number of binary digits that can be sent per second over the
+ channel with arbitrarily small equivocation, just as in the discrete case.
+ This can be seen physically by dividing the space of signals into a large
+ number of small cells, sufficiently small so that the probability density
+ $P_x(y)$ of signal x being perturbed to point y is substantially constant over
+ a cell (either of x or y). If the cells are considered as distinct points the
+ situation is essentially the same as a discrete channel and the proofs used
+ there will apply. But it is clear physically that this quantizing of the
+ volume into individual points cannot in any practical situation alter the
+ final answer significantly, provided the regions are sufficiently small. Thus
+ the capacity will be the limit of the capacities for the discrete subdivisions
+ and this is just the continuous capacity defined above.
+ On the mathematical side it can be shown first (see Appendix 7) that if u is
+ the message, x is the signal, y is the received signal (perturbed by noise)
+ and v is the recovered message then
+   $H(x) - H_y(x) \ge H(u) - H_v(u)$
+ regardless of what operations are performed on u to obtain x or on y to obtain
+ v. Thus no matter how we encode the binary digits to obtain the signal, or how
+ we decode the received signal to recover the message, the discrete rate for
+ the binary digits does not exceed the channel capacity we have defined. On the
+ other hand, it is possible under very general conditions to find a coding
+ system for transmitting binary digits at the rate C with as small an
+ equivocation or frequency of errors as desired. This is true, for example, if,
+ when we take a finite dimensional approximating space for the signal
+ functions, P(x, y) is continuous in both x and y except at a set of points of
+ probability zero.
+ An important special case occurs when the noise is added to the signal and is
+ independent of it (in the probability sense). Then $P_x(y)$ is a function only
+ of the difference $n = (y - x)$,
+   $P_x(y) = Q(y - x)$ 42
+ ===============================================================================
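+ In the discretized picture described above, the capacity integrand is just the
+ mutual information sum over cells. The Ruby sketch below (an editorial
+ addition, not in the original) evaluates that sum for an assumed two-by-two
+ joint distribution, here a binary symmetric channel with equiprobable input.
+   # Rate sum_{x,y} P(x,y) log2( P(x,y) / (P(x)P(y)) ), in bits per symbol.
+   def rate(joint)
+     px = joint.map { |row| row.sum }
+     py = joint.transpose.map { |col| col.sum }
+     joint.each_with_index.sum do |row, i|
+       row.each_with_index.sum do |p, j|
+         p > 0 ? p * Math.log2(p / (px[i] * py[j])) : 0.0
+       end
+     end
+   end
+   joint = [[0.45, 0.05],
+            [0.05, 0.45]]
+   puts rate(joint)   # ~ 0.531 bits per symbol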
+ and we can assign a definite entropy to the noise (independent of the
+ statistics of the signal), namely the entropy of the distribution Q(n). This
+ entropy will be denoted by H(n).
+ Theorem 16: If the signal and noise are independent and the received signal is
+ the sum of the transmitted signal and the noise then the rate of transmission
+ is
+   $R = H(y) - H(n),$
+ i.e., the entropy of the received signal less the entropy of the noise. The
+ channel capacity is
+   $C = \max_{P(x)} \bigl[H(y) - H(n)\bigr].$
+ We have, since y = x + n:
+   $H(x, y) = H(x, n).$
+ Expanding the left side and using the fact that x and n are independent
+   $H(y) + H_y(x) = H(x) + H(n).$
+ Hence
+   $R = H(x) - H_y(x) = H(y) - H(n).$
+ Since H(n) is independent of P(x), maximizing R requires maximizing H(y), the
+ entropy of the received signal. If there are certain constraints on the
+ ensemble of transmitted signals, the entropy of the received signal must be
+ maximized subject to these constraints.
+
+ 25. CHANNEL CAPACITY WITH AN AVERAGE POWER LIMITATION
+
+ A simple application of Theorem 16 is the case when the noise is a white
+ thermal noise and the transmitted signals are limited to a certain average
+ power P. Then the received signals have an average power P + N where N is the
+ average noise power. The maximum entropy for the received signals occurs when
+ they also form a white noise ensemble since this is the greatest possible
+ entropy for a power P + N and can be obtained by a suitable choice of
+ transmitted signals, namely if they form a white noise ensemble of power P.
+ The entropy (per second) of the received ensemble is then
+   $H(y) = W \log 2\pi e (P + N),$
+ and the noise entropy is
+   $H(n) = W \log 2\pi e N.$
+ The channel capacity is
+   $C = H(y) - H(n) = W \log \frac{P + N}{N}.$
+ Summarizing we have the following:
+ Theorem 17: The capacity of a channel of band W perturbed by white thermal
+ noise power N when the average transmitter power is limited to P is given by
+   $C = W \log \frac{P + N}{N}.$
+ This means that by sufficiently involved encoding systems we can transmit
+ binary digits at the rate $W \log_2 \frac{P + N}{N}$ bits per second, with
+ arbitrarily small frequency of errors. It is not possible to transmit at a
+ higher rate by any encoding system without a definite positive frequency of
+ errors.
+ To approximate this limiting rate of transmission the transmitted signals must
+ approximate, in statistical properties, a white noise.6 A system which
+ approaches the ideal rate may be described as follows: Let
+ 6 This and other properties of the white noise case are discussed from the
+ geometrical point of view in "Communication in the Presence of Noise,"
+ loc. cit. 43
+ ===============================================================================
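+ Theorem 17 is the familiar capacity formula; as a quick numeric illustration
+ (an editorial addition, not in the original), the Ruby lines below evaluate it
+ for an arbitrary bandwidth and signal-to-noise ratio.
+   # C = W * log2((P + N) / N) bits per second.
+   def capacity_bits_per_second(w, p, n)
+     w * Math.log2((p + n) / n)
+   end
+   puts capacity_bits_per_second(3000.0, 10.0, 1.0)   # ~ 10,378 bits per second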
+ $M = 2^s$ samples of white noise be constructed each of duration T. These are
+ assigned binary numbers from 0 to $M - 1$. At the transmitter the message
+ sequences are broken up into groups of s and for each group the corresponding
+ noise sample is transmitted as the signal. At the receiver the M samples are
+ known and the actual received signal (perturbed by noise) is compared with
+ each of them. The sample which has the least R.M.S. discrepancy from the
+ received signal is chosen as the transmitted signal and the corresponding
+ binary number reconstructed. This process amounts to choosing the most
+ probable (a posteriori) signal. The number M of noise samples used will depend
+ on the tolerable frequency of errors, but for almost all selections of samples
+ we have
+   $\lim_{\varepsilon \to 0} \lim_{T \to \infty}
+      \frac{\log M(\varepsilon, T)}{T} = W \log \frac{P + N}{N},$
+ so that no matter how small $\varepsilon$ is chosen, we can, by taking T
+ sufficiently large, transmit as near as we wish to
+ $TW \log \frac{P + N}{N}$ binary digits in the time T.
+ Formulas similar to $C = W \log \frac{P + N}{N}$ for the white noise case have
+ been developed independently by several other writers, although with somewhat
+ different interpretations. We may mention the work of N. Wiener,7
+ W. G. Tuller,8 and H. Sullivan in this connection.
+ In the case of an arbitrary perturbing noise (not necessarily white thermal
+ noise) it does not appear that the maximizing problem involved in determining
+ the channel capacity C can be solved explicitly. However, upper and lower
+ bounds can be set for C in terms of the average noise power N and the noise
+ entropy power $N_1$. These bounds are sufficiently close together in most
+ practical cases to furnish a satisfactory solution to the problem.
+ Theorem 18: The capacity of a channel of band W perturbed by an arbitrary
+ noise is bounded by the inequalities
+   $W \log \frac{P + N_1}{N_1} \le C \le W \log \frac{P + N}{N_1}$
+ where
+   P = average transmitter power
+   N = average noise power
+   $N_1$ = entropy power of the noise.
+ Here again the average power of the perturbed signals will be P + N. The
+ maximum entropy for this power would occur if the received signal were white
+ noise and would be $W \log 2\pi e (P + N)$. It may not be possible to achieve
+ this; i.e., there may not be any ensemble of transmitted signals which, added
+ to the perturbing noise, produce a white thermal noise at the receiver, but at
+ least this sets an upper bound to H(y). We have, therefore
+   $C = \max \bigl[H(y) - H(n)\bigr]
+      \le W \log 2\pi e (P + N) - W \log 2\pi e N_1.$
+ This is the upper limit given in the theorem. The lower limit can be obtained
+ by considering the rate if we make the transmitted signal a white noise, of
+ power P. In this case the entropy power of the received signal must be at
+ least as great as that of a white noise of power $P + N_1$ since we have shown
+ in a previous theorem that the entropy power of the sum of two ensembles is
+ greater than or equal to the sum of the individual entropy powers. Hence
+   $\max H(y) \ge W \log 2\pi e (P + N_1)$
+ 7 Cybernetics, loc. cit.
+ 8 "Theoretical Limitations on the Rate of Transmission of Information,"
+ Proceedings of the Institute of Radio Engineers, v. 37, No. 5, May, 1949,
+ pp. 468-78. 44
+ ===============================================================================
+ and
+   $C \ge W \log 2\pi e (P + N_1) - W \log 2\pi e N_1
+      = W \log \frac{P + N_1}{N_1}.$
+ As P increases, the upper and lower bounds approach each other, so we have as
+ an asymptotic rate
+   $W \log \frac{P + N}{N_1}.$
+ If the noise is itself white, $N = N_1$ and the result reduces to the formula
+ proved previously:
+   $C = W \log \Bigl(1 + \frac{P}{N}\Bigr).$
+ If the noise is Gaussian but with a spectrum which is not necessarily flat,
+ $N_1$ is the geometric mean of the noise power over the various frequencies in
+ the band W. Thus
+   $N_1 = \exp \frac{1}{W} \int_W \log N(f)\, df$
+ where N(f) is the noise power at frequency f.
+ Theorem 19: If we set the capacity for a given transmitter power P equal to
+   $C = W \log \frac{P + N - \eta}{N_1}$
+ then $\eta$ is monotonic decreasing as P increases and approaches 0 as a
+ limit.
+ Suppose that for a given power $P_1$ the channel capacity is
+   $W \log \frac{P_1 + N - \eta_1}{N_1}.$
+ This means that the best signal distribution, say p(x), when added to the
+ noise distribution q(x), gives a received distribution r(y) whose entropy
+ power is $P_1 + N - \eta_1$. Let us increase the power to $P_1 + \Delta P$ by
+ adding a white noise of power $\Delta P$ to the signal. The entropy of the
+ received signal is now at least
+   $H(y) = W \log 2\pi e (P_1 + N - \eta_1 + \Delta P)$
+ by application of the theorem on the minimum entropy power of a sum. Hence,
+ since we can attain the H indicated, the entropy of the maximizing
+ distribution must be at least as great and $\eta$ must be monotonic
+ decreasing. To show that $\eta \to 0$ as $P \to \infty$ consider a signal
+ which is white noise with a large P. Whatever the perturbing noise, the
+ received signal will be approximately a white noise, if P is sufficiently
+ large, in the sense of having an entropy power approaching P + N.
+
+ 26. THE CHANNEL CAPACITY WITH A PEAK POWER LIMITATION
+
+ In some applications the transmitter is limited not by the average power
+ output but by the peak instantaneous power. The problem of calculating the
+ channel capacity is then that of maximizing (by variation of the ensemble of
+ transmitted symbols)
+   $H(y) - H(n)$
+ subject to the constraint that all the functions f(t) in the ensemble be less
+ than or equal to $\sqrt{S}$, say, for all t. A constraint of this type does
+ not work out as well mathematically as the average power limitation. The most
+ we have obtained for this case is a lower bound valid for all S/N, an
+ "asymptotic" upper bound (valid for large S/N) and an asymptotic value of C
+ for small S/N. 45
+ ===============================================================================
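+ The geometric-mean formula for the entropy power makes the Theorem 18 bounds
+ straightforward to compute. The Ruby sketch below is an editorial addition,
+ not part of the original; the rising noise spectrum, bandwidth, and powers are
+ arbitrary example values.
+   # N1 = exp( (1/W) * Integral_0^W log N(f) df ), then the two bounds on C.
+   def entropy_power(noise_spectrum, w, steps = 10_000)
+     df = w.to_f / steps
+     Math.exp((0...steps).sum { |i| Math.log(noise_spectrum.call((i + 0.5) * df)) * df } / w)
+   end
+   w  = 3000.0
+   nf = ->(f) { 1.0 + f / w }          # Gaussian noise whose power rises across the band
+   n  = 1.5                            # average noise power of this spectrum
+   n1 = entropy_power(nf, w)           # ~ 1.47, slightly below N
+   p  = 10.0
+   puts w * Math.log2((p + n1) / n1)   # lower bound on C, bits per second
+   puts w * Math.log2((p + n) / n1)    # upper bound on C, bits per second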
2155
+ Theorem 20:The channel capacity Cfor a band Wperturbed by white thermal noise
2156
+ of power Nis bounded by 2 S C Wlog ;
2157
+
2158
+ e3 N S where Sis the peak allowed transmitter power. For sufficiently large N 2
2159
+ S N + C Wlog e 1 + N S where is arbitrarily small. As 0 (and provided the band
2160
+ Wstarts at 0) ! N . S C Wlog 1 1 + ! : N S We wish to maximize the entropy of
2161
+ the received signal. If is large this will occur very nearly when N we maximize
2162
+ the entropy of the transmitted ensemble. The asymptotic upper bound is obtained
2163
+ by relaxing the conditions on the ensemble. Let us suppose that the power is
2164
+ limited to Snot at every instant of time, but only at the sample points. The
2165
+ maximum entropy ofthe transmitted ensemble under these weakened conditions is
2166
+ certainly greater than or equal to that under theoriginal conditions. This
2167
+ altered problem can be solved easily. The maximum entropy occurs if the
2168
+ different p p samples are independent and have a distribution function which is
2169
+ constant from Sto S. The entropy , + can be calculated as Wlog 4S: The received
2170
+ signal will then have an entropy less than Wlog 4S 2 eN1 + + S with 0 as and
2171
+ the channel capacity is obtained by subtracting the entropy of the white noise,
2172
+ ! ! N Wlog 2 eN: 2 S N + Wlog 4S 2 eN1 Wlog 2 eN Wlog e 1 + + , = + : N This is
2173
+ the desired upper bound to the channel capacity. To obtain a lower bound
2174
+ consider the same ensemble of functions. Let these functions be passed through
2175
+ an ideal filter with a triangular transfer characteristic. The gain is to be
2176
+ unity at frequency 0 and declinelinearly down to gain 0 at frequency W. We
2177
+ first show that the output functions of the filter have a peak sin 2 W t power
2178
+ limitation Sat all times (not just the sample points). First we note that a
2179
+ pulse going into 2 W t the filter produces 1 sin2 W t 2 W t2 in the output.
2180
+ This function is never negative. The input function (in the general case) can
2181
+ be thought of asthe sum of a series of shifted functions sin 2 W t a 2 W t p
2182
+ where a, the amplitude of the sample, is not greater than S. Hence the output
2183
+ is the sum of shifted functions of the non-negative form above with the same
2184
+ coefficients. These functions being non-negative, the greatest p positive value
2185
+ for any tis obtained when all the coefficients ahave their maximum positive
2186
+ values, i.e., S. p In this case the input function was a constant of amplitude
2187
+ Sand since the filter has unit gain for D.C., the output is the same. Hence the
2188
+ output ensemble has a peak power S. 46
2189
+ ===============================================================================
2190
+ The entropy of the output ensemble can be calculated from that of the input
2191
+ ensemble by using the theorem dealing with such a situation. The output entropy
2192
+ is equal to the input entropy plus the geometricalmean gain of the filter: Z W
2193
+ Z W W f2 log G2 d f log , d f 2W = = , : 0 0 W Hence the output entropy is 4S
2194
+ Wlog 4S 2W Wlog , = e2 and the channel capacity is greater than 2 S Wlog : e3 N
2195
+ S We now wish to show that, for small (peak signal power over average white
2196
+ noise power), the channel N capacity is approximately S C Wlog 1 = + : N . S S
2197
+ More precisely C Wlog 1 1 as 0. Since the average signal power Pis less than or
2198
+ equal + ! ! N N S to the peak S, it follows that for all N P S C Wlog 1 Wlog 1
2199
+ + + : N N S Therefore, if we can find an ensemble of functions such that they
2200
+ correspond to a rate nearly Wlog 1 + Nand are limited to band Wand peak Sthe
2201
+ result will be proved. Consider the ensemble of functions of the p p following
2202
+ type. A series of tsamples have the same value, either Sor S, then the next
2203
+ tsamples have + , p p the same value, etc. The value for a series is chosen at
2204
+ random, probability 1 for Sand 1 for S. If 2 + 2 , this ensemble be passed
2205
+ through a filter with triangular gain characteristic (unit gain at D.C.), the
2206
+ output ispeak limited to S. Furthermore the average power is nearly Sand can be
2207
+ made to approach this by taking t sufficiently large. The entropy of the sum of
2208
+ this and the thermal noise can be found by applying the theoremon the sum of a
2209
+ noise and a small signal. This theorem will apply if S p t N S is sufficiently
2210
+ small. This can be ensured by taking small enough (after tis chosen). The
2211
+ entropy power N will be S Nto as close an approximation as desired, and hence
2212
+ the rate of transmission as near as we wish + to S N Wlog + : N PART V: THE
2213
+ RATE FOR A CONTINUOUS SOURCE 27. FIDELITY EVALUATION FUNCTIONS In the case of a
2214
+ discrete source of information we were able to determine a definite rate of
2215
+ generatinginformation, namely the entropy of the underlying stochastic process.
2216
+ With a continuous source the situationis considerably more involved. In the
2217
+ first place a continuously variable quantity can assume an infinitenumber of
2218
+ values and requires, therefore, an infinite number of binary digits for exact
2219
+ specification. Thismeans that to transmit the output of a continuous source
2220
+ with exact recoveryat the receiving point requires, 47
2221
+ ===============================================================================
2222
+ in general, a channel of infinite capacity (in bits per second). Since,
2223
+ ordinarily, channels have a certainamount of noise, and therefore a finite
2224
+ capacity, exact transmission is impossible. This, however, evades the real
2225
+ issue. Practically, we are not interested in exact transmission when we have a
2226
+ continuous source, but only in transmission to within a certain tolerance. The
2227
+ question is, can weassign a definite rate to a continuous source when we
2228
+ require only a certain fidelity of recovery, measured ina suitable way. Of
2229
+ course, as the fidelity requirements are increased the rate will increase. It
2230
+ will be shownthat we can, in very general cases, define such a rate, having the
2231
+ property that it is possible, by properlyencoding the information, to transmit
2232
+ it over a channel whose capacity is equal to the rate in question, andsatisfy
2233
+ the fidelity requirements. A channel of smaller capacity is insufficient. It is
2234
+ first necessary to give a general mathematical formulation of the idea of
2235
+ fidelity of transmission. Consider the set of messages of a long duration, say
2236
+ Tseconds. The source is described by giving theprobability density, in the
2237
+ associated space, that the source will select the message in question P x. A
2238
+ given communication system is described (from the external point of view) by
2239
+ giving the conditional probabilityPx ythat if message xis produced by the
2240
+ source the recovered message at the receiving point will be y. The system as a
2241
+ whole (including source and transmission system) is described by the
2242
+ probability function P x y ;
2243
+
2244
+ of having message xand final output y. If this function is known, the complete
2245
+ characteristics of the systemfrom the point of view of fidelity are known. Any
2246
+ evaluation of fidelity must correspond mathematicallyto an operation applied to
2247
+ P x y. This operation must at least have the properties of a simple ordering of
2248
+ ;
2249
+
2250
+ systems;
2251
+
2252
+ i.e., it must be possible to say of two systems represented by P1 x yand P2 x
2253
+ ythat, according to ;
2254
+
2255
+ ;
2256
+
2257
+ our fidelity criterion, either (1) the first has higher fidelity, (2) the
2258
+ second has higher fidelity, or (3) they haveequal fidelity. This means that a
2259
+ criterion of fidelity can be represented by a numerically valued function: , v
2260
+ P x y ;
2261
+
2262
+ whose argument ranges over possible probability functions P x y. ;
2263
+
2264
+ , We will now show that under very general and reasonable assumptions the
2265
+ function v P x y can be ;
2266
+
2267
+ written in a seemingly much more specialized form, namely as an average of a
2268
+ function x yover the set ;
2269
+
2270
+ of possible values of xand y: Z Z , v P x y P x y x y dx dy ;
2271
+
2272
+ = ;
2273
+
2274
+ ;
2275
+
2276
+ : To obtain this we need only assume (1) that the source and system are ergodic
2277
+ so that a very long samplewill be, with probability nearly 1, typical of the
2278
+ ensemble, and (2) that the evaluation is "reasonable" in thesense that it is
2279
+ possible, by observing a typical input and output x1 and y1, to form a
2280
+ tentative evaluationon the basis of these samples;
2281
+
2282
+ and if these samples are increased in duration the tentative evaluation
2283
+ will,with probability 1, approach the exact evaluation based on a full
2284
+ knowledge of P x y. Let the tentative ;
2285
+
2286
+ evaluation be x y. Then the function x yapproaches (as T ) a constant for
2287
+ almost all x ywhich ;
2288
+
2289
+ ;
2290
+
2291
+ ! ;
2292
+
2293
+ are in the high probability region corresponding to the system: , x y v P x y ;
2294
+
2295
+ ! ;
2296
+
2297
+ and we may also write Z Z x y P x y x y dx dy ;
2298
+
2299
+ ! ;
2300
+
2301
+ ;
2302
+
2303
+ since Z Z P x y dx dy 1 ;
2304
+
2305
+ = : This establishes the desired result. The function x yhas the general nature
2306
+ of a "distance" between xand y.9 It measures how undesirable ;
2307
+
2308
+ it is (according to our fidelity criterion) to receive ywhen xis transmitted.
2309
+ The general result given abovecan be restated as follows: Any reasonable
2310
+ evaluation can be represented as an average of a distance functionover the set
2311
+ of messages and recovered messages xand yweighted according to the probability
2312
+ P x yof ;
2313
+
2314
+ getting the pair in question, provided the duration Tof the messages be taken
2315
+ sufficiently large. The following are simple examples of evaluation functions:
2316
+ 9It is not a "metric" in the strict sense, however, since in general it does
2317
+ not satisfy either x y y xor x y y z x z. ;
2318
+
2319
+ = ;
2320
+
2321
+ ;
2322
+
2323
+ + ;
2324
+
2325
+ ;
2326
+
2327
+ 48
2328
+ ===============================================================================
2329
+ 1. R.M.S. criterion. , 2 v x t y t = , : In this very commonly used measure of
2330
+ fidelity the distance function x yis (apart from a constant ;
2331
+
2332
+ factor) the square of the ordinary Euclidean distance between the points xand
2333
+ yin the associatedfunction space. 1 Z T 2 x y x t y t dt ;
2334
+
2335
+ = , : T0 2. Frequency weighted R.M.S. criterion. More generally one can apply
2336
+ different weights to the different frequency components before using an R.M.S.
2337
+ measure of fidelity. This is equivalent to passing thedifference x t y tthrough
2338
+ a shaping filter and then determining the average power in the output. , Thus
2339
+ let e t x t y t = , and Z f t e k t d = , , then 1 Z T x y f t2 dt ;
2340
+
2341
+ = : T0 3. Absolute error criterion. 1 Z T x y x t y t dt ;
2342
+
2343
+ = , : T0 4. The structure of the ear and brain determine implicitly an
2344
+ evaluation, or rather a number of evaluations, appropriate in the case of
2345
+ speech or music transmission. There is, for example, an
2346
+ "intelligibility"criterion in which x yis equal to the relative frequency of
2347
+ incorrectly interpreted words when ;
2348
+
2349
+ message x tis received as y t. Although we cannot give an explicit
2350
+ representation of x yin these ;
2351
+
2352
+ cases it could, in principle, be determined by sufficient experimentation. Some
2353
+ of its properties followfrom well-known experimental results in hearing, e.g.,
2354
+ the ear is relatively insensitive to phase and thesensitivity to amplitude and
2355
+ frequency is roughly logarithmic. 5. The discrete case can be considered as a
2356
+ specialization in which we have tacitly assumed an evaluation based on the
2357
+ frequency of errors. The function x yis then defined as the number of symbols
2358
+ in the ;
2359
+
2360
+ sequence ydiffering from the corresponding symbols in xdivided by the total
2361
+ number of symbols inx. 28. THE RATE FOR A SOURCE RELATIVE TO A FIDELITY
2362
+ EVALUATION We are now in a position to define a rate of generating information
2363
+ for a continuous source. We are givenP xfor the source and an evaluation
2364
+ vdetermined by a distance function x ywhich will be assumed ;
2365
+
2366
+ continuous in both xand y. With a particular system P x ythe quality is
2367
+ measured by ;
2368
+
2369
+ Z Z v x y P x y dx dy = ;
2370
+
2371
+ ;
2372
+
2373
+ : Furthermore the rate of flow of binary digits corresponding to P x yis ;
2374
+
2375
+ Z Z P x y R P x ylog ;
2376
+
2377
+ dx dy = ;
2378
+
2379
+ : P x P y We define the rate R1 of generating information for a given quality
2380
+ v1 of reproduction to be the minimum ofRwhen we keep vfixed at v1 and vary Px
2381
+ y. That is: Z Z P x y R ;
2382
+
2383
+ 1 Min P x ylog dx dy = ;
2384
+
2385
+ Px y P x P y 49
2386
+ ===============================================================================
2387
+ subject to the constraint: Z Z v1 P x y x y dx dy = ;
2388
+
2389
+ ;
2390
+
2391
+ : This means that we consider, in effect, all the communication systems that
2392
+ might be used and that transmit with the required fidelity. The rate of
2393
+ transmission in bits per second is calculated for each oneand we choose that
2394
+ having the least rate. This latter rate is the rate we assign the source for
2395
+ the fidelity inquestion. The justification of this definition lies in the
2396
+ following result: Theorem 21:If a source has a rate R1 for a valuation v1 it is
2397
+ possible to encode the output of the source and transmit it over a channel of
2398
+ capacity Cwith fidelity as near v1 as desired provided R1 C. This is not
2399
+ possible if R1 C. The last statement in the theorem follows immediately from
2400
+ the definition of R1 and previous results. If it were not true we could
2401
+ transmit more than Cbits per second over a channel of capacity C. The first
2402
+ partof the theorem is proved by a method analogous to that used for Theorem 11.
2403
+ We may, in the first place,divide the x yspace into a large number of small
2404
+ cells and represent the situation as a discrete case. This ;
2405
+
2406
+ will not change the evaluation function by more than an arbitrarily small
2407
+ amount (when the cells are verysmall) because of the continuity assumed for x
2408
+ y. Suppose that P1 x yis the particular system which ;
2409
+
2410
+ ;
2411
+
2412
+ minimizes the rate and gives R1. We choose from the high probability y's a set
2413
+ at random containing 2 R T 1+ members where 0 as T . With large Teach chosen
2414
+ point will be connected by a high probability ! ! line (as in Fig. 10) to a set
2415
+ of x's. A calculation similar to that used in proving Theorem 11 shows that
2416
+ withlarge Talmost all x's are covered by the fans from the chosen ypoints for
2417
+ almost all choices of the y's. Thecommunication system to be used operates as
2418
+ follows: The selected points are assigned binary numbers.When a message xis
2419
+ originated it will (with probability approaching 1 as T ) lie within at least
2420
+ one ! of the fans. The corresponding binary number is transmitted (or one of
2421
+ them chosen arbitrarily if there areseveral) over the channel by suitable
2422
+ coding means to give a small probability of error. Since R1 Cthis is possible.
2423
+ At the receiving point the corresponding yis reconstructed and used as the
2424
+ recovered message. The evaluation v0 for this system can be made arbitrarily
2425
+ close to v 1 1 by taking Tsufficiently large. This is due to the fact that for
2426
+ each long sample of message x tand recovered message y tthe evaluation
2427
+ approaches v1 (with probability 1). It is interesting to note that, in this
2428
+ system, the noise in the recovered message is actually produced by a kind of
2429
+ general quantizing at the transmitter and not produced by the noise in the
2430
+ channel. It is more or lessanalogous to the quantizing noise in PCM. 29. THE
2431
+ CALCULATION OF RATES The definition of the rate is similar in many respects to
2432
+ the definition of channel capacity. In the former Z Z P x y R Min P x ylog ;
2433
+
2434
+ dx dy = ;
2435
+
2436
+ Px y P x P y Z Z with P xand v1 P x y x y dx dyfixed. In the latter = ;
2437
+
2438
+ ;
2439
+
2440
+ Z Z P x y ;
2441
+
2442
+ C Max P x ylog dx dy = ;
2443
+
2444
+ P x P x P y with Px yfixed and possibly one or more other constraints (e.g., an
2445
+ average power limitation) of the form R R K P x y x y dx dy. = ;
2446
+
2447
+ ;
2448
+
2449
+ A partial solution of the general maximizing problem for determining the rate
2450
+ of a source can be given. Using Lagrange's method we consider Z Z P x y ;
2451
+
2452
+ P x ylog P x y x y x P x y dx dy ;
2453
+
2454
+ + ;
2455
+
2456
+ ;
2457
+
2458
+ + ;
2459
+
2460
+ : P x P y 50
2461
+ ===============================================================================
2462
+ The variational equation (when we take the first variation on P x y) leads to ;
2463
+
2464
+ P x y ;
2465
+
2466
+ y x B x e, = where is determined to give the required fidelity and B xis chosen
2467
+ to satisfy Z B x e x y , ;
2468
+
2469
+ dx 1 = : This shows that, with best encoding, the conditional probability of a
2470
+ certain cause for various received y, Py xwill decline exponentially with the
2471
+ distance function x ybetween the xand yin question. ;
2472
+
2473
+ In the special case where the distance function x ydepends only on the (vector)
2474
+ difference between x ;
2475
+
2476
+ and y, x y x y ;
2477
+
2478
+ = , we have Z B x e x y , , dx 1 = : Hence B xis constant, say , and P x y , y
2479
+ x e, = : Unfortunately these formal solutions are difficult to evaluate in
2480
+ particular cases and seem to be of little value.In fact, the actual calculation
2481
+ of rates has been carried out in only a few very simple cases. If the distance
+ of rates has been carried out in only a few very simple cases.
+ If the distance function $\rho(x, y)$ is the mean square discrepancy between x
+ and y and the message ensemble is white noise, the rate can be determined. In
+ that case we have
+   $R = \min \bigl[H(x) - H_y(x)\bigr] = H(x) - \max H_y(x)$
+ with $N = \overline{(x - y)^2}$. But the $\max H_y(x)$ occurs when $y - x$ is
+ a white noise, and is equal to $W_1 \log 2\pi e N$ where $W_1$ is the
+ bandwidth of the message ensemble. Therefore
+   $R = W_1 \log 2\pi e Q - W_1 \log 2\pi e N = W_1 \log \frac{Q}{N}$
+ where Q is the average message power. This proves the following:
+ Theorem 22: The rate for a white noise source of power Q and band $W_1$
+ relative to an R.M.S. measure of fidelity is
+   $R = W_1 \log \frac{Q}{N}$
+ where N is the allowed mean square error between original and recovered
+ messages.
+ More generally with any message source we can obtain inequalities bounding the
+ rate relative to a mean square error criterion.
+ Theorem 23: The rate for any source of band $W_1$ is bounded by
+   $W_1 \log \frac{Q_1}{N} \le R \le W_1 \log \frac{Q}{N}$
+ where Q is the average power of the source, $Q_1$ its entropy power and N the
+ allowed mean square error.
+ The lower bound follows from the fact that the $\max H_y(x)$ for a given
+ $\overline{(x - y)^2} = N$ occurs in the white noise case. The upper bound
+ results if we place points (used in the proof of Theorem 21) not in the best
+ way but at random in a sphere of radius $\sqrt{Q - N}$. 51
+ ===============================================================================
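+ As a closing numeric illustration (an editorial addition, not in Shannon's
+ text), the Ruby lines below evaluate the Theorem 22 rate and the Theorem 23
+ bounds in bits per second; the bandwidth, powers, and allowed error are
+ arbitrary, and the entropy power of the hypothetical non-white source is
+ simply assumed.
+   # R = W1 * log2(Q / N) for a white noise source; Theorem 23 brackets any source.
+   def rate_white_noise_source(w1, q, n)
+     w1 * Math.log2(q / n)
+   end
+   w1 = 3000.0     # source bandwidth, cycles per second
+   q  = 4.0        # average source power
+   n  = 0.25       # allowed mean square error
+   puts rate_white_noise_source(w1, q, n)   # 12,000 bits per second
+   q1 = 3.2        # assumed entropy power of a non-white source, q1 <= q
+   puts w1 * Math.log2(q1 / n)              # Theorem 23 lower bound
+   puts w1 * Math.log2(q / n)               # Theorem 23 upper bound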
+ ACKNOWLEDGMENTS
+
+ The writer is indebted to his colleagues at the Laboratories, particularly to
+ Dr. H. W. Bode, Dr. J. R. Pierce, Dr. B. McMillan, and Dr. B. M. Oliver for
+ many helpful suggestions and criticisms during the course of this work. Credit
+ should also be given to Professor N. Wiener, whose elegant solution of the
+ problems of filtering and prediction of stationary ensembles has considerably
+ influenced the writer's thinking in this field.
+
+ APPENDIX 5
+
+ Let S1
+ be any measurable subset of the gensemble, and S2 the subset of the fensemble
2508
+ which gives S1under the operation T. Then S1 T S2 = : Let H be the operator
2509
+ which shifts all functions in a set by the time . Then HS1 HT S2 T HS2 = =
2510
+ since Tis invariant and therefore commutes with H. Hence if m Sis the
2511
+ probability measure of the set S m HS1 m T HS2 m HS2 = = m S2 m S1 = = where
2512
+ the second equality is by definition of measure in the gspace, the third since
2513
+ the fensemble isstationary, and the last by definition of gmeasure again. To
2514
+ prove that the ergodic property is preserved under invariant operations, let S1
2515
+ be a subset of the g ensemble which is invariant under H, and let S2 be the set
2516
+ of all functions fwhich transform into S1. Then HS1 HT S2 T HS2 S1 = = = so
2517
+ that HS2 is included in S2 for all . Now, since m HS2 m S1 = this implies HS2
2518
+ S2 = for all with m S2 0 1. This contradiction shows that S1 does not exist. 6=
2519
+ ;
2520
+
2521
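+ For instance (the kernel g and the sign convention for the shift are chosen
+ here only for illustration), let T be a fixed linear filter with an absolutely
+ integrable kernel g, and write (H^\lambda f)(t) = f(t + \lambda). Then
+
+     (T f)(t) = \int f(t - \tau) \, g(\tau) \, d\tau
+
+ and shifting the input merely shifts the output,
+
+     (T H^\lambda f)(t) = \int f(t + \lambda - \tau) \, g(\tau) \, d\tau
+                        = (H^\lambda T f)(t),
+
+ which is the commutation with H^\lambda used above.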
+ APPENDIX 6
+
+ The upper bound, \bar{N}_3 \le N_1 + N_2, is due to the fact that the maximum
+ possible entropy for a power N_1 + N_2 occurs when we have a white noise of
+ this power. In this case the entropy power is N_1 + N_2.
+
+ To obtain the lower bound, suppose we have two distributions in n dimensions
+ p(x_i) and q(x_i) with entropy powers \bar{N}_1 and \bar{N}_2. What form
+ should p and q have to minimize the entropy power \bar{N}_3 of their
+ convolution r(x_i):
+
+     r(x_i) = \int p(y_i) \, q(x_i - y_i) \, dy_i.
+
+ The entropy H_3 of r is given by
+
+     H_3 = - \int r(x_i) \log r(x_i) \, dx_i.
+
+ We wish to minimize this subject to the constraints
+
+     H_1 = - \int p(x_i) \log p(x_i) \, dx_i
+     H_2 = - \int q(x_i) \log q(x_i) \, dx_i.
+
+ We consider then
+
+     U = - \int [ r(x) \log r(x) + \lambda p(x) \log p(x)
+                  + \mu q(x) \log q(x) ] \, dx
+
+     \delta U = - \int [ (1 + \log r(x)) \, \delta r(x)
+                  + \lambda (1 + \log p(x)) \, \delta p(x)
+                  + \mu (1 + \log q(x)) \, \delta q(x) ] \, dx.
+
+ If p(x) is varied at a particular argument x_i = s_i, the variation in r(x) is
+
+     \delta r(x) = q(x_i - s_i)
+
+ and
+
+     \delta U = - \int q(x_i - s_i) \log r(x_i) \, dx_i - \lambda \log p(s_i) = 0
+
+ and similarly when q is varied. Hence the conditions for a minimum are
+
+     \int q(x_i - s_i) \log r(x_i) \, dx_i = - \lambda \log p(s_i)
+     \int p(x_i - s_i) \log r(x_i) \, dx_i = - \mu \log q(s_i).
+
+ If we multiply the first by p(s_i) and the second by q(s_i) and integrate with
+ respect to s_i we obtain
+
+     H_3 = - \lambda H_1
+     H_3 = - \mu H_2
+
+ or solving for \lambda and \mu and replacing in the equations
+
+     H_1 \int q(x_i - s_i) \log r(x_i) \, dx_i = H_3 \log p(s_i)
+     H_2 \int p(x_i - s_i) \log r(x_i) \, dx_i = H_3 \log q(s_i).
+
+ Now suppose p(x_i) and q(x_i) are normal
+
+     p(x_i) = \frac{|A_{ij}|^{1/2}}{(2\pi)^{n/2}}
+              \exp\Big( -\tfrac{1}{2} \sum A_{ij} x_i x_j \Big)
+     q(x_i) = \frac{|B_{ij}|^{1/2}}{(2\pi)^{n/2}}
+              \exp\Big( -\tfrac{1}{2} \sum B_{ij} x_i x_j \Big).
+
+ Then r(x_i) will also be normal with quadratic form C_{ij}. If the inverses of
+ these forms are a_{ij}, b_{ij}, c_{ij} then
+
+     c_{ij} = a_{ij} + b_{ij}.
+
+ We wish to show that these functions satisfy the minimizing conditions if and
+ only if a_{ij} = K b_{ij} and thus give the minimum H_3 under the constraints.
+ First we have
+
+     \log r(x_i) = \tfrac{1}{2} \log \frac{|C_{ij}|}{(2\pi)^n}
+                   - \tfrac{1}{2} \sum C_{ij} x_i x_j
+
+     \int q(x_i - s_i) \log r(x_i) \, dx_i
+         = \tfrac{1}{2} \log \frac{|C_{ij}|}{(2\pi)^n}
+           - \tfrac{1}{2} \sum C_{ij} s_i s_j - \tfrac{1}{2} \sum C_{ij} b_{ij}.
+
+ This should equal
+
+     \frac{H_3}{H_1} \Big[ \tfrac{1}{2} \log \frac{|A_{ij}|}{(2\pi)^n}
+                           - \tfrac{1}{2} \sum A_{ij} s_i s_j \Big]
+
+ which requires A_{ij} = \frac{H_1}{H_3} C_{ij}. In this case
+ A_{ij} = \frac{H_1}{H_2} B_{ij} and both equations reduce to identities.
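+ A concrete check of the two bounds (the distributions here are chosen only for
+ illustration): let p and q both be uniform over an interval of length 1
+ centred at the origin. Each has average power N_1 = N_2 = 1/12 and, since its
+ entropy is \log 1 = 0, entropy power \bar{N}_1 = \bar{N}_2 = 1/(2\pi e)
+ \approx 0.059. The convolution r is the triangular distribution on (-1, 1),
+ whose entropy is 1/2 nat, so
+
+     \bar{N}_3 = \frac{e^{2 \cdot \frac{1}{2}}}{2\pi e} = \frac{1}{2\pi}
+               \approx 0.159,
+
+ which lies, as it must, between \bar{N}_1 + \bar{N}_2 \approx 0.117 and
+ N_1 + N_2 = 1/6 \approx 0.167.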
+ ===============================================================================
+ APPENDIX 7
+
+ The following will indicate a more general and more rigorous approach to the
+ central definitions of communication theory. Consider a probability measure
+ space whose elements are ordered pairs (x, y). The variables x, y are to be
+ identified as the possible transmitted and received signals of some long
+ duration T. Let us call the set of all points whose x belongs to a subset S_1
+ of x points the strip over S_1, and similarly the set whose y belong to S_2
+ the strip over S_2. We divide x and y into a collection of non-overlapping
+ measurable subsets X_i and Y_i and approximate the rate of transmission R by
+
+     R_1 = \frac{1}{T} \sum_i P(X_i, Y_i) \log \frac{P(X_i, Y_i)}{P(X_i) P(Y_i)}
+
+ where
+     P(X_i) is the probability measure of the strip over X_i
+     P(Y_i) is the probability measure of the strip over Y_i
+     P(X_i, Y_i) is the probability measure of the intersection of the strips.
+
+ A further subdivision can never decrease R_1. For let X_1 be divided into
+ X_1 = X_1' + X_1'' and let
+
+     P(Y_1) = a           P(X_1) = b + c
+     P(X_1') = b          P(X_1', Y_1) = d
+     P(X_1'') = c         P(X_1'', Y_1) = e
+     P(X_1, Y_1) = d + e.
+
+ Then in the sum we have replaced (for the X_1, Y_1 intersection)
+
+     (d + e) \log \frac{d + e}{a(b + c)}
+     by
+     d \log \frac{d}{ab} + e \log \frac{e}{ac}.
+
+ It is easily shown that with the limitation we have on b, c, d, e,
+
+     \Big( \frac{d + e}{b + c} \Big)^{d + e} \le \frac{d^d e^e}{b^d c^e}
+
+ and consequently the sum is increased. Thus the various possible subdivisions
+ form a directed set, with R monotonic increasing with refinement of the
+ subdivision. We may define R unambiguously as the least upper bound for R_1
+ and write it
+
+     R = \frac{1}{T} \iint P(x,y) \log \frac{P(x,y)}{P(x) P(y)} \, dx \, dy.
+
+ This integral, understood in the above sense, includes both the continuous and
+ discrete cases and of course many others which cannot be represented in either
+ form. It is trivial in this formulation that if x and u are in one-to-one
+ correspondence, the rate from u to y is equal to that from x to y. If v is any
+ function of y (not necessarily with an inverse) then the rate from x to y is
+ greater than or equal to that from x to v since, in the calculation of the
+ approximations, the subdivisions of y are essentially a finer subdivision of
+ those for v. More generally if y and v are related not functionally but
+ statistically, i.e., we have a probability measure space (y, v), then
+ R(x, v) \le R(x, y). This means that any operation applied to the received
+ signal, even though it involves statistical elements, does not increase R.
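+ A small Ruby sketch of the approximation just described (the 4 x 2 joint
+ measure below and the helper name rate_sum are chosen only for illustration;
+ T is taken as 1 second): merging cells of the x partition, i.e. passing to a
+ coarser subdivision, can only lower the discretized sum R_1.
+
+   # R1 = (1/T) * sum_ij P(Xi,Yj) * log( P(Xi,Yj) / (P(Xi)*P(Yj)) )
+   # joint is a matrix: rows are the cells Xi, columns the cells Yj.
+   def rate_sum(joint, t = 1.0)
+     px = joint.map { |row| row.sum }                                   # strip measures P(Xi)
+     py = joint.first.each_index.map { |j| joint.sum { |row| row[j] } } # strip measures P(Yj)
+     r = 0.0
+     joint.each_with_index do |row, i|
+       row.each_with_index do |pxy, j|
+         r += pxy * Math.log(pxy / (px[i] * py[j])) if pxy > 0
+       end
+     end
+     r / t
+   end
+
+   # A fine partition of x (four cells) and the coarser partition obtained by
+   # merging cells 1+2 and 3+4 (adding the corresponding rows).
+   fine   = [[0.15, 0.05], [0.05, 0.15], [0.20, 0.10], [0.10, 0.20]]
+   coarse = [[0.20, 0.20], [0.30, 0.30]]
+
+   puts "R1 (fine partition)   = %.4f nats" % rate_sum(fine)    # ~0.0863
+   puts "R1 (coarse partition) = %.4f nats" % rate_sum(coarse)  # 0.0000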
+ Another notion which should be defined precisely in an abstract formulation of
+ the theory is that of "dimension rate," that is the average number of
+ dimensions required per second to specify a member of an ensemble. In the band
+ limited case 2W numbers per second are sufficient. A general definition can be
+ framed as follows. Let f_\alpha(t) be an ensemble of functions and let
+ \rho_T[f_\alpha(t), f_\beta(t)] be a metric measuring the "distance" from
+ f_\alpha to f_\beta over the time T (for example the R.M.S. discrepancy over
+ this interval). Let N(\varepsilon, \delta, T) be the least number of elements
+ f which can be chosen such that all elements of the ensemble apart from a set
+ of measure \delta are within the distance \varepsilon of at least one of those
+ chosen. Thus we are covering the space to within \varepsilon apart from a set
+ of small measure \delta. We define the dimension rate \lambda for the ensemble
+ by the triple limit
+
+     \lambda = \lim_{\delta \to 0} \lim_{\varepsilon \to 0} \lim_{T \to \infty}
+               \frac{\log N(\varepsilon, \delta, T)}{T \log \varepsilon}.
+
+ This is a generalization of the measure type definitions of dimension in
+ topology, and agrees with the intuitive dimension rate for simple ensembles
+ where the desired result is obvious.
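+ That this recovers the band limited figure quoted above can be seen by a rough
+ count (a heuristic only, with constants suppressed): functions of band W
+ observed for a time T are fixed by about 2WT sample values, so the covering
+ number behaves like N(\varepsilon, \delta, T) \approx (k/\varepsilon)^{2WT}
+ for some constant k, giving
+
+     \log N(\varepsilon, \delta, T) \approx 2WT \log (k/\varepsilon)
+
+ and the ratio in the definition therefore tends, in magnitude, to 2W
+ dimensions per second.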
+ ===============================================================================
+ ************ Document Outline ************
+
+ * A Mathematical Theory of Communication
+ * Introduction
+ * Part I: Discrete Noiseless Systems
+     o The Discrete Noiseless Channel
+     o The Discrete Source of Information
+     o The Series of Approximations to English
+     o Graphical Representations of a Markoff Process
+     o Ergodic and Mixed Sources
+     o Choice, Uncertainty and Entropy
+     o Representation of the Encoding and Decoding Operation
+     o The Fundamental Theorem of a Noiseless Channel
+     o Discussion and Examples
+ * Part II: The Discrete Channel with Noise
+     o Representation of a Noisy Discrete Channel
+     o The Fundamental Theorem for a Discrete Channel with Noise
+     o Discussion
+     o Example of a Discrete Channel and its Capacity
+     o The Channel Capacity in Certain Special Cases
+     o An Example of Efficient Coding
+     o A1. The Growth of the Number of Blocks of Symbols with a Finite State Condition
+     o A2. The Derivation of Entropy
+     o A3. Theorems on Ergodic Sources
+     o A4. Maximizing the Rate for a System of Constraints
+ * Part III: Mathematical Preliminaries
+     o Sets and Ensembles of Functions
+     o Band Limited Ensembles of Functions
+     o Entropy of a Continuous Distribution
+     o Entropy of an Ensemble of Functions
+     o Entropy Loss in Linear Filters
+     o Entropy of a Sum of Two Ensembles
+ * Part IV: The Continuous Channel
+     o The Capacity of a Continuous Channel
+     o Channel Capacity with an Average Power Limitation
+     o The Channel Capacity with a Peak Power Limitation
+ * Part V: The Rate for a Continuous Source
+     o Fidelity Evaluation Functions
+     o The Rate for a Source Relative to a Fidelity Evaluation
+     o The Calculation of Rates
+     o A5
+     o A6
+     o A7
+ ===============================================================================