@machinespirits/eval 0.2.0 → 0.3.0
- package/README.md +91 -9
- package/config/eval-settings.yaml +3 -3
- package/config/paper-manifest.json +486 -0
- package/config/providers.yaml +9 -6
- package/config/tutor-agents.yaml +2261 -0
- package/content/README.md +23 -0
- package/content/courses/479/course.md +53 -0
- package/content/courses/479/lecture-1.md +361 -0
- package/content/courses/479/lecture-2.md +360 -0
- package/content/courses/479/lecture-3.md +655 -0
- package/content/courses/479/lecture-4.md +530 -0
- package/content/courses/479/lecture-5.md +326 -0
- package/content/courses/479/lecture-6.md +346 -0
- package/content/courses/479/lecture-7.md +326 -0
- package/content/courses/479/lecture-8.md +273 -0
- package/content/courses/479/roadmap-slides.md +656 -0
- package/content/manifest.yaml +8 -0
- package/docs/research/build.sh +44 -20
- package/docs/research/figures/figure10.png +0 -0
- package/docs/research/figures/figure11.png +0 -0
- package/docs/research/figures/figure3.png +0 -0
- package/docs/research/figures/figure4.png +0 -0
- package/docs/research/figures/figure5.png +0 -0
- package/docs/research/figures/figure6.png +0 -0
- package/docs/research/figures/figure7.png +0 -0
- package/docs/research/figures/figure8.png +0 -0
- package/docs/research/figures/figure9.png +0 -0
- package/docs/research/header.tex +23 -2
- package/docs/research/paper-full.md +941 -285
- package/docs/research/paper-short.md +216 -585
- package/docs/research/references.bib +132 -0
- package/docs/research/slides-header.tex +188 -0
- package/docs/research/slides-pptx.md +363 -0
- package/docs/research/slides.md +531 -0
- package/docs/research/style-reference-pptx.py +199 -0
- package/package.json +6 -5
- package/scripts/analyze-eval-results.js +69 -17
- package/scripts/analyze-mechanism-traces.js +763 -0
- package/scripts/analyze-modulation-learning.js +498 -0
- package/scripts/analyze-prosthesis.js +144 -0
- package/scripts/analyze-run.js +264 -79
- package/scripts/assess-transcripts.js +853 -0
- package/scripts/browse-transcripts.js +854 -0
- package/scripts/check-parse-failures.js +73 -0
- package/scripts/code-dialectical-modulation.js +1320 -0
- package/scripts/download-data.sh +55 -0
- package/scripts/eval-cli.js +106 -18
- package/scripts/generate-paper-figures.js +663 -0
- package/scripts/generate-paper-figures.py +577 -76
- package/scripts/generate-paper-tables.js +299 -0
- package/scripts/qualitative-analysis-ai.js +3 -3
- package/scripts/render-sequence-diagram.js +694 -0
- package/scripts/test-latency.js +210 -0
- package/scripts/test-rate-limit.js +95 -0
- package/scripts/test-token-budget.js +332 -0
- package/scripts/validate-paper-manifest.js +670 -0
- package/services/__tests__/evalConfigLoader.test.js +2 -2
- package/services/__tests__/learnerRubricEvaluator.test.js +361 -0
- package/services/__tests__/learnerTutorInteractionEngine.test.js +326 -0
- package/services/evaluationRunner.js +975 -98
- package/services/evaluationStore.js +12 -4
- package/services/learnerTutorInteractionEngine.js +27 -2
- package/services/mockProvider.js +133 -0
- package/services/promptRewriter.js +1471 -5
- package/services/rubricEvaluator.js +55 -2
- package/services/transcriptFormatter.js +675 -0
- package/docs/EVALUATION-VARIABLES.md +0 -589
- package/docs/REPLICATION-PLAN.md +0 -577
- package/scripts/analyze-run.mjs +0 -282
- package/scripts/compare-runs.js +0 -44
- package/scripts/compare-suggestions.js +0 -80
- package/scripts/dig-into-run.js +0 -158
- package/scripts/show-failed-suggestions.js +0 -64
- /package/scripts/{check-run.mjs → check-run.js} +0 -0
|
@@ -0,0 +1,530 @@
|
|
|
1
|
+
## Attention
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. & Polosukhin, I. (2017). [Attention is all you need](https://i-share-uiu.primo.exlibrisgroup.com/discovery/fulldisplay?docid=cdi_proquest_journals_2076493815&context=PC&vid=01CARLI_UIU:CARLI_UIU&search_scope=CentralIndex&tab=CentralIndex&lang=en). *Advances in Neural Information Processing Systems*, 30.
|
|
5
|
+
- Petersen, S. E., & Posner, M. I. (2012). [The attention system of the human brain: 20 years after](https://i-share-uiu.primo.exlibrisgroup.com/discovery/fulldisplay?docid=cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_3413263&context=PC&vid=01CARLI_UIU:CARLI_UIU&search_scope=CentralIndex&tab=CentralIndex&lang=en). *Annual Review of Neuroscience*, 35(1), 73-89.
|
|
6
|
+
- Terranova, T. (2012). [Attention, economy and the brain](https://culturemachine.net/wp-content/uploads/2019/01/465-973-1-PB.pdf). *Culture Machine*, 13.
|
|
7
|
+
|
|
8
|
+
|
|
9
|
+
|
|
10
|
+
```notes
|
|
11
|
+
As we've noted in the week's online guide, this week we are moving both back and forward - back to the earlier moments of consciousness, perception in particular, but also forward to the much more recent developments in both neuro and computer science.
|
|
12
|
+
|
|
13
|
+
What I propose this week is that we examine three key papers that each treat the concept of attention in a specific way. I won't be doing too much here to relate this concept to Hegel's unfolding architecture in *Phenomenology of Spirit* - we'll instead turn to that in the weeks ahead. But you may want to think about how different meanings of attention might be situated with respect to the concepts of experience and recognition we've covered to date.
|
|
14
|
+
|
|
15
|
+
I'll start by looking at the Petersen & Posner [-@stevene.petersen2012theattention] paper, *The Attention System of the Human Brain: 20 Years After*, then Vaswani et al.'s [-@ashishvaswani2017attentionis] *Attention is All You Need* paper, and finally Terranova's [-@tizianaterranova2012attentioneconomy] critique of the Attention Economy. In each case we'll provide a short summary and connect the argument to the wider lecture and course content, then leave time for discussion.
|
|
16
|
+
```
|
|
17
|
+
|
|
18
|
+
|
|
19
|
+
|
|
20
|
+
|
|
21
|
+
---
|
|
22
|
+
|
|
23
|
+
|
|
24
|
+
---
|
|
25
|
+
|
|
26
|
+
### Human Attention
|
|
27
|
+
|
|
28
|
+
|
|
29
|
+
| Attention Subsystem | Part of Brain | Reading Analogy |
|
|
30
|
+
| --- | --- | --- |
|
|
31
|
+
| Alerting | Thalamus; Frontal area | Be *ready* to read |
|
|
32
|
+
| Orienting | Temporoparietal Junction | *Direct* attention to text |
|
|
33
|
+
| Executive: Sustain focus | Cingulo-opercular | *Continue* to concentrate |
|
|
34
|
+
| Executive: Switch focus | Frontoparietal (frontal eye field / superior parietal lobe) | Realize that you are tired and *need a break* |
|
|
35
|
+
|
|
36
|
+
|
|
37
|
+

|
|
38
|
+
|
|
39
|
+
```notes
|
|
40
|
+
|
|
41
|
+
The first paper is an updated version of an earlier text from 1990, also by Posner and Petersen, written at the beginning of the era of "neuroimaging": using tomography, fMRI and EEG machines to monitor brain activity. Posner and Petersen suggest human attention involves three distinct but related subsystems and associated cognitive processes: alerting, orienting and executive (or executive control). We might think of these as becoming aware of something; turning our attention towards that thing; and then making some decision about that thing: is it dangerous, attractive, and so on.
|
|
42
|
+
|
|
43
|
+
I'll briefly talk through these three functions.
|
|
44
|
+
|
|
45
|
+
Alerting involves the initial registration of an external stimulus. They further distinguish two modes of alerting: phasic, or short-term reactions, and tonic, or sustained vigilance. Both their initial paper and this updated one locate the alerting function in the right hemisphere of the brain.
|
|
46
|
+
|
|
47
|
+
Orienting involves a fast directing of attention in response to a stimulus or goal. This might be the instant turning of the head toward a loud sound, or the fixation of the head and eyes on the road ahead while driving in difficult conditions. This happens in the frontal and posterior parts of the brain.
|
|
48
|
+
|
|
49
|
+
Finally, the executive function involves making decisions: to stay fixed upon an object that has gained attention, or to move on. In their revised article Petersen and Posner identify two distinct executive processes: one involving *sustained* focus or attention, another enabling a *switching* of tasks within the same overall attention frame or goal. The prefrontal cortex is responsible for the first of these - task focus – while the frontal eye field and superior parietal lobes together are responsible for task switching.
|
|
50
|
+
|
|
51
|
+
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
|
|
55
|
+
---
|
|
56
|
+
|
|
57
|
+
### Hierarchy and Networks
|
|
58
|
+
|
|
59
|
+
- (a) Connected, Ordered Hierarchy: different parts of the brain are responsible for **different** aspects of attention: alerting, orienting, executive functions (sustaining and switching attention)
|
|
60
|
+
- (b) Semi-redundant Networks: different parts of the brain coordinate on the **same** aspects of attention
|
|
61
|
+
|
|
62
|
+
|
|
63
|
+

|
|
64
|
+
|
|
65
|
+
```notes
|
|
66
|
+
|
|
67
|
+
But what perhaps matters to us here is not so much which areas of the brain are involved in different aspects of attention. Instead we can note that attention involves (a) a connected *hierarchy* of functions or subsystems and (b) a *network* of parts performing the same or similar roles (with some redundancy).
|
|
68
|
+
|
|
69
|
+
As we will see, it is not so much that the use of attention in computer networks follows precisely what happens in our brains. The word "attention" is in some sense metaphorical, just as the term "neural networks" is. However, at the level of architecture we will see some analogies.
|
|
70
|
+
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
---
|
|
74
|
+
|
|
75
|
+
### Attention and Learning
|
|
76
|
+
|
|
77
|
+
|
|
78
|
+

|
|
79
|
+
|
|
80
|
+
```notes
|
|
81
|
+
|
|
82
|
+
Within the context of learning, we can also see how this new attention-oriented work helps to make sense of, in particular, learning *difficulties*. Impairments in different brain regions can result in complex but still localizable – and potentially addressable – limitations in how attention is alerted, directed, sustained and purposefully redirected, as conditions require.
|
|
83
|
+
|
|
84
|
+
Now while this is not a course diving into neuroscientific research, we can note in passing that this kind of neurological or neuroscientific research has had an impact on the theory and practice of pedagogy. A quick Google Scholar search shows, for example, how widely "executive function" has been integrated into pedagogy research: 17,200 results since 2021.
|
|
85
|
+
|
|
86
|
+
```
|
|
87
|
+
|
|
88
|
+
---
|
|
89
|
+
|
|
90
|
+
### Attention and Childhood Development
|
|
91
|
+
|
|
92
|
+
- infants first develop alerting functions (with some orienting and executive ability)
|
|
93
|
+
- children develop mature orienting and basic executive (concentrating) functions
|
|
94
|
+
- adolescents and young adults develop higher order executive (both attention-sustaining and -switching) functions
|
|
95
|
+
|
|
96
|
+
|
|
97
|
+
**References:**
|
|
98
|
+
|
|
99
|
+
<span style="font-size:0.8em;">
|
|
100
|
+
|
|
101
|
+
- Rueda, M. R., & Posner, M. I. (2013). Development of attention networks.
|
|
102
|
+
- Posner, M. I., & Rothbart, M. K. (2007). Research on attention networks as a model for the integration of psychological science. Annual Review of Psychology, 58, 1–23.
|
|
103
|
+
- Boen, R., Ferschmann, L., Vijayakumar, N., Overbye, K., Fjell, A. M., Espeseth, T., & Tamnes, C. K. (2021). Development of attention networks from childhood to young adulthood: A study of performance, intraindividual variability and cortical thickness. Cortex, 138, 138-151.
|
|
104
|
+
|
|
105
|
+
</span>
|
|
106
|
+
|
|
107
|
+
<!--- Rueda, M. R., Posner, M. I., & Rothbart, M. K. (2004). Attentional control and self-regulation in early development. Trends in Cognitive Sciences, 8, 140–147.-->
|
|
108
|
+
|
|
109
|
+
```notes
|
|
110
|
+
|
|
111
|
+
Several of these studies show that the hierarchy of attention mechanisms - alerting to orienting to executing – also corresponds to stages of learning and childhood development. Work by Posner and colleagues in particular has sought to demonstrate that:
|
|
112
|
+
|
|
113
|
+
- **infants** first develop alerting functions (with some orienting and executive ability)
|
|
114
|
+
- **children** develop mature orienting and basic executive (concentrating) functions
|
|
115
|
+
- **adolescents and young adults** develop higher order executive (both attention-sustaining and -switching) functions
|
|
116
|
+
|
|
117
|
+
```
|
|
118
|
+
|
|
119
|
+
---
|
|
120
|
+
|
|
121
|
+
### Neuroscience and Hegel?
|
|
122
|
+
|
|
123
|
+
> Without hesitation, the raw instinct of self-conscious reason will reject such a science of phrenology – as well as reject this other observing instinct of self-conscious reason, which, once it has blossomed into a foreshadowing *of cognition*, has spiritlessly grasped cognition as, “The outer is supposed to be an expression of the inner.” However, the worse the thought is, the less easy it sometimes is to say exactly where its badness lies, and it becomes even more difficult to explicate it. (para 340)
|
|
124
|
+
|
|
125
|
+
```notes
|
|
126
|
+
And to keep concordance with Hegel, we might also note his own strong distrust of the "neuroscience" of his day – a now outdated field called "phrenology", which involved measuring skulls to determine cognitive aptitude. Much later in the *Phenomenology*, he savagely criticises the pseudoscience of phrenology for attempting to account for traits like intelligence based on bumps on the skull. Of course for Hegel, as we have seen, the development of Consciousness and Self-consciousness - and eventually Reason, Spirit and Absolute Knowledge – depends upon an infinitely supple and complex negotiation, both within ourselves and with others. This complex process of development cannot be "read" off the shape or dimensions of the skull. In a phrase that pre-empts where we go next week, Hegel states:
|
|
127
|
+
|
|
128
|
+
> Without hesitation, the raw instinct of self-conscious reason will reject such a science of phrenology – as well as reject this other observing instinct of self-conscious reason, which, once it has blossomed into a foreshadowing *of cognition*, has spiritlessly grasped cognition as, “The outer is supposed to be an expression of the inner.” However, the worse the thought is, the less easy it sometimes is to say exactly where its badness lies, and it becomes even more difficult to explicate it. (para 340)
|
|
129
|
+
|
|
130
|
+
```
|
|
131
|
+
|
|
132
|
+
|
|
133
|
+
---
|
|
134
|
+
|
|
135
|
+
|
|
136
|
+
### Neuroscience: A Modern Phrenology?
|
|
137
|
+
|
|
138
|
+
| | |
|
|
139
|
+
|---|---|
|
|
140
|
+
|  |  |
|
|
141
|
+
|
|
142
|
+
|
|
143
|
+
---
|
|
144
|
+
|
|
145
|
+
### The Gap Between Brain and Mind
|
|
146
|
+
|
|
147
|
+
> The idea of mapping psychological functions to brain structures has a venerable history, dating back to Galen’s ventricular doctrine (Green [2003]) and continuing to Gall’s phrenology (Gall and Spurzheim [1810]). Although those theories are now in disrepute, the advent of neuroimaging techniques, such as positron emission tomography (PET), functional magnetic resonance imaging (fMRI), electro-encephalography (EEG), and magnetoencephalography (MEG), gives the prospect of finding one-to-one correlations between psychological functions and brain structures new vigour, and the project is the main goal of the young field of cognitive neuroscience (Posner and DiGirolamo [2000]). Yet many doubt that cognitive neuroscience can give us such a psychological atlas, whereby the building blocks of mind get assigned to specific neural structures (Uttal [2001], [2011]).
|
|
148
|
+
|
|
149
|
+
#### References:
|
|
150
|
+
|
|
151
|
+
<span style="font-size:0.8em;">
|
|
152
|
+
|
|
153
|
+
- Dobbs, D. (2005). Fact or phrenology?. Scientific American Mind, 16(1), 24-31.
|
|
154
|
+
|
|
155
|
+
- Stea, J. N., Black, T. R., & Di Domenico, S. I. (2022). Phrenology and neuroscience. In *Investigating Pop Psychology* (pp. 9-19). Routledge.
|
|
156
|
+
|
|
157
|
+
</span>
|
|
158
|
+
|
|
159
|
+
|
|
160
|
+
|
|
161
|
+
```notes
|
|
162
|
+
|
|
163
|
+
|
|
164
|
+
While neuroscience involves far more rigorous and detailed methods of investigation into the operations of the brain than phrenology, we can note in passing that it has attracted criticisms quite similar to those levelled by Hegel towards the "neuroscience" of his day. See for example the following quote from a recent book chapter by Stea, Black and Di Domenico, titled appropriately for our purposes "Phrenology and Neuroscience".
|
|
165
|
+
|
|
166
|
+
Despite the advances in science, for many today there remains a distinct gap between brain and mind, or the biological processing of signals and the rich descriptions of consciousness we get from philosophy, literature, art and religion.
|
|
167
|
+
|
|
168
|
+
How we understand this gap also affects our interpretation of the potential for machines to develop consciousness - our topic for next week. But for now, we need to look at how the concept of attention also applies to machine learning.
|
|
169
|
+
```
|
|
170
|
+
|
|
171
|
+
---
|
|
172
|
+
|
|
173
|
+
### Machine Attention
|
|
174
|
+
|
|
175
|
+
|
|
176
|
+
- Pre-2017: Recurrent, convolutional networks (RNNs, CNNs)
|
|
177
|
+
- Problems: Recurrent networks degrade as the length of text grows
|
|
178
|
+
- Attention already used in combination with architectures
|
|
179
|
+
- But is Attention All You Need? The Vaswani et al. (2017) Transformers paper
|
|
180
|
+
- Implementation by OpenAI: *Improving Language Understanding by Generative Pre-Training* (Radford et al. 2018): **GPT-1**.
|
|
181
|
+
|
|
182
|
+
```notes
|
|
183
|
+
Vaswani et al.'s 2017 paper is a landmark in machine learning. Perhaps the most cited paper this century, this work by Google scholars was first actually implemented, not by Google, but by a young start-up company, OpenAI. There is an entire story of intrigue about how OpenAI was founded – with seed funding from Elon Musk – and eventually caught sight of this 2017 paper, understood its potential, and developed something called a "Generative Pre-Training" model
|
|
184
|
+
(*Improving Language Understanding by Generative Pre-Training*)[-@alecradford2018improvinglanguage].
|
|
185
|
+
|
|
186
|
+
Now we don't have time or opportunity to fully talk through this paper and its technical details. We would need to venture too far into the history of neural networks and their application to language modelling. But we can say that *prior* to this paper, the state-of-the-art models were using recurrent or convolutional networks, sometimes with attention mechanisms built in.
|
|
187
|
+
```
|
|
188
|
+
|
|
189
|
+
---
|
|
190
|
+
|
|
191
|
+
### From linear to grid representations
|
|
192
|
+
|
|
193
|
+
|
|
194
|
+
|
|
195
|
+
|
|
196
|
+
<!-- | RNN | Transformer |
|
|
197
|
+
| --- | --- |
|
|
198
|
+
|  |  | -->
|
|
199
|
+
|
|
200
|
+
|
|
201
|
+

|
|
202
|
+
|
|
203
|
+
|
|
204
|
+
|
|
205
|
+
```notes
|
|
206
|
+
|
|
207
|
+
In short the problem with these systems was the need to maintain an ever-growing set of connections between the next token we would like to predict and the tokens or words that preceded it.
|
|
208
|
+
|
|
209
|
+
I've used two heatmaps generated by GPT-5 to convey the general idea: RNNs process tokens in a linear way, from left to right, just as we read. But we quickly develop a long set of connections between tokens.
|
|
210
|
+
|
|
211
|
+
Instead transformers require every token in a sentence or sequence to be related to every other - no matter how far apart they are in the sequence. This reduces the *time* involved in processing data, at the cost of increased training time and model space - but these are (comparatively) cheap.
|
|
212
|
+
|
|
213
|
+
Shortly we'll do a thought experiment that will hopefully make this more clear.
|
|
214
|
+
|
|
215
|
+
```
|
|
216
|
+
|
|
217
|
+
|
|
218
|
+
---
|
|
219
|
+
|
|
220
|
+
### Some general terminology...
|
|
221
|
+
|
|
222
|
+
#### Tokens vs Words
|
|
223
|
+
|
|
224
|
+
**Tokens** are word-like pieces of data that are the foundational primitives of language models. Why not words? There are many words in natural language, but they are often built from commonly recurring parts (e.g. morphemes, prefixes, suffixes: 'un-', '-ing'). Roughly **4** tokens per **3** words.
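
As a rough illustration of why sub-word pieces are useful, here is a toy greedy longest-match tokenizer. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are invented for this sketch; real tokenizers learn their vocabularies from data and will split text differently.

```python
# Hypothetical toy vocabulary: a few whole words plus common prefixes/suffixes.
TOY_VOCAB = {"un", "happi", "ness", "read", "ing", "the", "cat"}

def toy_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split of one word into known pieces.

    Characters that match no vocabulary entry become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No piece matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

print(toy_tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(toy_tokenize("reading"))      # ['read', 'ing']
```

A small set of reusable pieces can cover many words, which is one reason a text usually has somewhat more tokens than words.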
|
|
225
|
+
|
|
226
|
+
#### Training vs Inference:
|
|
227
|
+
|
|
228
|
+
**Training** involves building a large language model by processing large amounts of textual (or other) data, to build up relationships between tokens.
|
|
229
|
+
|
|
230
|
+
**Inference** applies the trained model to a particular sequence, e.g. when you ask a question of ChatGPT.
|
|
231
|
+
|
|
232
|
+
|
|
233
|
+
---
|
|
234
|
+
|
|
235
|
+
### Terminology continued...
|
|
236
|
+
|
|
237
|
+
#### Neural Network, Layers, Weights and Biases
|
|
238
|
+
|
|
239
|
+
Describes the structure of a language model: has **interconnected** nodes organized in **layers**. Each layer contains (typically) a **weight** matrix and **bias** vector.
|
|
240
|
+
|
|
241
|
+
|
|
242
|
+
Much of the difference between models involves details about the architecture of a network and its layers; how much the weights and biases are trained; and what data is used in training.
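
To make "weight matrix and bias vector" concrete, here is a minimal sketch of a single layer. The function name `dense_layer`, the example numbers, and the choice of ReLU as the activation are all assumptions made for illustration; the slide does not specify any of them.

```python
def dense_layer(inputs, weights, biases):
    """One layer: each output node is a weighted sum of the inputs plus a bias.

    ReLU (clip negatives to zero) is used as the activation; that choice is an
    assumption for this sketch.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, total))
    return outputs

# Made-up numbers: 2 inputs feeding 2 output nodes.
weights = [[1.0, -1.0],   # weights into output node 0
           [0.5, 0.5]]    # weights into output node 1
biases = [0.0, 1.0]

print(dense_layer([2.0, 1.0], weights, biases))  # [1.0, 2.5]
```

Stacking several such layers, each feeding the next, gives the layered network the slide describes.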
|
|
243
|
+
|
|
244
|
+
#### Feed forward / Backpropagation
|
|
245
|
+
|
|
246
|
+
1. First we estimate **weights** and **biases**, and generate a **loss**, measuring their accuracy.
|
|
247
|
+
2. Second we use **calculus** to update all the weights
|
|
248
|
+
3. Rinse and repeat, until loss is minimized.
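
The three steps above can be sketched for the simplest possible model, one weight and one bias. The data, learning rate and step count are assumptions chosen for illustration. For a network with many layers, backpropagation applies the chain rule to get the same kind of derivatives layer by layer.

```python
# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def mean_squared_loss(w, b):
    """Step 1: run the model forward and measure how wrong it is."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(w, b):
    """Step 2: the calculus. Partial derivatives of the loss w.r.t. w and b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

w, b = 0.0, 0.0           # initial guess for weight and bias
learning_rate = 0.05      # hypothetical step size
for step in range(2000):  # Step 3: rinse and repeat
    dw, db = gradients(w, b)
    w -= learning_rate * dw
    b -= learning_rate * db

print(round(w, 3), round(b, 3))  # 2.0 1.0
```

The loop recovers the weight and bias that generated the data, which is what "loss is minimized" means in this toy case.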
|
|
249
|
+
|
|
250
|
+
|
|
251
|
+
---
|
|
252
|
+
|
|
253
|
+
### Is AI just fancy statistics?
|
|
254
|
+
|
|
255
|
+
- Core Intuition: A **language model** approximates a **function** (with many parameters).
|
|
256
|
+
|
|
257
|
+

|
|
258
|
+
|
|
259
|
+
Typical multiple regression - a simplified 1-layer neural network:
|
|
260
|
+
|
|
261
|
+
$$
|
|
262
|
+
\begin{align}
|
|
263
|
+
y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,
|
|
264
|
+
\quad i = 1, \dots, n
|
|
265
|
+
\end{align}
|
|
266
|
+
$$
|
|
267
|
+
|
|
268
|
+
So **training** is the attempt to produce an (ever more) accurate approximation of this function.
|
|
269
|
+
|
|
270
|
+
---
|
|
271
|
+
|
|
272
|
+
### Attention is Three Matrices: Queries, Keys, Values
|
|
273
|
+
|
|
274
|
+
- Queries: what each token is **asking for** from other tokens (in training or inference)
|
|
275
|
+
- Keys: the **relative match** of each token to this token
|
|
276
|
+
- Values: what the token **represents** (e.g. in syntactic or semantic terms)
|
|
277
|
+
|
|
278
|
+
This process is what is meant by **attention**.
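
A minimal sketch of the query/key/value idea, following the scaled dot-product form in Vaswani et al. (2017): compare a query with every key, normalise the scores with softmax, and blend the values accordingly. The two-dimensional vectors are made up; in a real model the queries, keys and values come from learned projections, which are omitted here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Compare the query with every key (dot product, scaled by sqrt(d)).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Blend the value vectors according to the attention weights.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]
    return weights, context

# Made-up 2-dimensional vectors for three tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 1.0]

weights, context = attend(query, keys, values)
print(weights)  # the third key matches the query best, so it gets the largest weight
print(context)
```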
|
|
279
|
+
|
|
280
|
+
|
|
281
|
+
---
|
|
282
|
+
|
|
283
|
+
### Thought Experiment
|
|
284
|
+
|
|
285
|
+
|
|
286
|
+
> The cat sat on <span style="color: red;">the</span>...
|
|
287
|
+
|
|
288
|
+
```notes
|
|
289
|
+
Let's work through the following thought experiment. We will imagine we have the following unfinished sentence, and we'll focus on the last word, `the`:
|
|
290
|
+
```
|
|
291
|
+
|
|
292
|
+
|
|
293
|
+
|
|
294
|
+
---
|
|
295
|
+
|
|
296
|
+
### Sympathy for the Machine...
|
|
297
|
+
|
|
298
|
+
|
|
299
|
+
| Query | Key | Weight |
|
|
300
|
+
|-------|-----|--------|
|
|
301
|
+
| the | The | 0.05** |
|
|
302
|
+
| the | cat | 0.15 |
|
|
303
|
+
| the | sat | 0.25 |
|
|
304
|
+
| the | on | 0.55 |
|
|
305
|
+
|
|
306
|
+
|
|
307
|
+
** Because `the` rarely follows `the`!
|
|
308
|
+
|
|
309
|
+
What does this set of probabilities refer to? The relevance of the other tokens in the sentence to the word `the`, or how much `the` *attends* to each of them.
|
|
310
|
+
|
|
311
|
+
|
|
312
|
+
```notes
|
|
313
|
+
Now let's think about this from the machine's point of view, during inference rather than training. We saw from preceding discussion that this word `the` maintains a place in three lists of words or tokens:
|
|
314
|
+
|
|
315
|
+
- Query
|
|
316
|
+
- Key
|
|
317
|
+
- Value
|
|
318
|
+
|
|
319
|
+
We are considering the second `the` as our *query* word, and we want to know what word should follow. We first compare it with every other word in the sentence, checking against their *key* values. This generates an initial set of probabilities, for which I'll just use some example values:
|
|
320
|
+
|
|
321
|
+
|
|
322
|
+
|
|
323
|
+
| Query | Key | Weight |
|
|
324
|
+
|-------|-----|--------|
|
|
325
|
+
| the | The | 0.05** |
|
|
326
|
+
| the | cat | 0.15 |
|
|
327
|
+
| the | sat | 0.25 |
|
|
328
|
+
| the | on | 0.55 |
|
|
329
|
+
|
|
330
|
+
|
|
331
|
+
** Because `the` rarely follows `the`!
|
|
332
|
+
|
|
333
|
+
What does this set of probabilities refer to? The relevance of the other tokens in the sentence to the word `the`, or how much `the` *attends* to each of them. We are primed, in other words, more strongly in favour of 'on the' than anything else.
|
|
334
|
+
|
|
335
|
+
```
|
|
336
|
+
|
|
337
|
+
|
|
338
|
+
---
|
|
339
|
+
|
|
340
|
+
### From Attention to Context
|
|
341
|
+
|
|
342
|
+

|
|
343
|
+
|
|
344
|
+
```notes
|
|
345
|
+
|
|
346
|
+
Now each of these tokens – *the*, *cat* etc – also contains a set of numbers relating to their *values*. The values are – if you like – the semantic space of the word: the **cattiness** of the 'cat' (noun, animal, furry, etc); the **sittingness** of the 'sat' (verb, temporal, positional, etc); the **on-ness** of the 'on' (preposition, relational term); the **the-ness** of the 'the' (definite article, applies to nouns, connected to preposition). But these values are also tied to the context of the current sentence.
|
|
347
|
+
|
|
348
|
+
So once we have a sense of relative attention – how the 'the' relates to other words in the sentence – we combine the values, which we can think of as a hybrid syntactico-semantic representation, with these attention weights.
|
|
349
|
+
|
|
350
|
+
This produces a **context** that governs prediction.
|
|
351
|
+
|
|
352
|
+
|
|
353
|
+
```
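
The combination step can be sketched with the example weights from the thought experiment. The three-number value vectors are hypothetical stand-ins for the much longer vectors a real model learns; each position is given a made-up meaning purely for readability.

```python
# Attention weights from the example table (query: the second 'the').
weights = {"The": 0.05, "cat": 0.15, "sat": 0.25, "on": 0.55}

# Hypothetical 3-number value vectors: [noun-ness, verb-ness, preposition-ness].
values = {
    "The": [0.0, 0.0, 0.0],
    "cat": [1.0, 0.0, 0.0],
    "sat": [0.0, 1.0, 0.0],
    "on":  [0.0, 0.0, 1.0],
}

# The context is the weighted sum of the value vectors.
context = [
    sum(weights[tok] * values[tok][i] for tok in weights)
    for i in range(3)
]
print([round(c, 2) for c in context])  # [0.15, 0.25, 0.55]
```

In this toy setup the context for 'the' is dominated by preposition-ness, mirroring the strong weight on 'on'.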
|
|
354
|
+
|
|
355
|
+
|
|
356
|
+
---
|
|
357
|
+
|
|
358
|
+
### Iterated Context: From tokens to quasi-phrases / sentences
|
|
359
|
+
|
|
360
|
+
|
|
361
|
+
- 'the' is no longer just a word or token
|
|
362
|
+
- its 'the'-ness becomes a kind of 0.55 * 'on' + 0.25 * 'sat' etc.
|
|
363
|
+
- All of these influences, derived from repeated attention, make 'mat' a more likely continuation.
|
|
364
|
+
|
|
365
|
+
```notes
|
|
366
|
+
|
|
367
|
+
This process is repeated over several or many layers of a network. At each layer we develop a richer representation of this context for each token. The mathematical representation of the 'the' we are looking at accumulates the influence of the other tokens it attends to - and so do these other tokens themselves. These ultimately help to narrow the scope – or increase the bias - toward particular tokens such as 'mat'.
|
|
368
|
+
|
|
369
|
+
The 'the'-ness becomes a highly specific and contextualized 'the'-ness that is paired strongly with a prepositional phrase; is associated with a spatio-temporal situation - one of sitting; and is (less strongly) influenced by an agent. None of these roles are hard-coded; they are learned by the network. But at the same time they resemble the rules of grammar and meaning we are used to.
|
|
370
|
+
|
|
371
|
+
All of these influences, derived from repeated attention, make 'mat' a likely continuation (from within the wider set of the model's vocabulary).
|
|
372
|
+
|
|
373
|
+
```
|
|
374
|
+
|
|
375
|
+
---
|
|
376
|
+
|
|
377
|
+
### From Context to Prediction
|
|
378
|
+
|
|
379
|
+
|
|
380
|
+
| Token | Probability |
|
|
381
|
+
|-------|-------------|
|
|
382
|
+
| the | 0.01 |
|
|
383
|
+
| cat | 0.01 |
|
|
384
|
+
| sat | 0.01 |
|
|
385
|
+
| on | 0.01 |
|
|
386
|
+
| mat | **0.80** |
|
|
387
|
+
| dog | 0.08 |
|
|
388
|
+
| house | 0.08 |
|
|
389
|
+
|
|
390
|
+
|
|
391
|
+
```notes
|
|
392
|
+
This learned context is applied to every item in our initial vocabulary. That would include all the words we have used, plus (in a toy example), other nouns like 'mat', 'dog', 'house'. This produces a final set of probabilities:
|
|
393
|
+
|
|
394
|
+
|
|
395
|
+
And finally: we roll a virtual die, and produce a prediction. In this toy example, 80 per cent of the time the predicted completion of the sentence will be 'mat'.
|
|
396
|
+
|
|
397
|
+
Note that this 80% of the time - not 100% - is why LLMs are often described as 'probabilistic', 'stochastic' and 'non-deterministic'.
|
|
398
|
+
|
|
399
|
+
As a further note: You might also imagine all of this computation gets expensive for (a) large vocabularies (like multiple human languages) and (b) long contexts (like novels). That is true! And why companies like Nvidia and TSMC have such extreme valuations today - to train and do inference on attention-based mechanisms involves hardware investments in the order of tens or hundreds of billions of dollars today.
|
|
400
|
+
|
|
401
|
+
```
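
The "virtual die" can be sketched with Python's standard `random` module, using the probabilities from the toy table. The helper name `sample_next_token` and the fixed seed are assumptions for this sketch; production systems add further controls that are not covered here.

```python
import random

# Next-token probabilities from the toy table.
probs = {"the": 0.01, "cat": 0.01, "sat": 0.01, "on": 0.01,
         "mat": 0.80, "dog": 0.08, "house": 0.08}

def sample_next_token(probs, rng):
    """Draw one token at random, in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_next_token(probs, rng) for _ in range(10_000)]
share_mat = draws.count("mat") / len(draws)
print(f"'mat' chosen in {share_mat:.1%} of draws")

# Greedy decoding, by contrast, always picks the single most likely token.
print(max(probs, key=probs.get))  # mat
```

Over many draws 'mat' comes up roughly four times in five, while 'dog' and 'house' still appear occasionally. That residual variation is the non-determinism the notes describe.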
|
|
402
|
+
|
|
403
|
+
---
|
|
404
|
+
|
|
405
|
+
### What about human attention?
|
|
406
|
+
|
|
407
|
+
Let's continue now with a further rough experiment.
|
|
408
|
+
|
|
409
|
+
Start by *attending to* the following words I say:
|
|
410
|
+
|
|
411
|
+
```
|
|
412
|
+
The cat sat on the...
|
|
413
|
+
```
|
|
414
|
+
|
|
415
|
+
---
|
|
416
|
+
|
|
417
|
+
|
|
418
|
+
Now when I said:
|
|
419
|
+
|
|
420
|
+
> Start by *attending to* the following words I say:
|
|
421
|
+
|
|
422
|
+
You are *alerted*. You have to shift from an everyday state to an alerted one, maybe by the fact that I've issued an imperative: 'Start'.
|
|
423
|
+
|
|
424
|
+
What follows is your orientation to what actually does *follow* from the words 'the following words I say'. You are oriented towards the completion:
|
|
425
|
+
|
|
426
|
+
```
|
|
427
|
+
The cat sat on the...
|
|
428
|
+
```
|
|
429
|
+
|
|
430
|
+
|
|
431
|
+
---
|
|
432
|
+
|
|
433
|
+
### Executive Functions and Metacognition
|
|
434
|
+
|
|
435
|
+
Do you:
|
|
436
|
+
|
|
437
|
+
1. Complete the sentence?
|
|
438
|
+
2. Keep listening to what I say?
|
|
439
|
+
|
|
440
|
+
```notes
|
|
441
|
+
But then you are also possibly primed to the visual and audible *incompleteness*. The sentence doesn't end, instead your lecturer continues on with his exposition. The executive functions need to *decide*. What do you do? Do you:
|
|
442
|
+
|
|
443
|
+
1. Complete the sentence?
|
|
444
|
+
2. Keep listening to what I say?
|
|
445
|
+
|
|
446
|
+
Or both? Because this is a trivial case, you can complete the sentence very fast, and I'm not speaking too fast. Or do you *resist* the completion – ignoring my command altogether, or completing the sentence with another word?
|
|
447
|
+
```
|
|
448
|
+
|
|
449
|
+
---
|
|
450
|
+
|
|
451
|
+
### Questions on human attention
|
|
452
|
+
|
|
453
|
+
Think for a moment about this final activity. Is **your** completion different to the **LLM**? How much of this - pointing ahead to next week's topic – is **conscious** or **unconscious**? Do you – like the Transformer model – draft a list of candidates, and pick the most **likely**? Is there a kind of metacognitive aspect that enables you to determine to **sustain** or **switch** your attention? Can you **refuse** to complete what you ought to - what your training suggests?

---

### Make Content, Get Attention... Profit?

![Make Content, Get Attention... Profit?](/courses/479/images/1-4-attention-economy.png)

```notes
Turning now to Terranova's article, we come to the idea that attention is a kind of *commodity* and even *capital*, marked – like all commodities – by scarcity. It is an object that in itself warrants the *attention* of capital, of investors and advertisers, in the context of digital media. This is of course not new: Edward Bernays, the nephew of Sigmund Freud, pioneered many uses of what were then, in the early to mid twentieth century, new media, such as radio, magazines, film and television. But with the maturation of computers, the Internet, smartphones, social media and, today, AI, we come to a point at which we see attention as corroded or "degraded" by information. There is so much information, in other words, that human attentive processes become saturated, barely able to keep up.

Terranova argues, citing Nicholas Carr, Catherine Malabou, Jonathan Crary and others, that precisely the kind of neuroscientific research we discussed earlier makes possible a new corresponding *industrialization* of attention. As sophisticated techniques for securing attention (at the alerting and orienting levels) are developed, the higher-order "executive functions" also seem to be disrupted. In particular, the ability to "switch" is impaired – we find ourselves staring at the screen long past the point at which we intended to stop, when we initially and intentionally sought distraction.
```

---

### Attention and Imitation

![Attention and Imitation](/courses/479/images/1-5-attention-imitation.png)

```notes
In a turn that also reminds us of our discussion of Hegel and the *social* process of learning, Terranova then discusses how attention to digital media leads, through social imitation, to another way of bypassing the deeper attention marked by executive function.

But this need not be entirely negative. Here Terranova turns to another Italian theorist, Lazzarato, and his treatment of attention as the condition of social labour – and therefore a positive and productive force.

But Terranova's discussion takes a negative turn again, through the work of Bernard Stiegler. Stiegler – a French philosopher writing on technology since the 1990s – famously argued that contemporary technologies short-circuit important cognitive processes of memory and social processes of communication, resulting, as he put it, in a grave risk of "proletarianization". Primal psychic and libidinal energy is put into service, in this analysis, to create value for companies that can direct our collective attention via "social technologies" and "new forms of social relations".
```

---

### Cooperation or Proletarianization?

![Cooperation or Proletarianization?](/courses/479/images/1-6-cooperation-or-proletariatization.png)


```notes
Drawing together Lazzarato's and Stiegler's arguments, Terranova claims that – despite the very different valences or attitudes each brings to his analysis – both authors see attention not simply as a human reserve that technologies can only degrade. Rather, those technologies redirect attention, which in turn makes possible new kinds of subjects and social relations. For Lazzarato, technology actually makes humans cooperate in ways that can resemble the internal structure of an individual brain. For Stiegler, technology is similarly integral to all human cognitive and social activity – but in its current form (the Internet, social media, and the general capitalization of attention and associated "libidinal" energies), it is tending toward the production of a simplified, proletarianized and even stupefied society.
```


---


### Is Attention a Design Problem?

![Is Attention a Design Problem?](/courses/479/images/1-7-attention-design-problem.png)

```notes
The reason for including Terranova's analysis – aside from its wide-ranging survey of recent debates – is that in a certain sense it elaborates upon Hegel's insistence that self-consciousness and learning are essentially *social* in nature. Indeed both Lazzarato's and Stiegler's positions, which Terranova surveys, can be seen as extensions of Hegel's insight, though adjusted for the dramatic effects wrought by informatic technologies.

The individual human subject is affected by what others say and do, and digital technologies act like a concentrating device for those social habits. Let's exaggerate: every tweet, post or TikTok we read or watch acts like a small encounter between two self-consciousnesses, which must resolve itself into a micro master/servant dialectic. Do we like the content, do we stay engaged with it – are we, in other words, a servant to it? Or do we criticize, disengage and ultimately walk away? Is our self-regulation of our own attention also a method of self-mastery that resists servitude to others? Or are these attention-grabbing technologies too powerful for self-regulation, and do we need to treat attention management as a collective design problem?

And where does this then bring us with respect to a technology that arguably exceeds what Terranova, Lazzarato and Stiegler could ever have anticipated in terms of its potential capture of human attention – precisely via application of its own "attention" mechanisms?
Hansen argues that as we enter the era of machine learning, platforms will increasingly predict, and thereby control, even more fundamental processes than our attention: our conscious thinking itself.
```



---


### Synthesizing Human and Machine Attention?


![Synthesizing Human and Machine Attention?](/courses/479/images/1-8-synthesizing-attention.png)

Posner on the role of attention:
- https://www.youtube.com/watch?v=PKzz1OAiTRQ
- https://www.youtube.com/watch?v=uYUdwS7-WvA

```notes
According to many neuroscientists, attention is critical to the operations of consciousness. In recent discussions, neuroscientists like Posner have also emphasized the experimental and social nature of attention formation, even in infants as they shape their alerting, orienting and executive faculties. Surprisingly, neuroscience may not be so far removed from Hegel's speculations on the nature of consciousness.

Next week we focus on the concept of consciousness, bringing Hegel's ideas on consciousness and self-consciousness closer together with other theories. We'll see how some scholars, like N. Katherine Hayles, have sought to combine research into both human cognition and machine learning with more traditional philosophical concerns about the nature of consciousness. We will revisit attention, but also consider ideas of the "unconscious" – developed originally by Freud, but surprisingly relevant in the world of machine learning too – as well as Hayles's work on what she terms "nonconscious cognition", operating in the world of machines.
```
|
|
530
|
+
|