079project 1.0.0 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,344 +0,0 @@
- ---
- annotations_creators:
- - no-annotation
- language_creators:
- - crowdsourced
- language:
- - en
- license:
- - cc-by-sa-3.0
- - gfdl
- multilinguality:
- - monolingual
- size_categories:
- - 1M<n<10M
- source_datasets:
- - original
- task_categories:
- - text-generation
- - fill-mask
- task_ids:
- - language-modeling
- - masked-language-modeling
- paperswithcode_id: wikitext-2
- pretty_name: WikiText
- dataset_info:
- - config_name: wikitext-103-raw-v1
-   features:
-   - name: text
-     dtype: string
-   splits:
-   - name: test
-     num_bytes: 1305088
-     num_examples: 4358
-   - name: train
-     num_bytes: 546500949
-     num_examples: 1801350
-   - name: validation
-     num_bytes: 1159288
-     num_examples: 3760
-   download_size: 315466397
-   dataset_size: 548965325
- - config_name: wikitext-103-v1
-   features:
-   - name: text
-     dtype: string
-   splits:
-   - name: test
-     num_bytes: 1295575
-     num_examples: 4358
-   - name: train
-     num_bytes: 545141915
-     num_examples: 1801350
-   - name: validation
-     num_bytes: 1154751
-     num_examples: 3760
-   download_size: 313093838
-   dataset_size: 547592241
- - config_name: wikitext-2-raw-v1
-   features:
-   - name: text
-     dtype: string
-   splits:
-   - name: test
-     num_bytes: 1305088
-     num_examples: 4358
-   - name: train
-     num_bytes: 11061717
-     num_examples: 36718
-   - name: validation
-     num_bytes: 1159288
-     num_examples: 3760
-   download_size: 7747362
-   dataset_size: 13526093
- - config_name: wikitext-2-v1
-   features:
-   - name: text
-     dtype: string
-   splits:
-   - name: test
-     num_bytes: 1270947
-     num_examples: 4358
-   - name: train
-     num_bytes: 10918118
-     num_examples: 36718
-   - name: validation
-     num_bytes: 1134123
-     num_examples: 3760
-   download_size: 7371282
-   dataset_size: 13323188
- configs:
- - config_name: wikitext-103-raw-v1
-   data_files:
-   - split: test
-     path: wikitext-103-raw-v1/test-*
-   - split: train
-     path: wikitext-103-raw-v1/train-*
-   - split: validation
-     path: wikitext-103-raw-v1/validation-*
- - config_name: wikitext-103-v1
-   data_files:
-   - split: test
-     path: wikitext-103-v1/test-*
-   - split: train
-     path: wikitext-103-v1/train-*
-   - split: validation
-     path: wikitext-103-v1/validation-*
- - config_name: wikitext-2-raw-v1
-   data_files:
-   - split: test
-     path: wikitext-2-raw-v1/test-*
-   - split: train
-     path: wikitext-2-raw-v1/train-*
-   - split: validation
-     path: wikitext-2-raw-v1/validation-*
- - config_name: wikitext-2-v1
-   data_files:
-   - split: test
-     path: wikitext-2-v1/test-*
-   - split: train
-     path: wikitext-2-v1/train-*
-   - split: validation
-     path: wikitext-2-v1/validation-*
- ---
- 
- # Dataset Card for "wikitext"
- 
- ## Table of Contents
- - [Dataset Description](#dataset-description)
-   - [Dataset Summary](#dataset-summary)
-   - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
-   - [Languages](#languages)
- - [Dataset Structure](#dataset-structure)
-   - [Data Instances](#data-instances)
-   - [Data Fields](#data-fields)
-   - [Data Splits](#data-splits)
- - [Dataset Creation](#dataset-creation)
-   - [Curation Rationale](#curation-rationale)
-   - [Source Data](#source-data)
-   - [Annotations](#annotations)
-   - [Personal and Sensitive Information](#personal-and-sensitive-information)
- - [Considerations for Using the Data](#considerations-for-using-the-data)
-   - [Social Impact of Dataset](#social-impact-of-dataset)
-   - [Discussion of Biases](#discussion-of-biases)
-   - [Other Known Limitations](#other-known-limitations)
- - [Additional Information](#additional-information)
-   - [Dataset Curators](#dataset-curators)
-   - [Licensing Information](#licensing-information)
-   - [Citation Information](#citation-information)
-   - [Contributions](#contributions)
- 
- ## Dataset Description
- 
- - **Homepage:** [https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
- - **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- - **Paper:** [Pointer Sentinel Mixture Models](https://arxiv.org/abs/1609.07843)
- - **Point of Contact:** [Stephen Merity](mailto:smerity@salesforce.com)
- - **Size of downloaded dataset files:** 391.41 MB
- - **Size of the generated dataset:** 1.12 GB
- - **Total amount of disk used:** 1.52 GB
- 
- ### Dataset Summary
- 
- The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified
- Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
- 
- Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over
- 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation
- and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models
- that can take advantage of long term dependencies.
- 
- Each subset comes in two different variants:
- - Raw (for character-level work) contains the raw tokens, before the addition of the <unk> (unknown) tokens.
- - Non-raw (for word-level work) contains only the tokens in its vocabulary (wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens).
-   The out-of-vocabulary tokens have been replaced with the <unk> token.
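- 
- As a quick illustration, a minimal loading sketch: it assumes the dataset is hosted under the Hub ID `Salesforce/wikitext` (as suggested by the mirror note at the end of this repository) and uses the `datasets` library to select one of the configs listed above.
- 
- ```python
- from datasets import load_dataset
- 
- # Pick one of the four configs; "wikitext-2-raw-v1" keeps the raw tokens,
- # while "wikitext-2-v1" is the word-level variant with <unk> replacements.
- dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1")  # assumed Hub ID
- 
- # Every split exposes a single string field named "text".
- print(dataset["train"][0]["text"])
- ```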
- 
- 
- ### Supported Tasks and Leaderboards
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- ### Languages
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- ## Dataset Structure
- 
- ### Data Instances
- 
- #### wikitext-103-raw-v1
- 
- - **Size of downloaded dataset files:** 191.98 MB
- - **Size of the generated dataset:** 549.42 MB
- - **Total amount of disk used:** 741.41 MB
- 
- An example of 'validation' looks as follows.
- ```
- This example was too long and was cropped:
- 
- {
-     "text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..."
- }
- ```
- 
- #### wikitext-103-v1
- 
- - **Size of downloaded dataset files:** 190.23 MB
- - **Size of the generated dataset:** 548.05 MB
- - **Total amount of disk used:** 738.27 MB
- 
- An example of 'train' looks as follows.
- ```
- This example was too long and was cropped:
- 
- {
-     "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
- }
- ```
- 
- #### wikitext-2-raw-v1
- 
- - **Size of downloaded dataset files:** 4.72 MB
- - **Size of the generated dataset:** 13.54 MB
- - **Total amount of disk used:** 18.26 MB
- 
- An example of 'train' looks as follows.
- ```
- This example was too long and was cropped:
- 
- {
-     "text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..."
- }
- ```
- 
- #### wikitext-2-v1
- 
- - **Size of downloaded dataset files:** 4.48 MB
- - **Size of the generated dataset:** 13.34 MB
- - **Total amount of disk used:** 17.82 MB
- 
- An example of 'train' looks as follows.
- ```
- This example was too long and was cropped:
- 
- {
-     "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
- }
- ```
- 
- ### Data Fields
- 
- The data fields are the same among all splits.
- 
- #### wikitext-103-raw-v1
- - `text`: a `string` feature.
- 
- #### wikitext-103-v1
- - `text`: a `string` feature.
- 
- #### wikitext-2-raw-v1
- - `text`: a `string` feature.
- 
- #### wikitext-2-v1
- - `text`: a `string` feature.
- 
- ### Data Splits
- 
- | name | train |validation|test|
- |-------------------|------:|---------:|---:|
- |wikitext-103-raw-v1|1801350| 3760|4358|
- |wikitext-103-v1 |1801350| 3760|4358|
- |wikitext-2-raw-v1 | 36718| 3760|4358|
- |wikitext-2-v1 | 36718| 3760|4358|
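- 
- As a rough sanity check (again assuming the `Salesforce/wikitext` Hub ID), the example counts in the table above can be read back from the loaded `DatasetDict`:
- 
- ```python
- from datasets import load_dataset
- 
- # Load the word-level WikiText-2 config and report the number of rows per split.
- dataset = load_dataset("Salesforce/wikitext", "wikitext-2-v1")  # assumed Hub ID
- print({split: dataset[split].num_rows for split in dataset})
- # Expected (per the table above): train 36718, validation 3760, test 4358.
- ```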
- 
- ## Dataset Creation
- 
- ### Curation Rationale
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- ### Source Data
- 
- #### Initial Data Collection and Normalization
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- #### Who are the source language producers?
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- ### Annotations
- 
- #### Annotation process
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- #### Who are the annotators?
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- ### Personal and Sensitive Information
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- ## Considerations for Using the Data
- 
- ### Social Impact of Dataset
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- ### Discussion of Biases
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- ### Other Known Limitations
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- ## Additional Information
- 
- ### Dataset Curators
- 
- [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- 
- ### Licensing Information
- 
- The dataset is available under the [Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
- 
- ### Citation Information
- 
- ```
- @misc{merity2016pointer,
-     title={Pointer Sentinel Mixture Models},
-     author={Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher},
-     year={2016},
-     eprint={1609.07843},
-     archivePrefix={arXiv},
-     primaryClass={cs.CL}
- }
- ```
- 
- 
- ### Contributions
- 
- Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham) for adding this dataset.
@@ -1 +0,0 @@
- Note: this repository comes from https://hf-mirror.com/datasets/Salesforce/wikitext