ruby-spacy 0.1.5.4 → 0.2.2
- checksums.yaml +4 -4
- data/.gitignore +5 -0
- data/CHANGELOG.md +9 -0
- data/Gemfile +2 -1
- data/README.md +169 -3
- data/examples/get_started/lexeme.rb +3 -0
- data/examples/get_started/linguistic_annotations.rb +3 -0
- data/examples/get_started/morphology.rb +3 -0
- data/examples/get_started/most_similar.rb +3 -0
- data/examples/get_started/named_entities.rb +3 -0
- data/examples/get_started/outputs/test_dep.svg +0 -0
- data/examples/get_started/outputs/test_dep_compact.svg +0 -0
- data/examples/get_started/outputs/test_ent.html +0 -0
- data/examples/get_started/pos_tags_and_dependencies.rb +3 -0
- data/examples/get_started/similarity.rb +3 -0
- data/examples/get_started/tokenization.rb +3 -0
- data/examples/get_started/visualizing_dependencies.rb +3 -0
- data/examples/get_started/visualizing_dependencies_compact.rb +3 -0
- data/examples/get_started/visualizing_named_entities.rb +3 -0
- data/examples/get_started/vocab.rb +3 -0
- data/examples/get_started/word_vectors.rb +3 -0
- data/examples/japanese/ancestors.rb +3 -0
- data/examples/japanese/entity_annotations_and_labels.rb +3 -0
- data/examples/japanese/information_extraction.rb +3 -0
- data/examples/japanese/lemmatization.rb +3 -0
- data/examples/japanese/most_similar.rb +3 -0
- data/examples/japanese/named_entity_recognition.rb +3 -0
- data/examples/japanese/navigating_parse_tree.rb +3 -0
- data/examples/japanese/noun_chunks.rb +3 -0
- data/examples/japanese/outputs/test_dep.svg +0 -0
- data/examples/japanese/outputs/test_ent.html +0 -0
- data/examples/japanese/pos_tagging.rb +3 -0
- data/examples/japanese/sentence_segmentation.rb +3 -0
- data/examples/japanese/similarity.rb +3 -0
- data/examples/japanese/tokenization.rb +3 -0
- data/examples/japanese/visualizing_dependencies.rb +3 -0
- data/examples/japanese/visualizing_named_entities.rb +3 -0
- data/examples/linguistic_features/ancestors.rb +3 -0
- data/examples/linguistic_features/entity_annotations_and_labels.rb +3 -0
- data/examples/linguistic_features/finding_a_verb_with_a_subject.rb +3 -0
- data/examples/linguistic_features/information_extraction.rb +3 -0
- data/examples/linguistic_features/iterating_children.rb +4 -1
- data/examples/linguistic_features/iterating_lefts_and_rights.rb +3 -0
- data/examples/linguistic_features/lemmatization.rb +3 -0
- data/examples/linguistic_features/named_entity_recognition.rb +3 -0
- data/examples/linguistic_features/navigating_parse_tree.rb +3 -0
- data/examples/linguistic_features/noun_chunks.rb +3 -0
- data/examples/linguistic_features/outputs/test_ent.html +0 -0
- data/examples/linguistic_features/pos_tagging.rb +3 -0
- data/examples/linguistic_features/retokenize_1.rb +3 -0
- data/examples/linguistic_features/retokenize_2.rb +3 -0
- data/examples/linguistic_features/rule_based_morphology.rb +3 -0
- data/examples/linguistic_features/sentence_segmentation.rb +3 -0
- data/examples/linguistic_features/similarity.rb +3 -0
- data/examples/linguistic_features/similarity_between_lexemes.rb +3 -0
- data/examples/linguistic_features/similarity_between_spans.rb +3 -0
- data/examples/linguistic_features/tokenization.rb +3 -0
- data/examples/openai_integration/openai_completion.rb +19 -0
- data/examples/openai_integration/openai_embeddings.rb +22 -0
- data/examples/openai_integration/openai_query_1.rb +20 -0
- data/examples/openai_integration/openai_query_2.rb +32 -0
- data/examples/openai_integration/openai_query_3.rb +74 -0
- data/examples/openai_integration/openai_query_4.rb +39 -0
- data/examples/rule_based_matching/creating_spans_from_matches.rb +3 -0
- data/examples/rule_based_matching/matcher.rb +3 -0
- data/lib/ruby-spacy/version.rb +1 -1
- data/lib/ruby-spacy.rb +139 -2
- data/ruby-spacy.gemspec +2 -1
- metadata +24 -8
- data/.python-version +0 -1
- data/.rubocop.yml +0 -48
- data/.solargraph.yml +0 -22
- data/.yardopts +0 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9e9edb55398e8926b4fd9c06d65b49538129e34a960098b1ad20535d64a2b787
+  data.tar.gz: 639fa3186d563480d0eb268fa948d8b97428fcdf37887dfff64954fcfc86c1f0
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 74367e0cd67a3537b20f73427baf626ada1f123d9c34da1a55795a905c3cfd8239c5cc1a04e6cf92c8312c6338a6300ce95b837d24c642c3dbb77733a25060ed
+  data.tar.gz: 4723555e09a6416ec8cb5727b3344756be36a26f65429759508aaee697b245960065cb74699b7f619d552a67b574dbc70da1b53a0ddc434ade45901b0ca72dd7
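The `checksums.yaml` entries above are hex digests of the two archives packed inside a `.gem` file (`metadata.gz` and `data.tar.gz`). As a minimal sketch, and not RubyGems' own verification code, the snippet below shows how such digests can be computed and compared with Ruby's standard `digest` library. The helper names `digests_for` and `checksum_matches?` and the sample payload are illustrative assumptions; in practice the bytes would be read from the extracted archive.

```ruby
require "digest"

# Hypothetical helper: compute both digests that checksums.yaml records for one archive.
def digests_for(bytes)
  {
    "SHA256" => Digest::SHA256.hexdigest(bytes),
    "SHA512" => Digest::SHA512.hexdigest(bytes)
  }
end

# Hypothetical helper: true when the computed digest equals the recorded hex string.
def checksum_matches?(bytes, algorithm, expected_hex)
  digests_for(bytes).fetch(algorithm) == expected_hex.strip.downcase
end

# Stand-in payload; real input would be the raw bytes of metadata.gz or data.tar.gz.
payload = "example archive bytes"
recorded = digests_for(payload)

puts checksum_matches?(payload, "SHA256", recorded["SHA256"])       # unchanged bytes
puts checksum_matches?(payload + "x", "SHA256", recorded["SHA256"]) # altered bytes
```

A SHA256 hex digest is 64 characters long and a SHA512 digest is 128, which matches the lengths of the values in the hunk above.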
data/.gitignore
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,14 @@
 # Change Log
 
+## 0.2.0 - 2022-10-02
+- spaCy 3.7.0 supported
+
+## 0.2.0 - 2022-10-02
+### Added
+- `Doc::openai_query`
+- `Doc::openai_completion`
+- `Doc::openai_embeddings`
+
 ## 0.1.4.1 - 2021-07-06
 - Test code refined
 - `Spacy::Language::most_similar` returns an array of hash-based objects that accepts method calls
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -11,6 +11,11 @@
 | ✅ | Named entity recognition |
 | ✅ | Syntactic dependency visualization |
 | ✅ | Access to pre-trained word vectors |
+| ✅ | OpenAI Chat/Completion/Embeddings API integration |
+
+Current Version: `0.2.2`
+
+- Addressed installation issues in some environments
 
 ## Installation of prerequisites
 
@@ -32,6 +37,7 @@ Then, install [spaCy](https://spacy.io/). If you use `pip`, the following comman
 $ pip install spacy
 ```
 
+
 Install trained language models. For a starter, `en_core_web_sm` will be the most useful to conduct basic text processing in English. However, if you want to use advanced features of spaCy, such as named entity recognition or document similarity calculation, you should also install a larger model like `en_core_web_lg`.
 
 
@@ -469,9 +475,6 @@ Output:
 | 10 | marseille | 0.6370999813079834 |
 
 
-
-
-
 ### Word vector calculation (Japanese)
 
 **東京 - 日本 + フランス = パリ ?**
@@ -518,6 +521,169 @@ Output:
 | 9 | アルザス | 0.5644999742507935 |
 | 10 | 南仏 | 0.5547999739646912 |
 
+
+## OpenAI API Integration
+
+Easily leverage GPT models within ruby-spacy by using an OpenAI API key. When constructing prompts for the `Doc::openai_query` method, you can incorporate various token properties from the document. These properties are retrieved through function calls and seamlessly integrated into your prompt (`gpt-3.5-turbo-0613` or greater is needed). The available properties include:
+
+- `surface`
+- `lemma`
+- `tag`
+- `pos` (part of speech)
+- `dep` (dependency)
+- `ent_type` (entity type)
+- `morphology`
+
+### GPT Prompting 1
+
+Ruby code:
+
+```ruby
+
+require "ruby-spacy"
+
+api_key = ENV["OPENAI_API_KEY"]
+nlp = Spacy::Language.new("en_core_web_sm")
+doc = nlp.read("The Beatles released 12 studio albums")
+
+# default parameter values
+# max_tokens: 1000
+# temperature: 0.7
+# model: "gpt-3.5-turbo-0613"
+res1 = doc.openai_query(
+  access_token: api_key,
+  prompt: "Translate the text to Japanese."
+)
+puts res1
+```
+
+Output:
+
+> ビートルズは12枚のスタジオアルバムをリリースしました。
+
+### GPT Prompting 2
+
+Ruby code:
+
+```ruby
+require "ruby-spacy"
+
+api_key = ENV["OPENAI_API_KEY"]
+nlp = Spacy::Language.new("en_core_web_sm")
+doc = nlp.read("The Beatles were an English rock band formed in Liverpool in 1960.")
+
+res = doc.openai_query(
+  access_token: api_key,
+  prompt: "Extract the topic of the document and list 10 entities (names, concepts, locations, etc.) that are relevant to the topic."
+)
+```
+
+Output:
+
+> Topic: The Beatles
+>
+> Entities:
+> 1. The Beatles (band)
+> 2. English (nationality)
+> 3. Rock band
+> 4. Liverpool (city)
+> 5. 1960 (year)
+> 6. John Lennon (member)
+> 7. Paul McCartney (member)
+> 8. George Harrison (member)
+> 9. Ringo Starr (member)
+> 10. Music
+
+### GPT Prompting 3
+
+Ruby code:
+
+```ruby
+require "ruby-spacy"
+
+api_key = ENV["OPENAI_API_KEY"]
+nlp = Spacy::Language.new("en_core_web_sm")
+
+res = doc.openai_query(
+  access_token: api_key,
+  model: "gpt-4",
+  prompt: "Generate a tree diagram from the text in the following style: [S [NP [Det the] [N cat]] [VP [V sat] [PP [P on] [NP the mat]]]"
+)
+puts res
+```
+
+Output:
+
+```
+[S
+  [NP
+    [Det The]
+    [N Beatles]
+  ]
+  [VP
+    [V released]
+    [NP
+      [Num 12]
+      [N
+        [N studio]
+        [N albums]
+      ]
+    ]
+  ]
+]
+```
+
+### GPT Text Completion
+
+Ruby code:
+
+```ruby
+require "ruby-spacy"
+
+api_key = ENV["OPENAI_API_KEY"]
+nlp = Spacy::Language.new("en_core_web_sm")
+doc = nlp.read("Vladimir Nabokov was a")
+
+# default parameter values
+# max_tokens: 1000
+# temperature: 0.7
+# model: "gpt-3.5-turbo-0613"
+res = doc.openai_completion(access_token: api_key)
+puts res
+```
+
+Output:
+
+> Russian-American novelist and lepidopterist. He was born in 1899 in St. Petersburg, Russia, and later emigrated to the United States in 1940. Nabokov is best known for his novel "Lolita," which was published in 1955 and caused much controversy due to its controversial subject matter. Throughout his career, Nabokov wrote many other notable works, including "Pale Fire" and "Ada or Ardor: A Family Chronicle." In addition to his writing, Nabokov was also a passionate butterfly collector and taxonomist, publishing several scientific papers on the subject. He passed away in 1977, leaving behind a rich literary legacy.
+
+### Text Embeddings
+
+Ruby code:
+
+```ruby
+require "ruby-spacy"
+
+api_key = ENV["OPENAI_API_KEY"]
+nlp = Spacy::Language.new("en_core_web_sm")
+doc = nlp.read("Vladimir Nabokov was a Russian-American novelist, poet, translator and entomologist.")
+
+# default model: text-embedding-ada-002
+res = doc.openai_embeddings(access_token: api_key)
+
+puts res
+```
+
+Output:
+
+```
+-0.00208362
+-0.01645165
+0.0110955965
+0.012802119
+0.0012175755
+...
+```
+
 ## Author
 
 Yoichiro Hasebe [<yohasebe@gmail.com>]
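The `Doc::openai_embeddings` example in the README diff above prints a list of floats. A common use of such vectors is to compare two texts by cosine similarity. The following is a minimal sketch in plain Ruby: it assumes each embedding is available as an `Array` of `Float` (which the printed output suggests but this diff does not confirm), and `cosine_similarity` is a hypothetical helper, not part of ruby-spacy.

```ruby
# Hypothetical helper, not part of ruby-spacy:
# cosine similarity of two equal-length numeric vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  raise ArgumentError, "zero vector has no direction" if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Toy vectors standing in for real embeddings (assumed data, not API output).
same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal     = cosine_similarity([1.0, 0.0, 1.0], [0.0, 1.0, 0.0])

puts same_direction.round(6) # close to 1 for vectors pointing the same way
puts orthogonal.round(6)     # 0 for orthogonal vectors
```

Values near 1 indicate similar direction and values near 0 indicate unrelated vectors; floating-point rounding means the result is compared with a tolerance rather than with `==`.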
File without changes
File without changes
File without changes
File without changes
File without changes
data/examples/linguistic_features/iterating_children.rb
CHANGED
@@ -1,5 +1,8 @@
 # frozen_string_literal: true
 
+# add path to ruby-spacy lib to load path
+$LOAD_PATH.unshift(File.expand_path("../../lib", __dir__))
+
 require "ruby-spacy"
 require "terminal-table"
 
@@ -17,6 +20,6 @@ doc.each do |token|
   end
 end
 
-puts results
+puts results
 
 # ["shift"]
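The added `$LOAD_PATH.unshift` line makes the example load the library from the repository checkout instead of relying on an installed gem. As a small sketch of how that path resolves, the snippet below applies the same `File.expand_path` call to a made-up directory. The path `/repo/examples/linguistic_features` is an illustrative assumption standing in for `__dir__`, and the resolved value shown in the comment assumes a Unix-like system.

```ruby
# Stand-in for __dir__ of an example script (hypothetical path).
example_dir = "/repo/examples/linguistic_features"

# Same call as in the diff: go up two directories, then into lib.
lib_path = File.expand_path("../../lib", example_dir)
puts lib_path # "/repo/lib" on a Unix-like system

# unshift places the path at the front of the load path,
# so `require` looks there before the other entries.
$LOAD_PATH.unshift(lib_path) unless $LOAD_PATH.include?(lib_path)
puts $LOAD_PATH.first == lib_path
```

Because the path is computed relative to the script's own directory, the example works regardless of the directory it is run from.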
File without changes