ohmyscrapper 0.2.1.tar.gz → 0.2.3.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,9 +1,9 @@
  Metadata-Version: 2.3
  Name: ohmyscrapper
- Version: 0.2.1
+ Version: 0.2.3
  Summary: This project aims to create a text-based scraper containing links to create a final PDF with general information about job openings.
- Author: Cesar Cardoso gh@bouli
- Author-email: Cesar Cardoso gh@bouli <hello@cesarcardoso.cc>
+ Author: Cesar Cardoso
+ Author-email: Cesar Cardoso <hello@cesarcardoso.cc>
  Requires-Dist: beautifulsoup4>=4.14.3
  Requires-Dist: google-genai>=1.55.0
  Requires-Dist: markdown>=3.10
@@ -16,13 +16,11 @@ Requires-Dist: urlextract>=1.9.0
  Requires-Python: >=3.11
  Description-Content-Type: text/markdown

- # OhMyScrapper - v0.2.1
+ # OhMyScrapper - v0.2.3

  This project aims to create a text-based scraper containing links to create a
  final PDF with general information about job openings.

- > This project is using [uv](https://docs.astral.sh/uv/getting-started/installation/#standalone-installer) by default.
-
  ## Scope

  - Read texts;
@@ -31,11 +29,23 @@ final PDF with general information about job openings.

  ## Installation

+ You can install directly in your `pip`:
+ ```shell
+ pip install ohmyscrapper
+ ```
+
  I recomend to use the [uv](https://docs.astral.sh/uv/getting-started/installation/#standalone-installer), so you can just use the command bellow and everything is installed:
  ```shell
- uv sync
+ uv add ohmyscrapper
+ uv run ohmyscrapper --version
  ```

+ But you can use everything as a tool, for example:
+ ```shell
+ uvx ohmyscrapper --version
+ ```
+
+
  ## How to use and test (development only)

  OhMyScrapper works in 3 stages:
@@ -46,7 +56,7 @@ OhMyScrapper works in 3 stages:

  You can do 3 stages with the command:
  ```shell
- make start
+ ohmyscrapper start
  ```
  > Remember to add your text file in the folder `/input` with the name `_chat.txt`!

@@ -66,17 +76,17 @@ use the whatsapp history, but it works with any txt file.
  The default file is `input/_chat.txt`. If you have the default file you just use
  the command `load`:
  ```shell
- make load
+ ohmyscrapper load
  ```
  or, if you have another file, just use the argument `-file` like this:
  ```shell
- uv run main.py load -file=my-text-file.txt
+ ohmyscrapper load -file=my-text-file.txt
  ```
  That will create a database if it doesn't exist and store every url the oh-my-scrapper
  find. After that, let's scrap the urls with the command `scrap-urls`:

  ```shell
- make scrap-urls
+ ohmyscrapper scrap-urls --recursive --ignore-type
  ```

  That will scrap only the linkedin urls we are interested in. For now they are:
@@ -88,23 +98,33 @@ That will scrap only the linkedin urls we are interested in. For now they are:

  But we can use every other one generically using the argument `--ignore-type`:
  ```shell
- uv run main.py scrap-urls --ignore-type
+ ohmyscrapper scrap-urls --ignore-type
  ```

  And we can ask to make it recursively adding the argument `--recursive`:
  ```shell
- uv run main.py scrap-urls --recursive
+ ohmyscrapper scrap-urls --recursive
  ```
  > !!! important: we are not sure about blocks we can have for excess of requests

  And we can finally export with the command:
  ```shell
- make export
+ ohmyscrapper export
+ ohmyscrapper export --file=output/urls-simplified.csv --simplify
+ ohmyscrapper report
  ```


  That's the basic usage!
  But you can understand more using the help:
  ```shell
- uv run main.py --help
+ ohmyscrapper --help
  ```
+
+ ## See Also
+
+ - Github: https://github.com/bouli/ohmyscrapper
+ - PyPI: https://pypi.org/project/ohmyscrapper/
+
+ ## License
+ This package is distributed under the [MIT license](https://opensource.org/license/MIT).
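The visible changes in this metadata are the version bump and the author cleanup. As a quick cross-check, the standard-library `importlib.metadata` reports exactly these fields once the release is installed; a minimal sketch, assuming ohmyscrapper 0.2.3 is present in the current environment:

```python
# Confirm the installed distribution matches the metadata diff above.
# Assumes ohmyscrapper has already been installed (e.g. `pip install ohmyscrapper`).
from importlib.metadata import metadata, version

dist = metadata("ohmyscrapper")
print(dist["Name"], version("ohmyscrapper"))  # expected: ohmyscrapper 0.2.3
print(dist["Author-email"])                   # expected: Cesar Cardoso <hello@cesarcardoso.cc>
```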
@@ -1,10 +1,8 @@
- # OhMyScrapper - v0.2.1
+ # OhMyScrapper - v0.2.3

  This project aims to create a text-based scraper containing links to create a
  final PDF with general information about job openings.

- > This project is using [uv](https://docs.astral.sh/uv/getting-started/installation/#standalone-installer) by default.
-
  ## Scope

  - Read texts;
@@ -13,11 +11,23 @@ final PDF with general information about job openings.

  ## Installation

+ You can install directly in your `pip`:
+ ```shell
+ pip install ohmyscrapper
+ ```
+
  I recomend to use the [uv](https://docs.astral.sh/uv/getting-started/installation/#standalone-installer), so you can just use the command bellow and everything is installed:
  ```shell
- uv sync
+ uv add ohmyscrapper
+ uv run ohmyscrapper --version
  ```

+ But you can use everything as a tool, for example:
+ ```shell
+ uvx ohmyscrapper --version
+ ```
+
+
  ## How to use and test (development only)

  OhMyScrapper works in 3 stages:
@@ -28,7 +38,7 @@ OhMyScrapper works in 3 stages:

  You can do 3 stages with the command:
  ```shell
- make start
+ ohmyscrapper start
  ```
  > Remember to add your text file in the folder `/input` with the name `_chat.txt`!

@@ -48,17 +58,17 @@ use the whatsapp history, but it works with any txt file.
  The default file is `input/_chat.txt`. If you have the default file you just use
  the command `load`:
  ```shell
- make load
+ ohmyscrapper load
  ```
  or, if you have another file, just use the argument `-file` like this:
  ```shell
- uv run main.py load -file=my-text-file.txt
+ ohmyscrapper load -file=my-text-file.txt
  ```
  That will create a database if it doesn't exist and store every url the oh-my-scrapper
  find. After that, let's scrap the urls with the command `scrap-urls`:

  ```shell
- make scrap-urls
+ ohmyscrapper scrap-urls --recursive --ignore-type
  ```

  That will scrap only the linkedin urls we are interested in. For now they are:
@@ -70,23 +80,33 @@ That will scrap only the linkedin urls we are interested in. For now they are:

  But we can use every other one generically using the argument `--ignore-type`:
  ```shell
- uv run main.py scrap-urls --ignore-type
+ ohmyscrapper scrap-urls --ignore-type
  ```

  And we can ask to make it recursively adding the argument `--recursive`:
  ```shell
- uv run main.py scrap-urls --recursive
+ ohmyscrapper scrap-urls --recursive
  ```
  > !!! important: we are not sure about blocks we can have for excess of requests

  And we can finally export with the command:
  ```shell
- make export
+ ohmyscrapper export
+ ohmyscrapper export --file=output/urls-simplified.csv --simplify
+ ohmyscrapper report
  ```


  That's the basic usage!
  But you can understand more using the help:
  ```shell
- uv run main.py --help
+ ohmyscrapper --help
  ```
+
+ ## See Also
+
+ - Github: https://github.com/bouli/ohmyscrapper
+ - PyPI: https://pypi.org/project/ohmyscrapper/
+
+ ## License
+ This package is distributed under the [MIT license](https://opensource.org/license/MIT).
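The README change above replaces the old `make`/`uv run main.py` invocations with the installed `ohmyscrapper` console script. For anyone who wants to drive the same load → scrap → export pipeline from a script rather than a shell, here is a minimal sketch using only the standard library; it assumes the `ohmyscrapper` entry point is on PATH and that `input/_chat.txt` exists, and the individual commands are taken verbatim from the diff:

```python
# Run the pipeline described in the updated README, one CLI step at a time.
import subprocess

steps = [
    ["ohmyscrapper", "load"],                                        # parse the text file and store URLs
    ["ohmyscrapper", "scrap-urls", "--recursive", "--ignore-type"],  # scrape stored URLs, following new ones
    ["ohmyscrapper", "export"],                                      # default CSV export
    ["ohmyscrapper", "export", "--file=output/urls-simplified.csv", "--simplify"],
    ["ohmyscrapper", "report"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # stop at the first failing step
```

Per the main.py hunk further down, the new `ohmyscrapper start` command appears to chain essentially this same sequence with default settings.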
@@ -1,10 +1,10 @@
  [project]
  name = "ohmyscrapper"
- version = "0.2.1"
+ version = "0.2.3"
  description = "This project aims to create a text-based scraper containing links to create a final PDF with general information about job openings."
  readme = "README.md"
  authors = [
-     { name = "Cesar Cardoso gh@bouli", email = "hello@cesarcardoso.cc" }
+     { name = "Cesar Cardoso", email = "hello@cesarcardoso.cc" }
  ]
  requires-python = ">=3.11"
  dependencies = [
@@ -19,12 +19,19 @@ from ohmyscrapper.modules.merge_dbs import merge_dbs

  def main():
      parser = argparse.ArgumentParser(prog="ohmyscrapper")
-     parser.add_argument("--version", action="version", version="%(prog)s v0.2.1")
+     parser.add_argument("--version", action="version", version="%(prog)s v0.2.3")

      subparsers = parser.add_subparsers(dest="command", help="Available commands")
+     start_parser = subparsers.add_parser(
+         "start", help="Make the entire process of loading, processing and exporting with the default configuration."
+     )
+
+     start_parser.add_argument(
+         "--ai", default=False, help="Make the entire process of loading, processing, reprocessing with AI and exporting with the default configuration.", action="store_true"
+     )

      ai_process_parser = subparsers.add_parser(
-         "process-with-ai", help="Process with AI."
+         "ai", help="Process with AI."
      )
      ai_process_parser.add_argument(
          "--history", default=False, help="Reprocess ai history", action="store_true"
@@ -157,6 +164,16 @@ def main():
          merge_dbs()
          return

+     if args.command == "start":
+         load_txt()
+         scrap_urls(recursive=True,ignore_valid_prefix=True,randomize=False,only_parents=False)
+         if args.ai:
+             process_with_ai()
+         export_urls()
+         export_urls(csv_file="output/urls-simplified.csv", simplify=True)
+         export_report()
+         return
+

  if __name__ == "__main__":
      main()
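The code change above renames `process-with-ai` to `ai` and adds a `start` subcommand whose `--ai` flag toggles the AI reprocessing step inside an otherwise fixed pipeline. To make that control flow easy to see outside the full codebase, here is a minimal, self-contained sketch of the same argparse pattern; the stub functions below are placeholders standing in for ohmyscrapper's real modules, not its actual implementation:

```python
# Sketch of the dispatch pattern introduced in the diff: a `start` subcommand
# with an --ai flag that inserts one extra step into the default pipeline.
import argparse

def load_txt():         print("loading input text")
def scrap_urls(**kw):   print("scraping urls", kw)
def process_with_ai():  print("reprocessing with AI")
def export_urls(**kw):  print("exporting", kw)
def export_report():    print("writing report")

parser = argparse.ArgumentParser(prog="ohmyscrapper")
subparsers = parser.add_subparsers(dest="command")
start_parser = subparsers.add_parser("start", help="Run the whole pipeline with defaults.")
start_parser.add_argument("--ai", action="store_true", help="Also reprocess results with AI.")

args = parser.parse_args(["start", "--ai"])  # equivalent to: ohmyscrapper start --ai
if args.command == "start":
    load_txt()
    scrap_urls(recursive=True, ignore_valid_prefix=True)
    if args.ai:                 # --ai simply gates the extra AI pass
        process_with_ai()
    export_urls()
    export_urls(csv_file="output/urls-simplified.csv", simplify=True)
    export_report()
```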