dev-laiser 0.2.2__tar.gz → 0.2.4__tar.gz

@@ -1,6 +1,6 @@
  Metadata-Version: 2.2
  Name: dev-laiser
- Version: 0.2.2
+ Version: 0.2.4
  Summary: LAiSER (Leveraging Artificial Intelligence for Skill Extraction & Research) is a tool designed to help learners, educators, and employers extract and share trusted information about skills. It uses a fine-tuned language model to extract raw skill keywords from text, then aligns them with a predefined taxonomy. You can find more technical details in the project’s paper.md and an overview in the README.md.
  Home-page: https://github.com/LAiSER-Software/extract-module
  Author: Satya Phanindra Kumar Kalaga, Bharat Khandelwal, Prudhvi Chekuri
@@ -75,7 +75,7 @@ LAiSER is a tool that helps learners, educators and employers share trusted and
  Before proceeding to LAiSER, you'd want to follow the steps below to install the required dependencies:
  - Clone the repository using
  ```shell
- git clone https://github.com/Micah-Sanders/LAiSER.git
+ git clone https://github.com/LAiSER-Software/extract-module.git
  ```
  or download the [zip(link)](https://github.com/Micah-Sanders/LAiSER/archive/refs/heads/main.zip) file and extract it.
 
@@ -104,7 +104,7 @@ To use LAiSER as a command line tool, follow the steps below:
 
  - Navigate to the root directory of the repository and run the command below:
  ```shell
- pip install laiser-dev
+ pip install dev-laiser
  ```
 
  - Once the installation is complete, you can run the tool using the command below:
@@ -34,7 +34,7 @@ LAiSER is a tool that helps learners, educators and employers share trusted and
  Before proceeding to LAiSER, you'd want to follow the steps below to install the required dependencies:
  - Clone the repository using
  ```shell
- git clone https://github.com/Micah-Sanders/LAiSER.git
+ git clone https://github.com/LAiSER-Software/extract-module.git
  ```
  or download the [zip(link)](https://github.com/Micah-Sanders/LAiSER/archive/refs/heads/main.zip) file and extract it.
 
@@ -63,7 +63,7 @@ To use LAiSER as a command line tool, follow the steps below:
 
  - Navigate to the root directory of the repository and run the command below:
  ```shell
- pip install laiser-dev
+ pip install dev-laiser
  ```
 
  - Once the installation is complete, you can run the tool using the command below:
@@ -1,6 +1,6 @@
  Metadata-Version: 2.2
  Name: dev-laiser
- Version: 0.2.2
+ Version: 0.2.4
  Summary: LAiSER (Leveraging Artificial Intelligence for Skill Extraction & Research) is a tool designed to help learners, educators, and employers extract and share trusted information about skills. It uses a fine-tuned language model to extract raw skill keywords from text, then aligns them with a predefined taxonomy. You can find more technical details in the project’s paper.md and an overview in the README.md.
  Home-page: https://github.com/LAiSER-Software/extract-module
  Author: Satya Phanindra Kumar Kalaga, Bharat Khandelwal, Prudhvi Chekuri
@@ -75,7 +75,7 @@ LAiSER is a tool that helps learners, educators and employers share trusted and
  Before proceeding to LAiSER, you'd want to follow the steps below to install the required dependencies:
  - Clone the repository using
  ```shell
- git clone https://github.com/Micah-Sanders/LAiSER.git
+ git clone https://github.com/LAiSER-Software/extract-module.git
  ```
  or download the [zip(link)](https://github.com/Micah-Sanders/LAiSER/archive/refs/heads/main.zip) file and extract it.
 
@@ -104,7 +104,7 @@ To use LAiSER as a command line tool, follow the steps below:
 
  - Navigate to the root directory of the repository and run the command below:
  ```shell
- pip install laiser-dev
+ pip install dev-laiser
  ```
 
  - Once the installation is complete, you can run the tool using the command below:
@@ -48,7 +48,7 @@ Rev No. Date Author Description
  [1.0.0] 07/10/2024 Satya Phanindra K. Define all the LLM methods being used in the project
  [1.0.1] 07/19/2024 Satya Phanindra K. Add descriptions to each method
  [1.0.2] 11/24/2024 Prudhvi Chekuri Add support for skills extraction from syllabi data
- [1.0.3] 11/25/2024 Satya Phanindra K. Add support for skills extraction from course outcomes data
+ [1.0.3] 03/12/2025 Prudhvi Chekuri Implement functions to extract levels, KSAs from job descriptions and syllabi data using vLLM
 
  TODO:
  -----
@@ -220,20 +220,21 @@ def get_completion(input_text, text_columns, input_type, model, tokenizer) -> st
 
 
  def parse_output_vllm(response):
- # TODO: Verify the docstring and update missing/incorrect information
+
  """
- Parse the output from the VLLM model to extract skills, levels, knowledge required, and task abilities.
+ Parse the model's response to extract key skills, knowledge required, and task abilities.
 
  Parameters
  ----------
  response : str
- The model's response containing the structured information about skills.
+ The model's response after processing the prompt.
 
  Returns
  -------
- list: List of dictionaries containing the extracted skills, levels, knowledge required, and task abilities.
+ list: List of dictionaries that has levels, KSAs for all the data points in the input text.
+
  """
-
+
  out = []
  # Split into items, handling optional '->' prefix and multi-line input
  items = [item.strip() for item in response.split('->') if item.strip()]
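Review note: the last context line of this hunk splits the raw response on `'->'` and discards empty fragments. The sketch below isolates that one line so its behaviour is easy to check. The helper name `split_items` and the sample response text are assumptions made for illustration; the real response format produced by the prompt is not shown in this diff.

```python
def split_items(response: str) -> list[str]:
    """Split a model response on '->' and drop empty fragments.

    Mirrors the list comprehension shown in the hunk above; the
    function name is hypothetical and not part of the package.
    """
    return [item.strip() for item in response.split('->') if item.strip()]


# Hypothetical response text: each item optionally starts with '->'
# and may span several lines.
sample = "-> Skill: Python\nLevel: 3\n-> Skill: SQL\nLevel: 2\n"
items = split_items(sample)
print(items)
```

Because empty fragments are filtered out, a response that begins with `'->'` and one that does not yield the same items, which appears to be what the "optional '->' prefix" comment refers to.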
@@ -289,6 +290,7 @@ def create_ksa_prompt(query, input_type, num_key_skills, num_key_kr, num_key_tas
  -------
  str
  The formatted prompt for the KSA extraction task.
+
  """
 
  prompt_template = """user
@@ -352,6 +354,32 @@ model
 
  def vllm_batch_generate(llm, queries, input_type, batch_size=32, num_key_skills=5, num_key_kr='3-5', num_key_tas='3-5'):
 
+ """
+ Generate completions for a batch of queries using the model.
+
+ Parameters
+ ----------
+ llm : model
+ The model to use for generating completions
+ queries : pandas DataFrame
+ The queries to get completions for using the model
+ input_type : str
+ Type of input data - 'job_desc' / 'syllabus' etc. (Default: 'job_desc')
+ batch_size : int, optional
+ Preferred batch size to use for generating completions
+ num_key_skills : int, optional
+ Number of key skills to extract from the input text
+ num_key_kr : str, optional
+ Number of key knowledge required items to extract from the input text
+ num_key_tas : str, optional
+ Number of key task abilities items to extract from the input text
+
+ Returns
+ -------
+ list: List of completions generated by the model for the input queries
+
+ """
+
  result = []
 
  sampling_params = SamplingParams(max_tokens=1000)
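Review note: the new docstring describes `batch_size` as the preferred batch size for generating completions, but the loop body of `vllm_batch_generate` is outside this hunk. The following is a minimal sketch of fixed-size batching under that reading, not the package's implementation. The helper `iter_batches` and the sample prompts are hypothetical, and plain strings stand in for the prompts built from the `queries` DataFrame.

```python
from typing import Iterator, List, Sequence


def iter_batches(prompts: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` prompts.

    Hypothetical helper: the final batch may be shorter than the rest.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])


# Five made-up prompts split into batches of two.
batches = list(iter_batches(["p1", "p2", "p3", "p4", "p5"], batch_size=2))
print(batches)
```

Each yielded batch could then be passed to the model in a single generate call, with the per-batch outputs appended to `result` in order.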
@@ -367,6 +395,29 @@ def vllm_batch_generate(llm, queries, input_type, batch_size=32, num_key_skills=
 
  def get_completion_vllm(input_text, text_columns, id_column, input_type, llm, batch_size=4) -> list:
 
+ """
+ Get completions for whole input data and parse the required KSAs from the model responses. The input data can be a job description or syllabi data.
+
+ Parameters
+ ----------
+ input_text : pandas DataFrame
+ The input data to get completions for using the model
+ text_columns : list
+ List of columns in the input_text dataframe that contain the text data. (Default: ['description'])
+ id_column : str
+ Column name in the input_text dataframe that contains the unique identifier for each row
+ input_type : str
+ Type of input data - 'job_desc' / 'syllabus' etc. (Default: 'job_desc')
+ llm : model
+ The model to use for generating completions
+ batch_size : int, optional
+ Preferred batch size to use for generating completions
+
+ Returns
+ -------
+ list: List of dictionaries that has levels, KSAs for all the data points in the input text.
+ """
+
  result = vllm_batch_generate(llm, input_text, input_type=input_type, batch_size=batch_size)
 
  parsed_output = []
@@ -40,7 +40,7 @@ Rev No. Date Author Description
  [1.0.0] 06/01/2024 Vedant M. Initial Version
  [1.0.1] 06/10/2024 Vedant M. added paths for input and output
  [1.0.2] 07/01/2024 Satya Phanindra K. updated threshold for similarity and AI model ID
-
+ [1.0.3] 03/12/2025 Prudhvi Chekuri Remove unnecessary params
 
  TODO:
  -----
@@ -51,10 +51,8 @@ import os
  from dotenv import load_dotenv
 
  ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
- INPUT_PATH = os.path.join(ROOT_DIR, 'input')
- OUTPUT_PATH = os.path.join(ROOT_DIR, 'output')
 
- SKILL_DB_PATH = os.path.join(INPUT_PATH, 'combined.csv')
+ SKILL_DB_PATH = os.path.join('https://raw.githubusercontent.com/LAiSER-Software/datasets/refs/heads/master/taxonomies/combined.csv')
 
 
  SIMILARITY_THRESHOLD = 0.85
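Review note: in 0.2.4 `SKILL_DB_PATH` wraps a single URL in `os.path.join`. With only one argument, `os.path.join` returns that argument unchanged, so the constant is simply the URL string. The sketch below checks that; the variable names are illustrative and how the package later loads the CSV is not shown in this diff.

```python
import os

# URL copied from the added line in the hunk above.
url = (
    "https://raw.githubusercontent.com/LAiSER-Software/datasets/"
    "refs/heads/master/taxonomies/combined.csv"
)

# With a single component there is nothing to join, so the value
# comes back as-is.
skill_db_path = os.path.join(url)
print(skill_db_path == url)
```

If the consumer reads the taxonomy with `pandas.read_csv`, a URL string is accepted directly, so a plain string assignment would behave the same. Any code that treats `SKILL_DB_PATH` as a local file path (for example an existence check) would need to account for the change.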
@@ -62,7 +62,7 @@ Rev No. Date Author Description
  [1.0.8] 07/11/2024 Satya Phanindra K. Calculate cosine similarities in bulk for optimal performance.
  [1.0.9] 07/15/2024 Satya Phanindra K. Error handling for empty list outputs from extract_raw function
  [1.0.10] 11/24/2024 Prudhvi Chekuri Added support for skills extraction from syllabi data
- [1.0.11] 03/12/2025 Satya Phanindra K. Update extractor function to handle syllabus data
+ [1.1.0] 03/12/2025 Prudhvi Chekuri Added support for extracting KSAs from text and aligning them to the taxonomy
 
 
  TODO:
@@ -105,22 +105,33 @@ class Skill_Extractor:
 
  Attributes
  ----------
- client : HuggingFace API client
- nlp : spacy nlp model
+ model_id: string
+ Model ID for Large Language Model
+ HF_TOKEN: string
+ HuggingFace Token for restricted models under gated HF repos.
+ use_gpu: boolean
+ Flag to use GPU for Large Language Model
+ nlp: spacy model
+ Spacy model for NER
+ skill_db_df: pandas dataframe
+ Dataframe containing taxonomy skills
+ skill_db_embeddings: numpy array
+ Array containing embeddings of taxonomy skills
+ llm: LLM model
+ Large Language Model for skill extraction
+ ner_extractor: SkillExtractor
+ SkillNer model for CPU skill extraction
 
  Methods
  -------
  extract_raw(input_text: text)
  The function extracts skills from text using NER model
-
- align_skills(raw_skills: list, document_id='0': string):
- This function aligns the skills provided to the desired taxonomy
-
- align_KSAs(extracted_df: pandas dataframe, id_column='Research ID'):
- This function aligns the skills provided to the desired taxonomy
 
  extractor(data: pandas dataframe, id_column='Research ID', text_column='Text'):
  Function takes text dataset to extract and aligns skills based on available taxonomies
+
+ align_KSAs(extracted_df: pandas dataframe, id_column='Research ID'):
+ This function aligns the KSAs provided to the available taxonomy
  ....
 
  """
@@ -156,6 +167,8 @@ class Skill_Extractor:
  ----------
  input_text : pandas Series with text data
  Job advertisement / Job Description / Syllabus Description / Course Outcomes etc.
+ id_column: string
+ Name of id column in the dataset. Defaults to 'Research ID'
  text_columns: list
  Name of the text columns in the dataset. Defaults to 'description'
  input_type: string
@@ -165,11 +178,6 @@ class Skill_Extractor:
  -------
  list: List of extracted skills from text
 
- Notes
- -----
- More details on which (pre-trained) language model is fine-tuned can be found in llm_methods.py
- The Function is designed only to return list of skills based on prompt passed to OpenAI's Fine-tuned model.
-
  """
 
  if torch.cuda.is_available() and self.use_gpu:
@@ -247,13 +255,14 @@ class Skill_Extractor:
 
 
  def align_KSAs(self, extracted_df, id_column):
+
  """
- This function aligns the skills provided to the available taxonomy
+ This function aligns the KSAs provided to the available taxonomy
 
  Parameters
  ----------
  extracted_df : pandas dataframe
- Provide dataframe of skills extracted from Job Descriptions / Syllabus.
+ Dataset containing extracted KSAs from text and their details.
  id_column: string
  Name of id column in the dataset. Defaults to 'Research ID'
 
@@ -319,6 +328,7 @@ class Skill_Extractor:
 
  Returns
  -------
+ For CPU:
  list: List of skill tags and similarity_score for all texts in from text in JSON format
  [
  {
@@ -334,6 +344,18 @@ class Skill_Extractor:
  },
  ...
  ]
+
+ For GPU:
+ pandas dataframe with below columns:
+ - "Research ID": text_id
+ - "Description": text description
+ - "Learning Outcomes": learning outcomes
+ - "Raw Skill": Raw skill extracted
+ - "Level": Level of the skill
+ - "Knowledge Required": Knowledge required for the skill
+ - "Task Abilities": Task abilities
+ - "Skill Tag": taxonomy skill tag
+ - "Correlation Coefficient": similarity_score
 
  """
 
@@ -2,7 +2,7 @@ from setuptools import setup, find_packages
 
  setup(
  name='dev-laiser',
- version='0.2.2',
+ version='0.2.4',
  author='Satya Phanindra Kumar Kalaga, Bharat Khandelwal, Prudhvi Chekuri',
  author_email='phanindra.connect@gmail.com',
  description='LAiSER (Leveraging Artificial Intelligence for Skill Extraction & Research) is a tool designed to help learners, educators, and employers extract and share trusted information about skills. It uses a fine-tuned language model to extract raw skill keywords from text, then aligns them with a predefined taxonomy. You can find more technical details in the project’s paper.md and an overview in the README.md.',
File without changes