PgsFile 0.2.3__py3-none-any.whl → 0.2.5__py3-none-any.whl
Potentially problematic release: this version of PgsFile might be problematic.
- PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/HK-Press releases of the Financial Secretary Office (2007-2019).tsv +7348 -0
- PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Hong Kong bilingual court decisions (1997-2017).tsv +20000 -0
- PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/HongKong-Legislation.tsv +20000 -0
- PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Offering documents of financial products (updated as of October 2018).tsv +20000 -0
- PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Speeches delivered by SFC Executives (2006-2019).tsv +4680 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2006.txt +46 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2008.txt +48 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2009.txt +42 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2010.txt +42 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2011.txt +38 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2012.txt +28 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2013.txt +42 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2014.txt +68 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2015.txt +106 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2016.txt +82 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2017.txt +90 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2018.txt +136 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2019.txt +112 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2020.txt +124 -0
- PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2021.txt +94 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100201_000150_en.txt +6 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100201_000150_zh.txt +6 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100213_000135_en.txt +17 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100213_000135_zh.txt +17 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100215_000445_en.txt +10 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100215_000445_zh.txt +10 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000135_en.txt +12 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000135_zh.txt +12 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000205_en.txt +5 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000205_zh.txt +5 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000548_en.txt +9 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000548_zh.txt +9 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100225_001011_en.txt +8 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100225_001011_zh.txt +8 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000129_en.txt +8 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000129_zh.txt +8 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000649_en.txt +13 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000649_zh.txt +13 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100301_000549_en.txt +8 -0
- PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100301_000549_zh.txt +8 -0
- PgsFile/Corpora/Corpora/Parallel/Xi's Speech_CE_2021/Speech at a Ceremony Marking the Centenary of the CPC.txt +144 -0
- PgsFile/PgsFile.py +516 -33
- PgsFile/__init__.py +13 -3
- PgsFile/models/NLPIR.user +0 -0
- PgsFile/models/fonts/DejaVuSans.ttf +0 -0
- PgsFile/models/fonts//321/204/342/225/243/320/266/321/204/342/225/234/320/243/321/205/320/255/320/232/321/210/342/225/241/342/225/241/321/204/342/225/243/320/255/321/206/342/226/222/320/257/321/211/320/242/320/262/321/207/320/274/320/244/321/210/320/261/320/234/321/204/342/225/243/320/266/321/204/342/225/234/320/243.ttf +0 -0
- PgsFile/models/fonts//321/205/320/225/320/270/321/206/320/246/342/226/221/321/207/320/261/320/274/321/207/320/274/320/244/321/206/320/265/342/225/226/321/204/342/225/243/320/266/321/207/320/276/320/220.ttf +0 -0
- PgsFile/models/fonts//321/205/320/225/320/270/321/206/320/246/342/226/221/321/207/320/261/320/274/321/207/320/274/320/244/321/210/320/261/320/234/321/204/342/225/243/320/266/321/207/320/276/320/220.ttf +0 -0
- PgsFile/models/fonts//321/205/320/235/320/252/321/206/342/224/244/320/233/321/210/320/261/320/234/321/204/342/225/243/320/2663500.TTF +0 -0
- PgsFile/models/fonts//321/211/320/251/320/226/321/206/320/257/320/274/321/204/342/225/243/320/233/321/210/320/261/320/234/321/204/342/225/243/320/266/321/205/320/275/320/247/321/204/342/225/234/320/243.ttf +0 -0
- PgsFile/models/model_reviews2.2.bin +0 -0
- PgsFile/models/model_reviews_ReadMe.txt +134 -0
- PgsFile-0.2.5.dist-info/METADATA +41 -0
- {PgsFile-0.2.3.dist-info → PgsFile-0.2.5.dist-info}/RECORD +57 -7
- PgsFile-0.2.3.dist-info/METADATA +0 -79
- {PgsFile-0.2.3.dist-info → PgsFile-0.2.5.dist-info}/LICENSE +0 -0
- {PgsFile-0.2.3.dist-info → PgsFile-0.2.5.dist-info}/WHEEL +0 -0
- {PgsFile-0.2.3.dist-info → PgsFile-0.2.5.dist-info}/top_level.txt +0 -0
PgsFile/__init__.py
CHANGED
@@ -7,6 +7,7 @@ from .PgsFile import headers, encode_chinese_keyword_for_url
 from .PgsFile import install_package, uninstall_package
 from .PgsFile import run_script, run_command
 from .PgsFile import get_library_location
+from .PgsFile import conda_mirror_commands
 
 # 3. Text data retrieval
 from .PgsFile import get_data_text, get_data_lines, get_json_lines, get_tsv_lines
@@ -16,14 +17,18 @@ from .PgsFile import get_data_table_url, get_data_table_html_string
 
 # 4. Text data storage
 from .PgsFile import write_to_txt, write_to_excel, write_to_json, write_to_json_lines, append_dict_to_json, save_dict_to_excel
+from .PgsFile import write_to_excel_normal
 
 # 5. File/folder process
 from .PgsFile import FilePath, FileName, DirList
-from .PgsFile import get_subfolder_path
+from .PgsFile import get_subfolder_path, get_full_path
 from .PgsFile import makedirec, makefile
 from .PgsFile import source_path, next_folder_names, get_directory_tree_with_meta, find_txt_files_with_keyword
 from .PgsFile import remove_empty_folders, remove_empty_txts, remove_empty_lines, remove_empty_last_line, move_file, copy_file
 from .PgsFile import concatenate_excel_files
+from .PgsFile import set_permanent_environment_variable
+from .PgsFile import delete_permanent_environment_variable
+from .PgsFile import get_env_variable
 
 # 6. Data cleaning
 from .PgsFile import BigPunctuation, StopTags, Special, yhd
@@ -32,18 +37,23 @@ from .PgsFile import nltk_en_tags, nltk_tag_mapping, thulac_tags, ICTCLAS2008, L
 from .PgsFile import check_contain_chinese, check_contain_number
 from .PgsFile import replace_chinese_punctuation_with_english
 from .PgsFile import replace_english_punctuation_with_chinese
-from .PgsFile import clean_list, clean_text_with_abbreviations
+from .PgsFile import clean_list, clean_text, clean_text_with_abbreviations, clean_line_with_abbreviations
 from .PgsFile import extract_chinese_punctuation, generate_password, sort_strings_with_embedded_numbers
 
 # 7. NLP (natural language processing)
 from .PgsFile import strQ2B_raw, strQ2B_words
 from .PgsFile import ngrams, bigrams, trigrams, everygrams, compute_similarity
 from .PgsFile import word_list, batch_word_list
-from .PgsFile import cs, cs1, sent_tokenize, word_tokenize
+from .PgsFile import cs, cs1, sent_tokenize, word_tokenize, word_tokenize2
 
 # 8. Maths
 from .PgsFile import len_rows, check_empty_cells
 from .PgsFile import format_float, decimal_to_percent, Percentage
 from .PgsFile import get_text_length_kb, extract_numbers
 
+# 9. Visualization
+from .PgsFile import replace_white_with_transparency
+from .PgsFile import simhei_default_font_path_MacOS_Windows
+from .PgsFile import get_font_path
+
 name = "PgsFile"
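The import changes in the `PgsFile/__init__.py` hunks above can be summarized mechanically. The following is a minimal sketch, assuming the input is a list of unified-diff lines prefixed with `+` or `-`; the helpers `imported_names` and `export_changes` are hypothetical names introduced here for illustration and are not part of PgsFile.

```python
import re

# Matches the relative import style used in PgsFile/__init__.py.
IMPORT_RE = re.compile(r"from \.PgsFile import (.+)")


def imported_names(diff_lines, marker):
    """Collect names imported on diff lines that start with `marker` ('+' or '-')."""
    names = set()
    for line in diff_lines:
        if not line.startswith(marker):
            continue
        match = IMPORT_RE.match(line[1:].strip())
        if match:
            names.update(part.strip() for part in match.group(1).split(","))
    return names


def export_changes(diff_lines):
    """Return (newly exported names, no-longer exported names), both sorted."""
    added = imported_names(diff_lines, "+")
    removed = imported_names(diff_lines, "-")
    return sorted(added - removed), sorted(removed - added)


# A few changed lines copied from the diff above.
sample = [
    "-from .PgsFile import get_subfolder_path",
    "+from .PgsFile import get_subfolder_path, get_full_path",
    "-from .PgsFile import cs, cs1, sent_tokenize, word_tokenize",
    "+from .PgsFile import cs, cs1, sent_tokenize, word_tokenize, word_tokenize2",
    "+from .PgsFile import get_env_variable",
]
new_names, dropped_names = export_changes(sample)
print(new_names)
print(dropped_names)
```

Names that appear on both a removed and an added line (for example `get_subfolder_path`) cancel out, so only genuinely new or dropped exports are reported.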
Binary file
Binary file
Binary file
Binary file
@@ -0,0 +1,134 @@
+model_1.0.bin ['samples: 30', 'precision: 0.7666666666666667', 'recall: 0.696969696969697', 'F1: 0.7301587301587302']
+model_1.2.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.7575757575757576', 'F1: 0.7936507936507938']
+model_1.4.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.7575757575757576', 'F1: 0.7936507936507938']
+model_1.5.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.7575757575757576', 'F1: 0.7936507936507938']
+model_1.6.bin ['samples: 30', 'precision: 0.9', 'recall: 0.8181818181818182', 'F1: 0.8571428571428572']
+model_1.7.bin ['samples: 30', 'precision: 0.8666666666666667', 'recall: 0.7878787878787878', 'F1: 0.8253968253968254']
+model_1.8.bin ['samples: 30', 'precision: 0.8', 'recall: 0.7272727272727273', 'F1: 0.761904761904762']
+model_1.9.bin ['samples: 30', 'precision: 0.8', 'recall: 0.7272727272727273', 'F1: 0.761904761904762']
+model_2.0.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.7575757575757576', 'F1: 0.7936507936507938']
+model_2.1.bin ['samples: 30', 'precision: 0.8666666666666667', 'recall: 0.7878787878787878', 'F1: 0.8253968253968254']
+
+
+model_1.0.bin ['samples: 292', 'precision: 0.5787671232876712', 'recall: 0.48011363636363635', 'F1: 0.5248447204968945']
+model_1.2.bin ['samples: 292', 'precision: 0.636986301369863', 'recall: 0.5284090909090909', 'F1: 0.577639751552795']
+model_1.4.bin ['samples: 292', 'precision: 0.7191780821917808', 'recall: 0.5965909090909091', 'F1: 0.6521739130434782']
+model_1.5.bin ['samples: 292', 'precision: 0.6815068493150684', 'recall: 0.5653409090909091', 'F1: 0.6180124223602484']
+model_1.6.bin ['samples: 292', 'precision: 0.726027397260274', 'recall: 0.6022727272727273', 'F1: 0.6583850931677019']
+model_1.7.bin ['samples: 292', 'precision: 0.7363013698630136', 'recall: 0.6107954545454546', 'F1: 0.6677018633540373']
+model_1.8.bin ['samples: 292', 'precision: 0.7431506849315068', 'recall: 0.6164772727272727', 'F1: 0.6739130434782609']
+model_1.9.bin ['samples: 292', 'precision: 0.7773972602739726', 'recall: 0.6448863636363636', 'F1: 0.7049689440993789']
+model_2.0.bin ['samples: 292', 'precision: 0.7636986301369864', 'recall: 0.6335227272727273', 'F1: 0.6925465838509317']
+model_2.1.bin ['samples: 292', 'precision: 0.7671232876712328', 'recall: 0.6363636363636364', 'F1: 0.6956521739130435']
+
+
+model_1.0.bin ['samples: 322', 'precision: 0.5962732919254659', 'recall: 0.4987012987012987', 'F1: 0.5431400282885432']
+model_1.2.bin ['samples: 322', 'precision: 0.65527950310559', 'recall: 0.548051948051948', 'F1: 0.5968882602545968']
+model_1.4.bin ['samples: 322', 'precision: 0.7267080745341615', 'recall: 0.6077922077922078', 'F1: 0.6619519094766619']
+model_1.5.bin ['samples: 322', 'precision: 0.6956521739130435', 'recall: 0.5818181818181818', 'F1: 0.6336633663366337']
+model_1.6.bin ['samples: 322', 'precision: 0.7422360248447205', 'recall: 0.6207792207792208', 'F1: 0.6760961810466761']
+model_1.7.bin ['samples: 322', 'precision: 0.7484472049689441', 'recall: 0.625974025974026', 'F1: 0.6817538896746819']
+model_1.8.bin ['samples: 322', 'precision: 0.7484472049689441', 'recall: 0.625974025974026', 'F1: 0.6817538896746819']
+model_1.9.bin ['samples: 322', 'precision: 0.7795031055900621', 'recall: 0.6519480519480519', 'F1: 0.71004243281471']
+model_2.0.bin ['samples: 322', 'precision: 0.7701863354037267', 'recall: 0.6441558441558441', 'F1: 0.7015558698727016']
+model_2.1.bin ['samples: 322', 'precision: 0.7763975155279503', 'recall: 0.6493506493506493', 'F1: 0.7072135785007072']
+
+
+=========================================================Non-duplicate validation set==================================================
+
+model_1.2.bin ['samples: 303', 'precision: 0.6435643564356436', 'recall: 0.5342465753424658', 'F1: 0.5838323353293414']
+model_1.4.bin ['samples: 303', 'precision: 0.7161716171617162', 'recall: 0.5945205479452055', 'F1: 0.6497005988023953']
+model_1.5.bin ['samples: 303', 'precision: 0.6864686468646864', 'recall: 0.5698630136986301', 'F1: 0.6227544910179641']
+model_1.6.bin ['samples: 303', 'precision: 0.7326732673267327', 'recall: 0.6082191780821918', 'F1: 0.6646706586826348']
+model_1.7.bin ['samples: 303', 'precision: 0.7425742574257426', 'recall: 0.6164383561643836', 'F1: 0.6736526946107784']
+model_1.8.bin ['samples: 303', 'precision: 0.7392739273927392', 'recall: 0.6136986301369863', 'F1: 0.6706586826347306']
+model_1.9.bin ['samples: 303', 'precision: 0.7722772277227723', 'recall: 0.6410958904109589', 'F1: 0.7005988023952096']
+model_2.0.bin ['samples: 303', 'precision: 0.759075907590759', 'recall: 0.6301369863013698', 'F1: 0.688622754491018']
+model_2.1.bin ['samples: 303', 'precision: 0.7623762376237624', 'recall: 0.6328767123287671', 'F1: 0.6916167664670658']
+model_2.2.bin ['samples: 303', 'precision: 0.7458745874587459', 'recall: 0.6191780821917808', 'F1: 0.6766467065868264']
+
+=================================================Non-duplicate validation set + 5-point labels==================================================
+
+model_1.2.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.78125', 'F1: 0.8064516129032259']
+model_1.4.bin ['samples: 30', 'precision: 0.8666666666666667', 'recall: 0.8125', 'F1: 0.8387096774193549']
+model_1.5.bin ['samples: 30', 'precision: 0.9', 'recall: 0.84375', 'F1: 0.870967741935484']
+model_1.6.bin ['samples: 30', 'precision: 0.9', 'recall: 0.84375', 'F1: 0.870967741935484']
+model_1.7.bin ['samples: 30', 'precision: 0.8', 'recall: 0.75', 'F1: 0.7741935483870969']
+model_1.8.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.78125', 'F1: 0.8064516129032259']
+model_1.9.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.78125', 'F1: 0.8064516129032259']
+model_2.0.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.78125', 'F1: 0.8064516129032259']
+model_2.1.bin ['samples: 30', 'precision: 0.9', 'recall: 0.84375', 'F1: 0.870967741935484']
+model_2.2.bin ['samples: 30', 'precision: 0.9', 'recall: 0.84375', 'F1: 0.870967741935484']
+
+
+model_1.2.bin ['samples: 302', 'precision: 0.6721854304635762', 'recall: 0.6444444444444445', 'F1: 0.6580226904376014']
+model_1.4.bin ['samples: 302', 'precision: 0.7019867549668874', 'recall: 0.6730158730158731', 'F1: 0.6871961102106969']
+model_1.5.bin ['samples: 302', 'precision: 0.7185430463576159', 'recall: 0.6888888888888889', 'F1: 0.7034035656401946']
+model_1.6.bin ['samples: 302', 'precision: 0.7086092715231788', 'recall: 0.6793650793650794', 'F1: 0.6936790923824959']
+model_1.7.bin ['samples: 302', 'precision: 0.7052980132450332', 'recall: 0.6761904761904762', 'F1: 0.6904376012965965']
+model_1.8.bin ['samples: 302', 'precision: 0.7317880794701986', 'recall: 0.7015873015873015', 'F1: 0.7163695299837927']
+model_1.9.bin ['samples: 302', 'precision: 0.7317880794701986', 'recall: 0.7015873015873015', 'F1: 0.7163695299837927']
+model_2.0.bin ['samples: 302', 'precision: 0.7417218543046358', 'recall: 0.7111111111111111', 'F1: 0.7260940032414911']
+model_2.1.bin ['samples: 302', 'precision: 0.7516556291390728', 'recall: 0.7206349206349206', 'F1: 0.7358184764991895']
+model_2.2.bin ['samples: 302', 'precision: 0.7582781456953642', 'recall: 0.726984126984127', 'F1: 0.7423014586709886']
+
+
+model_1.2.bin ['samples: 303', 'precision: 0.6732673267326733', 'recall: 0.6455696202531646', 'F1: 0.6591276252019386']
+model_1.4.bin ['samples: 303', 'precision: 0.7029702970297029', 'recall: 0.6740506329113924', 'F1: 0.6882067851373183']
+model_1.5.bin ['samples: 303', 'precision: 0.7194719471947195', 'recall: 0.689873417721519', 'F1: 0.7043618739903069']
+model_1.6.bin ['samples: 303', 'precision: 0.7095709570957096', 'recall: 0.680379746835443', 'F1: 0.6946688206785137']
+model_1.7.bin ['samples: 303', 'precision: 0.7062706270627063', 'recall: 0.6772151898734177', 'F1: 0.6914378029079159']
+model_1.8.bin ['samples: 303', 'precision: 0.7326732673267327', 'recall: 0.7025316455696202', 'F1: 0.7172859450726979']
+model_1.9.bin ['samples: 303', 'precision: 0.7326732673267327', 'recall: 0.7025316455696202', 'F1: 0.7172859450726979']
+model_2.0.bin ['samples: 303', 'precision: 0.7425742574257426', 'recall: 0.7120253164556962', 'F1: 0.7269789983844911']
+model_2.1.bin ['samples: 303', 'precision: 0.7524752475247525', 'recall: 0.7215189873417721', 'F1: 0.7366720516962842']
+model_2.2.bin ['samples: 303', 'precision: 0.759075907590759', 'recall: 0.7278481012658228', 'F1: 0.7431340872374799']
+
+
+model_1.2.bin ['samples: 425', 'precision: 0.6470588235294118', 'recall: 0.5456349206349206', 'F1: 0.5920344456404736']
+model_1.2.bin ['samples: 425', 'precision: 0.691764705882353', 'recall: 0.6621621621621622', 'F1: 0.6766398158803222']
+model_1.4.bin ['samples: 425', 'precision: 0.7129411764705882', 'recall: 0.6824324324324325', 'F1: 0.6973532796317606']
+model_1.5.bin ['samples: 425', 'precision: 0.7294117647058823', 'recall: 0.6981981981981982', 'F1: 0.713463751438435']
+model_1.6.bin ['samples: 425', 'precision: 0.7129411764705882', 'recall: 0.6824324324324325', 'F1: 0.6973532796317606']
+model_1.7.bin ['samples: 425', 'precision: 0.7105882352941176', 'recall: 0.6801801801801802', 'F1: 0.6950517836593786']
+model_1.8.bin ['samples: 425', 'precision: 0.7505882352941177', 'recall: 0.7184684684684685', 'F1: 0.7341772151898734']
+model_1.9.bin ['samples: 425', 'precision: 0.7529411764705882', 'recall: 0.7207207207207207', 'F1: 0.7364787111622554']
+model_2.0.bin ['samples: 425', 'precision: 0.7670588235294118', 'recall: 0.7342342342342343', 'F1: 0.7502876869965478']
+model_2.1.bin ['samples: 425', 'precision: 0.7717647058823529', 'recall: 0.7387387387387387', 'F1: 0.7548906789413118']
+model_2.2.bin ['samples: 425', 'precision: 0.7764705882352941', 'recall: 0.7432432432432432', 'F1: 0.7594936708860759']
+
+model_1.2.bin ['samples: 447', 'precision: 0.6935123042505593', 'recall: 0.6623931623931624', 'F1: 0.6775956284153005']
+model_1.4.bin ['samples: 447', 'precision: 0.7158836689038032', 'recall: 0.6837606837606838', 'F1: 0.6994535519125684']
+model_1.5.bin ['samples: 447', 'precision: 0.7337807606263982', 'recall: 0.7008547008547008', 'F1: 0.7169398907103826']
+model_1.6.bin ['samples: 447', 'precision: 0.7203579418344519', 'recall: 0.688034188034188', 'F1: 0.7038251366120218']
+model_1.7.bin ['samples: 447', 'precision: 0.7158836689038032', 'recall: 0.6837606837606838', 'F1: 0.6994535519125684']
+model_1.8.bin ['samples: 447', 'precision: 0.7539149888143176', 'recall: 0.7200854700854701', 'F1: 0.7366120218579234']
+model_1.9.bin ['samples: 447', 'precision: 0.7539149888143176', 'recall: 0.7200854700854701', 'F1: 0.7366120218579234']
+model_2.0.bin ['samples: 447', 'precision: 0.7695749440715883', 'recall: 0.7350427350427351', 'F1: 0.7519125683060108']
+model_2.1.bin ['samples: 447', 'precision: 0.7718120805369127', 'recall: 0.7371794871794872', 'F1: 0.7540983606557377']
+model_2.2.bin ['samples: 447', 'precision: 0.7785234899328859', 'recall: 0.7435897435897436', 'F1: 0.760655737704918']
+
+model_1.2.bin
+model_1.4.bin
+model_1.5.bin
+model_1.6.bin
+model_1.7.bin
+model_1.8.bin
+model_1.9.bin
+model_2.0.bin
+model_2.1.bin
+model_2.2.bin
+
+model_1.2.bin
+model_1.4.bin
+model_1.5.bin
+model_1.6.bin
+model_1.7.bin
+model_1.8.bin
+model_1.9.bin
+model_2.0.bin
+model_2.1.bin
+model_2.2.bin
+
+
+
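For the rows spot-checked, the F1 values in `model_reviews_ReadMe.txt` agree with the usual harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below parses one readme row and recomputes F1 as a sanity check. It assumes each row is a model file name followed by a Python-style list of `'key: value'` strings, as shown above; `parse_metrics_line` and `f1_score` are hypothetical helper names, not functions from PgsFile.

```python
import ast


def parse_metrics_line(line):
    """Split a readme row into (model_name, {metric: value}).

    Rows that contain only a model name yield an empty metrics dict.
    """
    parts = line.split(None, 1)
    if len(parts) < 2:
        return parts[0], {}
    name, rest = parts
    metrics = {}
    # The remainder is a list literal such as ['samples: 30', 'precision: 0.9', ...].
    for item in ast.literal_eval(rest.strip()):
        key, _, value = item.partition(": ")
        metrics[key] = float(value)
    return name, metrics


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


row = ("model_1.0.bin ['samples: 30', 'precision: 0.7666666666666667', "
       "'recall: 0.696969696969697', 'F1: 0.7301587301587302']")
name, metrics = parse_metrics_line(row)
recomputed = f1_score(metrics["precision"], metrics["recall"])
print(name, abs(recomputed - metrics["F1"]) < 1e-9)
```

The same check can be looped over every row to flag entries whose reported F1 does not match its precision and recall.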
@@ -0,0 +1,41 @@
+Metadata-Version: 2.1
+Name: PgsFile
+Version: 0.2.5
+Summary: This module streamlines Python package management, script execution, file handling, web scraping, multimedia downloads, data cleaning, and NLP tasks such as word tokenization and POS tagging. It also assists with generating word lists and plotting data, making these tasks more accessible and convenient for literary students. Whether you need to scrape data from websites, clean text, or analyze language, this module provides user-friendly tools to simplify your workflow.
+Home-page: https://mp.weixin.qq.com/s/12-KVLfaPszoZkCxuRd-nQ?token=1589547443&lang=zh_CN
+Author: Pan Guisheng
+Author-email: 895284504@qq.com
+License: Educational free
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: Free For Educational Use
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: chardet
+Requires-Dist: pandas
+Requires-Dist: python-docx
+Requires-Dist: pip
+Requires-Dist: requests
+Requires-Dist: fake-useragent
+Requires-Dist: lxml
+Requires-Dist: pimht
+Requires-Dist: pysbd
+Requires-Dist: nlpir-python
+Requires-Dist: pillow
+
+Purpose: This module is designed to make complex tasks accessible and convenient, even for beginners. By providing a unified set of tools, it simplifies the workflow for data collection, processing, and analysis. Whether you're scraping data from the web, cleaning text, or performing NLP tasks, this module ensures you can focus on your research without getting bogged down by technical challenges.
+
+Key Features:
+1. Web Scraping: Easily scrape data from websites and download multimedia content.
+2. Package Management: Install, uninstall, and manage Python packages with simple commands.
+3. Data Retrieval: Extract data from various file formats like text, JSON, TSV, Excel, and HTML (both online and offline).
+4. Data Storage: Write and append data to text files, Excel, JSON, and JSON lines.
+5. File and Folder Processing: Manage file paths, create directories, move or copy files, and search for files with specific keywords.
+6. Data Cleaning: Clean text, handle punctuation, remove stopwords, and prepare data for analysis.
+7. NLP: Perform tokenization, generate n-grams, and create word lists for text analysis.
+8. Math Operations: Format numbers, convert decimals to percentages, and validate data.
+9. Visualization: Process images (e.g., make white pixels transparent) and manage fonts for rendering text.
+
+Author: Pan Guisheng, a PhD student at the Graduate Institute of Interpretation and Translation of Shanghai International Studies University
+E-mail: 895284504@qq.com
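A wheel's `METADATA` file uses an email-style header block followed by a free-form description body, so the standard library's `email.parser` can read it. The following is a minimal sketch under that assumption; `read_core_metadata` is a hypothetical helper name, and the sample text is a shortened copy of the headers shown above.

```python
from email.parser import Parser


def read_core_metadata(text):
    """Extract a few core-metadata fields from METADATA text."""
    msg = Parser().parsestr(text)
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        "requires_python": msg["Requires-Python"],
        # Multi-use fields such as Requires-Dist need get_all().
        "requires_dist": msg.get_all("Requires-Dist") or [],
    }


sample = """\
Metadata-Version: 2.1
Name: PgsFile
Version: 0.2.5
Requires-Python: >=3.8
Requires-Dist: chardet
Requires-Dist: pandas

Purpose: free-form description body, not a header field
"""
info = read_core_metadata(sample)
print(info)
```

Everything after the first blank line is treated as the description body, which is why the trailing `Author:` and `E-mail:` lines in this package's `METADATA` are description text rather than header fields.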
@@ -1,5 +1,46 @@
-PgsFile/PgsFile.py,sha256=
-PgsFile/__init__.py,sha256=
+PgsFile/PgsFile.py,sha256=tOSOt3CJqkDp4t8_TwWUNMkqyXXrwTLHR5uNmTRAJsQ,104811
+PgsFile/__init__.py,sha256=J2yHIlsR26lD7Si1ZVWJjYqOmy8eb5ygm0DRDxwWyhU,2880
+PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/HK-Press releases of the Financial Secretary Office (2007-2019).tsv,sha256=IpLGQQY5cXbFWmUPFEdzEPz8CXuCdR2DdZOhBxA7FWw,2035252
+PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Hong Kong bilingual court decisions (1997-2017).tsv,sha256=BMmPr5eYBIv06Wnfb8nOBrfIzpAl-LLoRk3R60dLxe0,5928126
+PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/HongKong-Legislation.tsv,sha256=PJjiJIKV9aEzE0tAcqRNRCrunyWGiuD3sbkwkD9hoqo,4460018
+PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Offering documents of financial products (updated as of October 2018).tsv,sha256=aoGw2XNahZ8K7B_PAi2Ca4l37xAKfo2xmTIMEGZGn8g,6361610
+PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Speeches delivered by SFC Executives (2006-2019).tsv,sha256=qsViJ3UbvmBLgUSTcbFKF55N5uYgNuIVuUFPkgJ3IP0,1315100
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2006.txt,sha256=4tpLYd28r2JLSpFvqoFtZs3KQaIsQomKi-mEUva7XuU,9817
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2008.txt,sha256=mABYTdKc_y5ZZVnFx16WBDJwM2Z0BU1DgtvksiTeRzU,8827
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2009.txt,sha256=rE5Ev7j3uQKLTkXWyYo4Har0bvXhtYEEvZK4JMppv-o,9193
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2010.txt,sha256=ihahkqoWwD-UsDWXmJ65VceBz7iEaMu4zayzsdeBnmY,9627
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2011.txt,sha256=0pw2kXh1RgFZyjR-A6McVLoU_0hwihuqfVJZwmLckiM,8551
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2012.txt,sha256=UJiQ6WKpOcl__E4npBVAgBSaYdRMTK9LInjLcf0mjK8,6717
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2013.txt,sha256=eYbA72SA1wTTsZ6tO02mFmLbji4_Yzqh44kEH8_D8Qo,7435
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2014.txt,sha256=0oPiksRAQjqgLfEdYcfmMIsA_i6NoLtGc6sFLy1rEDw,5706
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2015.txt,sha256=QQyMYL0LiCHpSUroa7FT5k4dko9qxs21ctJhEClecD4,10239
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2016.txt,sha256=Ysnu-wtaWwjonho7JzE_xnK_ziSsYlIpuGtl_h2hF4Q,10255
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2017.txt,sha256=hGRoKBuBYzpLeptXpAXSphnMC2sonh2XX7GA-1sTI8Y,11526
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2018.txt,sha256=j_AAJpsxyVDpmUJtYSVpVQaOfk6U3nUGfFoAqNV5kJQ,12752
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2019.txt,sha256=5929xIfazLnhQ0j4j3rQbfHfxXrsihIvJ6wRyDPiK9k,14064
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2020.txt,sha256=LUBRJ0_eaHDiCu6uLeYAKfTNJQUg05vdR1n1HBLaKEI,15030
+PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2021.txt,sha256=S1-9qwrs4M5G2YNo8vNC1t_8z3f_RNVjKaGxuQUtD70,11292
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100201_000150_en.txt,sha256=q0E5jn267NoLl9gunb9GzogIbE54F24qgaH-kGUon8w,3752
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100201_000150_zh.txt,sha256=pa-aahoIxS6dMkE8dA879u9ldHzL4EZCISsdwMAH68U,2878
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100213_000135_en.txt,sha256=emg69zBQ_ju9e4homDsMD7LK48-LizQ8pe3zkWlu-oc,4301
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100213_000135_zh.txt,sha256=RBMMP3Q4SDgX6hbP7lyMKv_ENYRQBxgZ5_HuQpteU4c,3394
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100215_000445_en.txt,sha256=D8HrrYGIy8SERtJx1RqNm2-txfd3ZhPRIAiw-s-QzQI,3641
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100215_000445_zh.txt,sha256=G_9xGBv-mRsSR1DFEqRDM1-9ybEXbyARmHoiHpHtm_o,2798
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000135_en.txt,sha256=vrm538nymNexLjkBeGx_9Y0unnpPgDfI3tN2FU9-L4E,3858
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000135_zh.txt,sha256=fCKOw2Z-EApI2Adl7gTPXS2Zys8xHxsn5cwdr9oJEFg,2880
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000205_en.txt,sha256=12dT0idBdBZF9yFKYn-P2KZ2TOagT7N_Nt-gLk8AUm0,3529
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000205_zh.txt,sha256=BwTnWaiMlSddL393ROSpeTmZ2ZhyfZiljA2JG8C3BvI,2768
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000548_en.txt,sha256=w9Hloxk9-JZOlNRUGEjVo2jatrx-AABnSiEHDeM_GJQ,3675
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000548_zh.txt,sha256=mEAjUaGa4WBnEwggXUKm_sLJiR0uUZqghw84fsxz0DY,2998
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100225_001011_en.txt,sha256=ZVXvK1DJpFlvbRyb2OQ4LRB8f579CW9H8rCkcaELdmM,3653
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100225_001011_zh.txt,sha256=tdipXKy-7U--_l01kyVFfsMmYiD_DKduuFt1dZlfpOY,3099
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000129_en.txt,sha256=8jdDSy9vMRJnaAbC3Au0EwROgQ1QbttEoxG0zQXXiwk,3474
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000129_zh.txt,sha256=eS-3OPr5YqgAJO9Xbw2ImBuj0NlEcpeWqoXO2Mu3mMo,3087
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000649_en.txt,sha256=meeVyJNSI1hMJye2se5fo43Mt409F0eoEsPltkdzQm0,4023
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000649_zh.txt,sha256=QzMJZObAXgVmIBZNUFxxV0cQuKawY6uqXOoqp-5AXHo,3126
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100301_000549_en.txt,sha256=jDwww5MYADdV-d-0c4b5rhUBx__egLc1utgxnKHXme8,3829
+PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100301_000549_zh.txt,sha256=W1eed2ch9yX37oDNy0hhj1ZYc91joMRzfV3pJuCkRQE,3250
+PgsFile/Corpora/Corpora/Parallel/Xi's Speech_CE_2021/Speech at a Ceremony Marking the Centenary of the CPC.txt,sha256=3suCjs2LF2_Endg2i_hc3GX1N8lTBORlqpMWEKsXFeM,54282
 PgsFile/Corpora/Idioms/English_Idioms_8774.txt,sha256=qlsP0yI_XGECBRiPZuLkGZpdasc77sWSKexANu7v8_M,175905
 PgsFile/Corpora/Monolingual/Chinese/People's Daily 20130605/Raw/00000000.txt,sha256=SLGGSMSb7Ff1RoBstsTW3yX2wNZpqEUchFNpcI-mrR4,1513
 PgsFile/Corpora/Monolingual/Chinese/People's Daily 20130605/Raw/00000001.txt,sha256=imOa6UoCOIZoPXT4_HNHgCUJtd4FTIdk2FZNHNBgJyg,3372
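Each `RECORD` row has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe Base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed, and the size is in bytes. The sketch below shows how such an entry could be parsed and checked against file contents. The helper names (`record_digest`, `parse_record_line`, `matches_record`) are hypothetical, and only `sha256` entries are handled.

```python
import base64
import csv
import hashlib


def record_digest(data: bytes) -> str:
    """URL-safe Base64 SHA-256 digest without '=' padding, as used in RECORD."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_record_line(line: str):
    """Split one RECORD row into (path, algorithm, digest, size).

    RECORD is CSV, so paths containing commas arrive quoted; csv handles that.
    The row for RECORD itself has empty hash and size fields (size -> None).
    """
    path, hash_field, size = next(csv.reader([line]))
    algorithm, _, digest = hash_field.partition("=")
    return path, algorithm, digest, int(size) if size else None


def matches_record(data: bytes, line: str) -> bool:
    """True when `data` has the hash and size that the RECORD row claims."""
    _, algorithm, digest, size = parse_record_line(line)
    return algorithm == "sha256" and digest == record_digest(data) and size == len(data)


row = "PgsFile/__init__.py,sha256=J2yHIlsR26lD7Si1ZVWJjYqOmy8eb5ygm0DRDxwWyhU,2880"
print(parse_record_line(row))
```

To verify an installed file, read it in binary mode and pass the bytes together with its `RECORD` row to `matches_record`.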
@@ -2600,6 +2641,7 @@ PgsFile/Corpora/Stopwords/turkish.txt,sha256=uGUvjEm2GR8PuVY_JeHNxhD7cWlNlF7vc3V
 PgsFile/Corpora/Stopwords/ukrainian.txt,sha256=fEzWLTwnWJriILkO-5jSfE2SpqY-GPf_kR4zid3MFUI,4131
 PgsFile/Corpora/Stopwords/vietnamese.txt,sha256=88yRtVMaRSFqas1iGGa6kOGDCZTgtzRPmR3q9dHshdc,20485
 PgsFile/Corpora/Terminology/Chinese_Thought.json,sha256=CdkuF2wLaDC5V3sRefcU1RZwXm4-wTZ-Qfk8r7gsu8I,2301866
+PgsFile/models/NLPIR.user,sha256=DykLJdr8_cVHrdCnDJES1O5dgmnYqfaSO1_dtAVKYJk,3356
 PgsFile/models/czech.pickle,sha256=W6c9KTx9eVOVa88C82lexcHw1Sfyo8OAl_VZM5T6FpA,1265552
 PgsFile/models/danish.pickle,sha256=6il2CgqRl_UspZ54rq_FpvVdBSWPr32xcJsrnrMh7yA,1264725
 PgsFile/models/dutch.pickle,sha256=So4ms9aMRcOOWU0Z4tVndEe_3KpjbTsees_tDpJy1zw,742624
@@ -2611,6 +2653,8 @@ PgsFile/models/german.pickle,sha256=6rSX-ghUExMMj9D7E7kpEokwr-L2om6ocVyV33CI6Xw,
 PgsFile/models/greek.pickle,sha256=IXUqZ2L61c_kb7XEX62ahUhKDo6Bxn5q9vuXPPwn1nw,1953106
 PgsFile/models/italian.pickle,sha256=3LJxfXvl8m6GCpLgWs9psRI6X0UnzXommpq56eZoyAU,658331
 PgsFile/models/malayalam.pickle,sha256=H4z1isvbf0cqxAr_wTZjvkLa-0fBUDDBGt4ERMng5T0,221207
+PgsFile/models/model_reviews2.2.bin,sha256=D6uL8KZIxD0rfWjH0kYEb7z_HE4aTJXpj82HzsCOpuk,1943196
+PgsFile/models/model_reviews_ReadMe.txt,sha256=Q9uLJwudMmsTKfd11l1tOcIP8lwsemIwnAVJG_3SYjU,11433
 PgsFile/models/norwegian.pickle,sha256=5Kl_j5oDoDON10a8yJoK4PVK5DuDX6N9g-J54cp5T68,1259779
 PgsFile/models/polish.pickle,sha256=FhJ7bRCTNCej6Q-yDpvlPh-zcf95pzDBAwc07YC5DJI,2042451
 PgsFile/models/portuguese.pickle,sha256=uwG_fHmk6twheLvSCWZROaDks48tHET-8Jfek5VRQOA,649051
@@ -2619,8 +2663,14 @@ PgsFile/models/slovene.pickle,sha256=faxlAhKzeHs5mWwBvSCEEVST5vbsOQurYfdnUlsIuOo
 PgsFile/models/spanish.pickle,sha256=Jx3GAnxKrgVvcqm_q1ZFz2fhmL9PlyiVhE5A9ZiczcM,597831
 PgsFile/models/swedish.pickle,sha256=QNUOva1sqodxXy4wCxIX7JLELeIFpUPMSlaQO9LJrPo,1034496
 PgsFile/models/turkish.pickle,sha256=065H12UB0CdpiAnRLnUpLJw5KRBIhUM0KAL5Xbl2XMw,1225013
|
|
2622
|
-
PgsFile
|
|
2623
|
-
PgsFile
|
|
2624
|
-
PgsFile
|
|
2625
|
-
PgsFile
|
|
2626
|
-
PgsFile
|
|
2666
|
+
PgsFile/models/fonts/DejaVuSans.ttf,sha256=faGVp0xVvvmI0NSPlQi9XYSUJcF3Dbpde_xs6e2EiVQ,757076
|
|
2667
|
+
PgsFile/models/fonts/书体坊赵九江钢笔行书体.ttf,sha256=fTOv4FFMnYtN1zCZghJ6-P1pzznA5qqoujwpDFY63Ek,3140656
|
|
2668
|
+
PgsFile/models/fonts/全新硬笔楷书简.ttf,sha256=mPemGYMpgQxvFL1pFjjnyUMIprHzcoOaw8oeZQ4k1x0,2397296
|
|
2669
|
+
PgsFile/models/fonts/全新硬笔行书简.ttf,sha256=bUtbl71eK_ellp1z0tCmmR_P-JhqVFIpzeuRlrEBo9g,2611516
|
|
2670
|
+
PgsFile/models/fonts/博洋行书3500.TTF,sha256=VrgeHr8cgOL6JD05QyuD9ZSyw4J2aIVxKxW8zSajq6Q,4410732
|
|
2671
|
+
PgsFile/models/fonts/陆柬之行书字体.ttf,sha256=Zpd4Z7E9w-Qy74yklXHk4vM7HOtHuQgllvygxZZ1Hvs,1247288
|
|
2672
|
+
PgsFile-0.2.5.dist-info/LICENSE,sha256=cE5c-QToSkG1KTUsU8drQXz1vG0EbJWuU4ybHTRb5SE,1138
|
|
2673
|
+
PgsFile-0.2.5.dist-info/METADATA,sha256=v1GYkJVW4R4MqIl9DYkg0zjNgH-oU5qoKH-S5-qubok,2711
|
|
2674
|
+
PgsFile-0.2.5.dist-info/WHEEL,sha256=eOLhNAGa2EW3wWl_TU484h7q1UNgy0JXjjoqKoxAAQc,92
|
|
2675
|
+
PgsFile-0.2.5.dist-info/top_level.txt,sha256=028hCfwhF3UpfD6X0rwtWpXI1RKSTeZ1ALwagWaSmX8,8
|
|
2676
|
+
PgsFile-0.2.5.dist-info/RECORD,,
|
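Each RECORD row above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with trailing `=` padding removed; the RECORD file's own row leaves both fields empty. A minimal sketch of checking one row against a file's bytes follows. The helper names `record_digest`, `parse_record_line`, and `row_matches` are hypothetical and are not part of PgsFile or of any packaging tool.

```python
import base64
import csv
import hashlib
import io


def record_digest(data: bytes) -> str:
    """Return the RECORD-style digest: URL-safe base64 of SHA-256, padding stripped."""
    raw = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_record_line(line: str):
    """Split one RECORD row into (path, algorithm, digest, size).

    RECORD is CSV, so the csv module is used to cope with quoted paths.
    The RECORD file's own row has empty hash and size fields.
    """
    path, hash_field, size = next(csv.reader(io.StringIO(line)))
    algorithm, _, digest = hash_field.partition("=")
    return path, algorithm, digest, int(size) if size else None


def row_matches(line: str, data: bytes) -> bool:
    """Check that `data` has the size and SHA-256 digest recorded in `line`."""
    _, algorithm, digest, size = parse_record_line(line)
    if algorithm != "sha256" or size is None:
        return False  # this sketch only handles sha256 rows with a size
    return size == len(data) and digest == record_digest(data)


if __name__ == "__main__":
    row = "PgsFile/models/NLPIR.user,sha256=DykLJdr8_cVHrdCnDJES1O5dgmnYqfaSO1_dtAVKYJk,3356"
    print(parse_record_line(row))
```

To check an installed file, read it in binary mode and pass the bytes to `row_matches` together with its RECORD row.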
PgsFile-0.2.3.dist-info/METADATA
DELETED
|
@@ -1,79 +0,0 @@
|
|
|
1
|
-
Metadata-Version: 2.1
|
|
2
|
-
Name: PgsFile
|
|
3
|
-
Version: 0.2.3
|
|
4
|
-
Summary: This module aims to simplify Python package management, script execution, file handling, web scraping, multimedia download, data cleaning, NLP tasks like Chinese word tokenization and POS tagging, and word list generation for literary students, making it more accessible and convenient to use.
|
|
5
|
-
Home-page: https://mp.weixin.qq.com/s/12-KVLfaPszoZkCxuRd-nQ?token=1589547443&lang=zh_CN
|
|
6
|
-
Author: Pan Guisheng
|
|
7
|
-
Author-email: 895284504@qq.com
|
|
8
|
-
License: Educational free
|
|
9
|
-
Classifier: Programming Language :: Python :: 3
|
|
10
|
-
Classifier: License :: Free For Educational Use
|
|
11
|
-
Classifier: Operating System :: OS Independent
|
|
12
|
-
Requires-Python: >=3.8
|
|
13
|
-
Description-Content-Type: text/markdown
|
|
14
|
-
License-File: LICENSE
|
|
15
|
-
Requires-Dist: chardet
|
|
16
|
-
Requires-Dist: pandas
|
|
17
|
-
Requires-Dist: python-docx
|
|
18
|
-
Requires-Dist: pip
|
|
19
|
-
Requires-Dist: requests
|
|
20
|
-
Requires-Dist: fake-useragent
|
|
21
|
-
Requires-Dist: lxml
|
|
22
|
-
Requires-Dist: pimht
|
|
23
|
-
Requires-Dist: pysbd
|
|
24
|
-
Requires-Dist: nlpir-python
|
|
25
|
-
|
|
26
|
-
Purpose: This module aims to assist Python beginners, particularly instructors and students of foreign languages and literature, by providing a convenient way to manage Python packages, run Python scripts, and perform operations on various file types such as txt, xlsx, json, tsv, html, mhtml, and docx. It also includes functionality for data scraping, cleaning and generating word lists.
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
Function 1: Enables efficient data retrieval and storage in files with a single line of code.
|
|
30
|
-
|
|
31
|
-
Function 2: Facilitates retrieval of all absolute file paths and file names in any folder (including sub-folders) with a single line of code using "FilePath" and "FileName" functions.
|
|
32
|
-
|
|
33
|
-
Function 3: Simplifies creation of word lists and frequency sorting from a file or batch of files using "word_list" and "batch_word_list" functions in PgsFile.
|
|
34
|
-
|
|
35
|
-
Function 4: Pgs-Corpora is a comprehensive language resource included in this library, featuring a monolingual corpus of native and translational Chinese and native and non-native English, as well as a bi-directional parallel corpus of Chinese and English texts covering financial, legal, political, academic, and sports news topics. Additionally, the library includes a collection of 8774 English idioms, stopwords for 28 languages, and a termbank of Chinese thought and culture.
|
|
36
|
-
|
|
37
|
-
Function 5: This library provides support for common text cleaning tasks, such as removing empty text, empty lines, and folders containing empty text. It also offers functions for converting full-width characters to half-width characters and vice versa, as well as standardizing the format of Chinese and English punctuation. These features can help improve the quality and consistency of text data used in natural language processing tasks.
|
|
38
|
-
|
|
39
|
-
Function 6: It also manages Python package installations and uninstallations, and allows running scripts and commands in Python interactive command lines instead of Windows command prompt.
|
|
40
|
-
|
|
41
|
-
Function 7: Download audiovisual files like videos, images, and audio using audiovisual_downloader, which is extremely useful and efficient. Additionally, scrape newspaper data with PGScraper, a highly efficient tool for this purpose.
|
|
42
|
-
|
|
43
|
-
Table 1: The directory and size of Pgs-Corpora
|
|
44
|
-
├── Idioms (1, 171.78 KB)
|
|
45
|
-
├── Monolingual (2197, 63.65 MB)
|
|
46
|
-
│ ├── Chinese (456, 15.27 MB)
|
|
47
|
-
│ │ ├── People's Daily 20130605 (396, 1.38 MB)
|
|
48
|
-
│ │ │ ├── Raw (132, 261.73 KB)
|
|
49
|
-
│ │ │ ├── Seg_only (132, 471.47 KB)
|
|
50
|
-
│ │ │ └── Tagged (132, 675.30 KB)
|
|
51
|
-
│ │ └── Translational Fictions (60, 13.89 MB)
|
|
52
|
-
│ └── English (1741, 48.38 MB)
|
|
53
|
-
│ ├── Native (65, 44.14 MB)
|
|
54
|
-
│ │ ├── A Short Collection of British Fiction (27, 33.90 MB)
|
|
55
|
-
│ │ └── Preschoolers- and Teenagers-oriented Texts in English (36, 10.24 MB)
|
|
56
|
-
│ ├── Non-native (1675, 3.63 MB)
|
|
57
|
-
│ │ └── Shanghai Daily (1675, 3.63 MB)
|
|
58
|
-
│ │ └── Business_2019 (1675, 3.63 MB)
|
|
59
|
-
│ │ ├── 2019-01-01 (1, 3.35 KB)
|
|
60
|
-
│ │ ├── 2019-01-02 (1, 3.65 KB)
|
|
61
|
-
│ │ ├── 2019-01-03 (7, 10.90 KB)
|
|
62
|
-
│ │ ├── 2019-01-04 (5, 9.63 KB)
|
|
63
|
-
│ │ └── 2019-01-07 (4, 9.50 KB)
|
|
64
|
-
│ │ └── ... (and 245 more directories)
|
|
65
|
-
│ └── Translational (1, 622.57 KB)
|
|
66
|
-
├── Parallel (371, 24.67 MB)
|
|
67
|
-
│ ├── HK Financial and Legal EC Parallel Corpora (5, 19.17 MB)
|
|
68
|
-
│ ├── New Year Address_CE_2006-2021 (15, 147.49 KB)
|
|
69
|
-
│ ├── Sports News_CE_2010 (20, 66.42 KB)
|
|
70
|
-
│ ├── TED_EC_2017-2020 (330, 5.24 MB)
|
|
71
|
-
│ └── Xi's Speech_CE_2021 (1, 53.01 KB)
|
|
72
|
-
├── Stopwords (28, 88.09 KB)
|
|
73
|
-
└── Terminology (1, 2.20 MB)
|
|
74
|
-
|
|
75
|
-
...
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
Author: Pan Guisheng, a PhD student at the Graduate Institute of Interpretation and Translation of Shanghai International Studies University
|
|
79
|
-
E-mail: 895284504@qq.com
|
|
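The deleted METADATA file above uses the email-style header layout common to Python package metadata: `Key: value` header lines, a blank line, then the long description as the body. A minimal sketch of reading such a file with the standard library follows. `SAMPLE` is an abbreviated excerpt of the fields shown above, not the full file, and `summarize_metadata` is a hypothetical helper name.

```python
from email.parser import Parser

# Abbreviated excerpt of the 0.2.3 METADATA shown in the diff (not the full file).
SAMPLE = """\
Metadata-Version: 2.1
Name: PgsFile
Version: 0.2.3
Requires-Python: >=3.8
Requires-Dist: chardet
Requires-Dist: pandas
Requires-Dist: nlpir-python

Purpose: This module aims to assist Python beginners.
"""


def summarize_metadata(text: str) -> dict:
    """Parse METADATA text into a small dict of commonly used fields."""
    msg = Parser().parsestr(text)
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        "requires_python": msg["Requires-Python"],
        # Requires-Dist may repeat, so collect every occurrence.
        "requires_dist": msg.get_all("Requires-Dist") or [],
        # Everything after the first blank line is the description body.
        "description": msg.get_payload().strip(),
    }


if __name__ == "__main__":
    for key, value in summarize_metadata(SAMPLE).items():
        print(f"{key}: {value}")
```

Because the description follows the blank line, its `Purpose:` line is read as body text rather than as a header.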
File without changes
|
|