PgsFile 0.2.3__py3-none-any.whl → 0.2.5__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of PgsFile might be problematic.

Files changed (58)
  1. PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/HK-Press releases of the Financial Secretary Office (2007-2019).tsv +7348 -0
  2. PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Hong Kong bilingual court decisions (1997-2017).tsv +20000 -0
  3. PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/HongKong-Legislation.tsv +20000 -0
  4. PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Offering documents of financial products (updated as of October 2018).tsv +20000 -0
  5. PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Speeches delivered by SFC Executives (2006-2019).tsv +4680 -0
  6. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2006.txt +46 -0
  7. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2008.txt +48 -0
  8. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2009.txt +42 -0
  9. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2010.txt +42 -0
  10. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2011.txt +38 -0
  11. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2012.txt +28 -0
  12. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2013.txt +42 -0
  13. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2014.txt +68 -0
  14. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2015.txt +106 -0
  15. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2016.txt +82 -0
  16. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2017.txt +90 -0
  17. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2018.txt +136 -0
  18. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2019.txt +112 -0
  19. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2020.txt +124 -0
  20. PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2021.txt +94 -0
  21. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100201_000150_en.txt +6 -0
  22. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100201_000150_zh.txt +6 -0
  23. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100213_000135_en.txt +17 -0
  24. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100213_000135_zh.txt +17 -0
  25. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100215_000445_en.txt +10 -0
  26. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100215_000445_zh.txt +10 -0
  27. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000135_en.txt +12 -0
  28. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000135_zh.txt +12 -0
  29. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000205_en.txt +5 -0
  30. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000205_zh.txt +5 -0
  31. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000548_en.txt +9 -0
  32. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000548_zh.txt +9 -0
  33. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100225_001011_en.txt +8 -0
  34. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100225_001011_zh.txt +8 -0
  35. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000129_en.txt +8 -0
  36. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000129_zh.txt +8 -0
  37. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000649_en.txt +13 -0
  38. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000649_zh.txt +13 -0
  39. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100301_000549_en.txt +8 -0
  40. PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100301_000549_zh.txt +8 -0
  41. PgsFile/Corpora/Corpora/Parallel/Xi's Speech_CE_2021/Speech at a Ceremony Marking the Centenary of the CPC.txt +144 -0
  42. PgsFile/PgsFile.py +516 -33
  43. PgsFile/__init__.py +13 -3
  44. PgsFile/models/NLPIR.user +0 -0
  45. PgsFile/models/fonts/DejaVuSans.ttf +0 -0
  46. PgsFile/models/fonts/书体坊赵九江钢笔行书体.ttf +0 -0
  47. PgsFile/models/fonts/全新硬笔楷书简.ttf +0 -0
  48. PgsFile/models/fonts/全新硬笔行书简.ttf +0 -0
  49. PgsFile/models/fonts/博洋行书3500.TTF +0 -0
  50. PgsFile/models/fonts/陆柬之行书字体.ttf +0 -0
  51. PgsFile/models/model_reviews2.2.bin +0 -0
  52. PgsFile/models/model_reviews_ReadMe.txt +134 -0
  53. PgsFile-0.2.5.dist-info/METADATA +41 -0
  54. {PgsFile-0.2.3.dist-info → PgsFile-0.2.5.dist-info}/RECORD +57 -7
  55. PgsFile-0.2.3.dist-info/METADATA +0 -79
  56. {PgsFile-0.2.3.dist-info → PgsFile-0.2.5.dist-info}/LICENSE +0 -0
  57. {PgsFile-0.2.3.dist-info → PgsFile-0.2.5.dist-info}/WHEEL +0 -0
  58. {PgsFile-0.2.3.dist-info → PgsFile-0.2.5.dist-info}/top_level.txt +0 -0
PgsFile/__init__.py CHANGED
@@ -7,6 +7,7 @@ from .PgsFile import headers, encode_chinese_keyword_for_url
  from .PgsFile import install_package, uninstall_package
  from .PgsFile import run_script, run_command
  from .PgsFile import get_library_location
+ from .PgsFile import conda_mirror_commands

  # 3. Text data retrieval
  from .PgsFile import get_data_text, get_data_lines, get_json_lines, get_tsv_lines
@@ -16,14 +17,18 @@ from .PgsFile import get_data_table_url, get_data_table_html_string

  # 4. Text data storage
  from .PgsFile import write_to_txt, write_to_excel, write_to_json, write_to_json_lines, append_dict_to_json, save_dict_to_excel
+ from .PgsFile import write_to_excel_normal

  # 5. File/folder process
  from .PgsFile import FilePath, FileName, DirList
- from .PgsFile import get_subfolder_path
+ from .PgsFile import get_subfolder_path, get_full_path
  from .PgsFile import makedirec, makefile
  from .PgsFile import source_path, next_folder_names, get_directory_tree_with_meta, find_txt_files_with_keyword
  from .PgsFile import remove_empty_folders, remove_empty_txts, remove_empty_lines, remove_empty_last_line, move_file, copy_file
  from .PgsFile import concatenate_excel_files
+ from .PgsFile import set_permanent_environment_variable
+ from .PgsFile import delete_permanent_environment_variable
+ from .PgsFile import get_env_variable

  # 6. Data cleaning
  from .PgsFile import BigPunctuation, StopTags, Special, yhd
@@ -32,18 +37,23 @@ from .PgsFile import nltk_en_tags, nltk_tag_mapping, thulac_tags, ICTCLAS2008, L
  from .PgsFile import check_contain_chinese, check_contain_number
  from .PgsFile import replace_chinese_punctuation_with_english
  from .PgsFile import replace_english_punctuation_with_chinese
- from .PgsFile import clean_list, clean_text_with_abbreviations
+ from .PgsFile import clean_list, clean_text, clean_text_with_abbreviations, clean_line_with_abbreviations
  from .PgsFile import extract_chinese_punctuation, generate_password, sort_strings_with_embedded_numbers

  # 7. NLP (natural language processing)
  from .PgsFile import strQ2B_raw, strQ2B_words
  from .PgsFile import ngrams, bigrams, trigrams, everygrams, compute_similarity
  from .PgsFile import word_list, batch_word_list
- from .PgsFile import cs, cs1, sent_tokenize, word_tokenize
+ from .PgsFile import cs, cs1, sent_tokenize, word_tokenize, word_tokenize2

  # 8. Maths
  from .PgsFile import len_rows, check_empty_cells
  from .PgsFile import format_float, decimal_to_percent, Percentage
  from .PgsFile import get_text_length_kb, extract_numbers

+ # 9. Visualization
+ from .PgsFile import replace_white_with_transparency
+ from .PgsFile import simhei_default_font_path_MacOS_Windows
+ from .PgsFile import get_font_path
+
  name = "PgsFile"
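The `PgsFile/__init__.py` hunks add twelve names to the package's public imports. A minimal sketch for checking whether an installed copy exposes them, assuming only the names that appear in the diff; the helper `missing_exports` is hypothetical and not part of PgsFile, and nothing here assumes how the new functions are called.

```python
import importlib

# Names newly imported in PgsFile/__init__.py between 0.2.3 and 0.2.5, per the diff.
NEW_EXPORTS = [
    "conda_mirror_commands",
    "write_to_excel_normal",
    "get_full_path",
    "set_permanent_environment_variable",
    "delete_permanent_environment_variable",
    "get_env_variable",
    "clean_text",
    "clean_line_with_abbreviations",
    "word_tokenize2",
    "replace_white_with_transparency",
    "simhei_default_font_path_MacOS_Windows",
    "get_font_path",
]


def missing_exports(module_name, names):
    """Return the names `module_name` does not expose, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return [name for name in names if not hasattr(module, name)]
```

Calling `missing_exports("PgsFile", NEW_EXPORTS)` would return an empty list for a 0.2.5 install that matches this diff. Importing a package executes its code, so for a release flagged as potentially problematic this is best run in an isolated environment.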
Binary file
Binary file
Binary file
PgsFile/models/model_reviews_ReadMe.txt ADDED
@@ -0,0 +1,134 @@
+ model_1.0.bin ['samples: 30', 'precision: 0.7666666666666667', 'recall: 0.696969696969697', 'F1: 0.7301587301587302']
+ model_1.2.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.7575757575757576', 'F1: 0.7936507936507938']
+ model_1.4.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.7575757575757576', 'F1: 0.7936507936507938']
+ model_1.5.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.7575757575757576', 'F1: 0.7936507936507938']
+ model_1.6.bin ['samples: 30', 'precision: 0.9', 'recall: 0.8181818181818182', 'F1: 0.8571428571428572']
+ model_1.7.bin ['samples: 30', 'precision: 0.8666666666666667', 'recall: 0.7878787878787878', 'F1: 0.8253968253968254']
+ model_1.8.bin ['samples: 30', 'precision: 0.8', 'recall: 0.7272727272727273', 'F1: 0.761904761904762']
+ model_1.9.bin ['samples: 30', 'precision: 0.8', 'recall: 0.7272727272727273', 'F1: 0.761904761904762']
+ model_2.0.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.7575757575757576', 'F1: 0.7936507936507938']
+ model_2.1.bin ['samples: 30', 'precision: 0.8666666666666667', 'recall: 0.7878787878787878', 'F1: 0.8253968253968254']
+
+
+ model_1.0.bin ['samples: 292', 'precision: 0.5787671232876712', 'recall: 0.48011363636363635', 'F1: 0.5248447204968945']
+ model_1.2.bin ['samples: 292', 'precision: 0.636986301369863', 'recall: 0.5284090909090909', 'F1: 0.577639751552795']
+ model_1.4.bin ['samples: 292', 'precision: 0.7191780821917808', 'recall: 0.5965909090909091', 'F1: 0.6521739130434782']
+ model_1.5.bin ['samples: 292', 'precision: 0.6815068493150684', 'recall: 0.5653409090909091', 'F1: 0.6180124223602484']
+ model_1.6.bin ['samples: 292', 'precision: 0.726027397260274', 'recall: 0.6022727272727273', 'F1: 0.6583850931677019']
+ model_1.7.bin ['samples: 292', 'precision: 0.7363013698630136', 'recall: 0.6107954545454546', 'F1: 0.6677018633540373']
+ model_1.8.bin ['samples: 292', 'precision: 0.7431506849315068', 'recall: 0.6164772727272727', 'F1: 0.6739130434782609']
+ model_1.9.bin ['samples: 292', 'precision: 0.7773972602739726', 'recall: 0.6448863636363636', 'F1: 0.7049689440993789']
+ model_2.0.bin ['samples: 292', 'precision: 0.7636986301369864', 'recall: 0.6335227272727273', 'F1: 0.6925465838509317']
+ model_2.1.bin ['samples: 292', 'precision: 0.7671232876712328', 'recall: 0.6363636363636364', 'F1: 0.6956521739130435']
+
+
+ model_1.0.bin ['samples: 322', 'precision: 0.5962732919254659', 'recall: 0.4987012987012987', 'F1: 0.5431400282885432']
+ model_1.2.bin ['samples: 322', 'precision: 0.65527950310559', 'recall: 0.548051948051948', 'F1: 0.5968882602545968']
+ model_1.4.bin ['samples: 322', 'precision: 0.7267080745341615', 'recall: 0.6077922077922078', 'F1: 0.6619519094766619']
+ model_1.5.bin ['samples: 322', 'precision: 0.6956521739130435', 'recall: 0.5818181818181818', 'F1: 0.6336633663366337']
+ model_1.6.bin ['samples: 322', 'precision: 0.7422360248447205', 'recall: 0.6207792207792208', 'F1: 0.6760961810466761']
+ model_1.7.bin ['samples: 322', 'precision: 0.7484472049689441', 'recall: 0.625974025974026', 'F1: 0.6817538896746819']
+ model_1.8.bin ['samples: 322', 'precision: 0.7484472049689441', 'recall: 0.625974025974026', 'F1: 0.6817538896746819']
+ model_1.9.bin ['samples: 322', 'precision: 0.7795031055900621', 'recall: 0.6519480519480519', 'F1: 0.71004243281471']
+ model_2.0.bin ['samples: 322', 'precision: 0.7701863354037267', 'recall: 0.6441558441558441', 'F1: 0.7015558698727016']
+ model_2.1.bin ['samples: 322', 'precision: 0.7763975155279503', 'recall: 0.6493506493506493', 'F1: 0.7072135785007072']
+
+
+ =========================================================Non-duplicate validation set==================================================
+
+ model_1.2.bin ['samples: 303', 'precision: 0.6435643564356436', 'recall: 0.5342465753424658', 'F1: 0.5838323353293414']
+ model_1.4.bin ['samples: 303', 'precision: 0.7161716171617162', 'recall: 0.5945205479452055', 'F1: 0.6497005988023953']
+ model_1.5.bin ['samples: 303', 'precision: 0.6864686468646864', 'recall: 0.5698630136986301', 'F1: 0.6227544910179641']
+ model_1.6.bin ['samples: 303', 'precision: 0.7326732673267327', 'recall: 0.6082191780821918', 'F1: 0.6646706586826348']
+ model_1.7.bin ['samples: 303', 'precision: 0.7425742574257426', 'recall: 0.6164383561643836', 'F1: 0.6736526946107784']
+ model_1.8.bin ['samples: 303', 'precision: 0.7392739273927392', 'recall: 0.6136986301369863', 'F1: 0.6706586826347306']
+ model_1.9.bin ['samples: 303', 'precision: 0.7722772277227723', 'recall: 0.6410958904109589', 'F1: 0.7005988023952096']
+ model_2.0.bin ['samples: 303', 'precision: 0.759075907590759', 'recall: 0.6301369863013698', 'F1: 0.688622754491018']
+ model_2.1.bin ['samples: 303', 'precision: 0.7623762376237624', 'recall: 0.6328767123287671', 'F1: 0.6916167664670658']
+ model_2.2.bin ['samples: 303', 'precision: 0.7458745874587459', 'recall: 0.6191780821917808', 'F1: 0.6766467065868264']
+
+ =================================================Non-duplicate validation set + 5-point labels==================================================
+
+ model_1.2.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.78125', 'F1: 0.8064516129032259']
+ model_1.4.bin ['samples: 30', 'precision: 0.8666666666666667', 'recall: 0.8125', 'F1: 0.8387096774193549']
+ model_1.5.bin ['samples: 30', 'precision: 0.9', 'recall: 0.84375', 'F1: 0.870967741935484']
+ model_1.6.bin ['samples: 30', 'precision: 0.9', 'recall: 0.84375', 'F1: 0.870967741935484']
+ model_1.7.bin ['samples: 30', 'precision: 0.8', 'recall: 0.75', 'F1: 0.7741935483870969']
+ model_1.8.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.78125', 'F1: 0.8064516129032259']
+ model_1.9.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.78125', 'F1: 0.8064516129032259']
+ model_2.0.bin ['samples: 30', 'precision: 0.8333333333333334', 'recall: 0.78125', 'F1: 0.8064516129032259']
+ model_2.1.bin ['samples: 30', 'precision: 0.9', 'recall: 0.84375', 'F1: 0.870967741935484']
+ model_2.2.bin ['samples: 30', 'precision: 0.9', 'recall: 0.84375', 'F1: 0.870967741935484']
+
+
+ model_1.2.bin ['samples: 302', 'precision: 0.6721854304635762', 'recall: 0.6444444444444445', 'F1: 0.6580226904376014']
+ model_1.4.bin ['samples: 302', 'precision: 0.7019867549668874', 'recall: 0.6730158730158731', 'F1: 0.6871961102106969']
+ model_1.5.bin ['samples: 302', 'precision: 0.7185430463576159', 'recall: 0.6888888888888889', 'F1: 0.7034035656401946']
+ model_1.6.bin ['samples: 302', 'precision: 0.7086092715231788', 'recall: 0.6793650793650794', 'F1: 0.6936790923824959']
+ model_1.7.bin ['samples: 302', 'precision: 0.7052980132450332', 'recall: 0.6761904761904762', 'F1: 0.6904376012965965']
+ model_1.8.bin ['samples: 302', 'precision: 0.7317880794701986', 'recall: 0.7015873015873015', 'F1: 0.7163695299837927']
+ model_1.9.bin ['samples: 302', 'precision: 0.7317880794701986', 'recall: 0.7015873015873015', 'F1: 0.7163695299837927']
+ model_2.0.bin ['samples: 302', 'precision: 0.7417218543046358', 'recall: 0.7111111111111111', 'F1: 0.7260940032414911']
+ model_2.1.bin ['samples: 302', 'precision: 0.7516556291390728', 'recall: 0.7206349206349206', 'F1: 0.7358184764991895']
+ model_2.2.bin ['samples: 302', 'precision: 0.7582781456953642', 'recall: 0.726984126984127', 'F1: 0.7423014586709886']
+
+
+ model_1.2.bin ['samples: 303', 'precision: 0.6732673267326733', 'recall: 0.6455696202531646', 'F1: 0.6591276252019386']
+ model_1.4.bin ['samples: 303', 'precision: 0.7029702970297029', 'recall: 0.6740506329113924', 'F1: 0.6882067851373183']
+ model_1.5.bin ['samples: 303', 'precision: 0.7194719471947195', 'recall: 0.689873417721519', 'F1: 0.7043618739903069']
+ model_1.6.bin ['samples: 303', 'precision: 0.7095709570957096', 'recall: 0.680379746835443', 'F1: 0.6946688206785137']
+ model_1.7.bin ['samples: 303', 'precision: 0.7062706270627063', 'recall: 0.6772151898734177', 'F1: 0.6914378029079159']
+ model_1.8.bin ['samples: 303', 'precision: 0.7326732673267327', 'recall: 0.7025316455696202', 'F1: 0.7172859450726979']
+ model_1.9.bin ['samples: 303', 'precision: 0.7326732673267327', 'recall: 0.7025316455696202', 'F1: 0.7172859450726979']
+ model_2.0.bin ['samples: 303', 'precision: 0.7425742574257426', 'recall: 0.7120253164556962', 'F1: 0.7269789983844911']
+ model_2.1.bin ['samples: 303', 'precision: 0.7524752475247525', 'recall: 0.7215189873417721', 'F1: 0.7366720516962842']
+ model_2.2.bin ['samples: 303', 'precision: 0.759075907590759', 'recall: 0.7278481012658228', 'F1: 0.7431340872374799']
+
+
+ model_1.2.bin ['samples: 425', 'precision: 0.6470588235294118', 'recall: 0.5456349206349206', 'F1: 0.5920344456404736']
+ model_1.2.bin ['samples: 425', 'precision: 0.691764705882353', 'recall: 0.6621621621621622', 'F1: 0.6766398158803222']
+ model_1.4.bin ['samples: 425', 'precision: 0.7129411764705882', 'recall: 0.6824324324324325', 'F1: 0.6973532796317606']
+ model_1.5.bin ['samples: 425', 'precision: 0.7294117647058823', 'recall: 0.6981981981981982', 'F1: 0.713463751438435']
+ model_1.6.bin ['samples: 425', 'precision: 0.7129411764705882', 'recall: 0.6824324324324325', 'F1: 0.6973532796317606']
+ model_1.7.bin ['samples: 425', 'precision: 0.7105882352941176', 'recall: 0.6801801801801802', 'F1: 0.6950517836593786']
+ model_1.8.bin ['samples: 425', 'precision: 0.7505882352941177', 'recall: 0.7184684684684685', 'F1: 0.7341772151898734']
+ model_1.9.bin ['samples: 425', 'precision: 0.7529411764705882', 'recall: 0.7207207207207207', 'F1: 0.7364787111622554']
+ model_2.0.bin ['samples: 425', 'precision: 0.7670588235294118', 'recall: 0.7342342342342343', 'F1: 0.7502876869965478']
+ model_2.1.bin ['samples: 425', 'precision: 0.7717647058823529', 'recall: 0.7387387387387387', 'F1: 0.7548906789413118']
+ model_2.2.bin ['samples: 425', 'precision: 0.7764705882352941', 'recall: 0.7432432432432432', 'F1: 0.7594936708860759']
+
+ model_1.2.bin ['samples: 447', 'precision: 0.6935123042505593', 'recall: 0.6623931623931624', 'F1: 0.6775956284153005']
+ model_1.4.bin ['samples: 447', 'precision: 0.7158836689038032', 'recall: 0.6837606837606838', 'F1: 0.6994535519125684']
+ model_1.5.bin ['samples: 447', 'precision: 0.7337807606263982', 'recall: 0.7008547008547008', 'F1: 0.7169398907103826']
+ model_1.6.bin ['samples: 447', 'precision: 0.7203579418344519', 'recall: 0.688034188034188', 'F1: 0.7038251366120218']
+ model_1.7.bin ['samples: 447', 'precision: 0.7158836689038032', 'recall: 0.6837606837606838', 'F1: 0.6994535519125684']
+ model_1.8.bin ['samples: 447', 'precision: 0.7539149888143176', 'recall: 0.7200854700854701', 'F1: 0.7366120218579234']
+ model_1.9.bin ['samples: 447', 'precision: 0.7539149888143176', 'recall: 0.7200854700854701', 'F1: 0.7366120218579234']
+ model_2.0.bin ['samples: 447', 'precision: 0.7695749440715883', 'recall: 0.7350427350427351', 'F1: 0.7519125683060108']
+ model_2.1.bin ['samples: 447', 'precision: 0.7718120805369127', 'recall: 0.7371794871794872', 'F1: 0.7540983606557377']
+ model_2.2.bin ['samples: 447', 'precision: 0.7785234899328859', 'recall: 0.7435897435897436', 'F1: 0.760655737704918']
+
+ model_1.2.bin
+ model_1.4.bin
+ model_1.5.bin
+ model_1.6.bin
+ model_1.7.bin
+ model_1.8.bin
+ model_1.9.bin
+ model_2.0.bin
+ model_2.1.bin
+ model_2.2.bin
+
+ model_1.2.bin
+ model_1.4.bin
+ model_1.5.bin
+ model_1.6.bin
+ model_1.7.bin
+ model_1.8.bin
+ model_1.9.bin
+ model_2.0.bin
+ model_2.1.bin
+ model_2.2.bin
+
+
+
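Each populated row of `model_reviews_ReadMe.txt` reports precision, recall, and F1. The reported F1 values are consistent with the usual harmonic mean of precision and recall; a minimal sketch that recomputes the first row (model_1.0.bin, 30 samples), where the helper name `f1_score` is ours and not something taken from PgsFile.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First row of the README: model_1.0.bin on 30 samples.
precision = 0.7666666666666667
recall = 0.696969696969697
reported_f1 = 0.7301587301587302

# The recomputed value agrees with the reported one up to floating-point rounding.
agrees = math.isclose(f1_score(precision, recall), reported_f1, rel_tol=1e-9)
```

The same check can be applied to any other row by substituting its three values.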
PgsFile-0.2.5.dist-info/METADATA ADDED
@@ -0,0 +1,41 @@
+ Metadata-Version: 2.1
+ Name: PgsFile
+ Version: 0.2.5
+ Summary: This module streamlines Python package management, script execution, file handling, web scraping, multimedia downloads, data cleaning, and NLP tasks such as word tokenization and POS tagging. It also assists with generating word lists and plotting data, making these tasks more accessible and convenient for literary students. Whether you need to scrape data from websites, clean text, or analyze language, this module provides user-friendly tools to simplify your workflow.
+ Home-page: https://mp.weixin.qq.com/s/12-KVLfaPszoZkCxuRd-nQ?token=1589547443&lang=zh_CN
+ Author: Pan Guisheng
+ Author-email: 895284504@qq.com
+ License: Educational free
+ Classifier: Programming Language :: Python :: 3
+ Classifier: License :: Free For Educational Use
+ Classifier: Operating System :: OS Independent
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: chardet
+ Requires-Dist: pandas
+ Requires-Dist: python-docx
+ Requires-Dist: pip
+ Requires-Dist: requests
+ Requires-Dist: fake-useragent
+ Requires-Dist: lxml
+ Requires-Dist: pimht
+ Requires-Dist: pysbd
+ Requires-Dist: nlpir-python
+ Requires-Dist: pillow
+
+ Purpose: This module is designed to make complex tasks accessible and convenient, even for beginners. By providing a unified set of tools, it simplifies the workflow for data collection, processing, and analysis. Whether you're scraping data from the web, cleaning text, or performing NLP tasks, this module ensures you can focus on your research without getting bogged down by technical challenges.
+
+ Key Features:
+ 1. Web Scraping: Easily scrape data from websites and download multimedia content.
+ 2. Package Management: Install, uninstall, and manage Python packages with simple commands.
+ 3. Data Retrieval: Extract data from various file formats like text, JSON, TSV, Excel, and HTML (both online and offline).
+ 4. Data Storage: Write and append data to text files, Excel, JSON, and JSON lines.
+ 5. File and Folder Processing: Manage file paths, create directories, move or copy files, and search for files with specific keywords.
+ 6. Data Cleaning: Clean text, handle punctuation, remove stopwords, and prepare data for analysis.
+ 7. NLP: Perform tokenization, generate n-grams, and create word lists for text analysis.
+ 8. Math Operations: Format numbers, convert decimals to percentages, and validate data.
+ 9. Visualization: Process images (e.g., make white pixels transparent) and manage fonts for rendering text.
+
+ Author: Pan Guisheng, a PhD student at the Graduate Institute of Interpretation and Translation of Shanghai International Studies University
+ E-mail: 895284504@qq.com
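The 0.2.5 `METADATA` file uses RFC 822-style headers followed by a free-form description, so the standard-library `email` parser can read it. A minimal sketch that pulls the version and dependency list out of an abbreviated copy of those headers; the variable names and the shortened description line are ours.

```python
from email.parser import HeaderParser

# Abbreviated copy of the METADATA headers from the 0.2.5 wheel.
METADATA_TEXT = """\
Metadata-Version: 2.1
Name: PgsFile
Version: 0.2.5
Requires-Python: >=3.8
Requires-Dist: chardet
Requires-Dist: pandas
Requires-Dist: python-docx
Requires-Dist: pip
Requires-Dist: requests
Requires-Dist: fake-useragent
Requires-Dist: lxml
Requires-Dist: pimht
Requires-Dist: pysbd
Requires-Dist: nlpir-python
Requires-Dist: pillow

Purpose: (description text, shortened for this sketch)
"""

message = HeaderParser().parsestr(METADATA_TEXT)
name = message["Name"]
version = message["Version"]
requires = message.get_all("Requires-Dist")  # repeated headers come back as a list
description = message.get_payload()  # everything after the first blank line
```

Note that `pip` and `pillow` appear among the declared dependencies, which is what the new package-management and image-processing imports would need.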
{PgsFile-0.2.3.dist-info → PgsFile-0.2.5.dist-info}/RECORD CHANGED
@@ -1,5 +1,46 @@
- PgsFile/PgsFile.py,sha256=MpXQK6MLMBh1JMAcBw5sRiRof--x4OyARcCsWwn7Z4A,85828
- PgsFile/__init__.py,sha256=E4VfPu1BxCBcZ5WXi5E6faPaNt_Shpvgh9LvBlg7eA0,2389
+ PgsFile/PgsFile.py,sha256=tOSOt3CJqkDp4t8_TwWUNMkqyXXrwTLHR5uNmTRAJsQ,104811
+ PgsFile/__init__.py,sha256=J2yHIlsR26lD7Si1ZVWJjYqOmy8eb5ygm0DRDxwWyhU,2880
+ PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/HK-Press releases of the Financial Secretary Office (2007-2019).tsv,sha256=IpLGQQY5cXbFWmUPFEdzEPz8CXuCdR2DdZOhBxA7FWw,2035252
+ PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Hong Kong bilingual court decisions (1997-2017).tsv,sha256=BMmPr5eYBIv06Wnfb8nOBrfIzpAl-LLoRk3R60dLxe0,5928126
+ PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/HongKong-Legislation.tsv,sha256=PJjiJIKV9aEzE0tAcqRNRCrunyWGiuD3sbkwkD9hoqo,4460018
+ PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Offering documents of financial products (updated as of October 2018).tsv,sha256=aoGw2XNahZ8K7B_PAi2Ca4l37xAKfo2xmTIMEGZGn8g,6361610
+ PgsFile/Corpora/Corpora/Parallel/HK Financial and Legal EC Parallel Corpora/Speeches delivered by SFC Executives (2006-2019).tsv,sha256=qsViJ3UbvmBLgUSTcbFKF55N5uYgNuIVuUFPkgJ3IP0,1315100
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2006.txt,sha256=4tpLYd28r2JLSpFvqoFtZs3KQaIsQomKi-mEUva7XuU,9817
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2008.txt,sha256=mABYTdKc_y5ZZVnFx16WBDJwM2Z0BU1DgtvksiTeRzU,8827
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2009.txt,sha256=rE5Ev7j3uQKLTkXWyYo4Har0bvXhtYEEvZK4JMppv-o,9193
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2010.txt,sha256=ihahkqoWwD-UsDWXmJ65VceBz7iEaMu4zayzsdeBnmY,9627
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2011.txt,sha256=0pw2kXh1RgFZyjR-A6McVLoU_0hwihuqfVJZwmLckiM,8551
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2012.txt,sha256=UJiQ6WKpOcl__E4npBVAgBSaYdRMTK9LInjLcf0mjK8,6717
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2013.txt,sha256=eYbA72SA1wTTsZ6tO02mFmLbji4_Yzqh44kEH8_D8Qo,7435
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2014.txt,sha256=0oPiksRAQjqgLfEdYcfmMIsA_i6NoLtGc6sFLy1rEDw,5706
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2015.txt,sha256=QQyMYL0LiCHpSUroa7FT5k4dko9qxs21ctJhEClecD4,10239
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2016.txt,sha256=Ysnu-wtaWwjonho7JzE_xnK_ziSsYlIpuGtl_h2hF4Q,10255
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2017.txt,sha256=hGRoKBuBYzpLeptXpAXSphnMC2sonh2XX7GA-1sTI8Y,11526
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2018.txt,sha256=j_AAJpsxyVDpmUJtYSVpVQaOfk6U3nUGfFoAqNV5kJQ,12752
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2019.txt,sha256=5929xIfazLnhQ0j4j3rQbfHfxXrsihIvJ6wRyDPiK9k,14064
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2020.txt,sha256=LUBRJ0_eaHDiCu6uLeYAKfTNJQUg05vdR1n1HBLaKEI,15030
+ PgsFile/Corpora/Corpora/Parallel/New Year Address_CE_2006-2021/2021.txt,sha256=S1-9qwrs4M5G2YNo8vNC1t_8z3f_RNVjKaGxuQUtD70,11292
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100201_000150_en.txt,sha256=q0E5jn267NoLl9gunb9GzogIbE54F24qgaH-kGUon8w,3752
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100201_000150_zh.txt,sha256=pa-aahoIxS6dMkE8dA879u9ldHzL4EZCISsdwMAH68U,2878
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100213_000135_en.txt,sha256=emg69zBQ_ju9e4homDsMD7LK48-LizQ8pe3zkWlu-oc,4301
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100213_000135_zh.txt,sha256=RBMMP3Q4SDgX6hbP7lyMKv_ENYRQBxgZ5_HuQpteU4c,3394
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100215_000445_en.txt,sha256=D8HrrYGIy8SERtJx1RqNm2-txfd3ZhPRIAiw-s-QzQI,3641
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100215_000445_zh.txt,sha256=G_9xGBv-mRsSR1DFEqRDM1-9ybEXbyARmHoiHpHtm_o,2798
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000135_en.txt,sha256=vrm538nymNexLjkBeGx_9Y0unnpPgDfI3tN2FU9-L4E,3858
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000135_zh.txt,sha256=fCKOw2Z-EApI2Adl7gTPXS2Zys8xHxsn5cwdr9oJEFg,2880
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000205_en.txt,sha256=12dT0idBdBZF9yFKYn-P2KZ2TOagT7N_Nt-gLk8AUm0,3529
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000205_zh.txt,sha256=BwTnWaiMlSddL393ROSpeTmZ2ZhyfZiljA2JG8C3BvI,2768
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000548_en.txt,sha256=w9Hloxk9-JZOlNRUGEjVo2jatrx-AABnSiEHDeM_GJQ,3675
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100222_000548_zh.txt,sha256=mEAjUaGa4WBnEwggXUKm_sLJiR0uUZqghw84fsxz0DY,2998
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100225_001011_en.txt,sha256=ZVXvK1DJpFlvbRyb2OQ4LRB8f579CW9H8rCkcaELdmM,3653
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100225_001011_zh.txt,sha256=tdipXKy-7U--_l01kyVFfsMmYiD_DKduuFt1dZlfpOY,3099
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000129_en.txt,sha256=8jdDSy9vMRJnaAbC3Au0EwROgQ1QbttEoxG0zQXXiwk,3474
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000129_zh.txt,sha256=eS-3OPr5YqgAJO9Xbw2ImBuj0NlEcpeWqoXO2Mu3mMo,3087
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000649_en.txt,sha256=meeVyJNSI1hMJye2se5fo43Mt409F0eoEsPltkdzQm0,4023
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100227_000649_zh.txt,sha256=QzMJZObAXgVmIBZNUFxxV0cQuKawY6uqXOoqp-5AXHo,3126
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100301_000549_en.txt,sha256=jDwww5MYADdV-d-0c4b5rhUBx__egLc1utgxnKHXme8,3829
+ PgsFile/Corpora/Corpora/Parallel/Sports News_CE_2010/20100301_000549_zh.txt,sha256=W1eed2ch9yX37oDNy0hhj1ZYc91joMRzfV3pJuCkRQE,3250
+ PgsFile/Corpora/Corpora/Parallel/Xi's Speech_CE_2021/Speech at a Ceremony Marking the Centenary of the CPC.txt,sha256=3suCjs2LF2_Endg2i_hc3GX1N8lTBORlqpMWEKsXFeM,54282
  PgsFile/Corpora/Idioms/English_Idioms_8774.txt,sha256=qlsP0yI_XGECBRiPZuLkGZpdasc77sWSKexANu7v8_M,175905
  PgsFile/Corpora/Monolingual/Chinese/People's Daily 20130605/Raw/00000000.txt,sha256=SLGGSMSb7Ff1RoBstsTW3yX2wNZpqEUchFNpcI-mrR4,1513
  PgsFile/Corpora/Monolingual/Chinese/People's Daily 20130605/Raw/00000001.txt,sha256=imOa6UoCOIZoPXT4_HNHgCUJtd4FTIdk2FZNHNBgJyg,3372
@@ -2600,6 +2641,7 @@ PgsFile/Corpora/Stopwords/turkish.txt,sha256=uGUvjEm2GR8PuVY_JeHNxhD7cWlNlF7vc3V
  PgsFile/Corpora/Stopwords/ukrainian.txt,sha256=fEzWLTwnWJriILkO-5jSfE2SpqY-GPf_kR4zid3MFUI,4131
  PgsFile/Corpora/Stopwords/vietnamese.txt,sha256=88yRtVMaRSFqas1iGGa6kOGDCZTgtzRPmR3q9dHshdc,20485
  PgsFile/Corpora/Terminology/Chinese_Thought.json,sha256=CdkuF2wLaDC5V3sRefcU1RZwXm4-wTZ-Qfk8r7gsu8I,2301866
+ PgsFile/models/NLPIR.user,sha256=DykLJdr8_cVHrdCnDJES1O5dgmnYqfaSO1_dtAVKYJk,3356
  PgsFile/models/czech.pickle,sha256=W6c9KTx9eVOVa88C82lexcHw1Sfyo8OAl_VZM5T6FpA,1265552
  PgsFile/models/danish.pickle,sha256=6il2CgqRl_UspZ54rq_FpvVdBSWPr32xcJsrnrMh7yA,1264725
  PgsFile/models/dutch.pickle,sha256=So4ms9aMRcOOWU0Z4tVndEe_3KpjbTsees_tDpJy1zw,742624
@@ -2611,6 +2653,8 @@ PgsFile/models/german.pickle,sha256=6rSX-ghUExMMj9D7E7kpEokwr-L2om6ocVyV33CI6Xw,
  PgsFile/models/greek.pickle,sha256=IXUqZ2L61c_kb7XEX62ahUhKDo6Bxn5q9vuXPPwn1nw,1953106
  PgsFile/models/italian.pickle,sha256=3LJxfXvl8m6GCpLgWs9psRI6X0UnzXommpq56eZoyAU,658331
  PgsFile/models/malayalam.pickle,sha256=H4z1isvbf0cqxAr_wTZjvkLa-0fBUDDBGt4ERMng5T0,221207
+ PgsFile/models/model_reviews2.2.bin,sha256=D6uL8KZIxD0rfWjH0kYEb7z_HE4aTJXpj82HzsCOpuk,1943196
+ PgsFile/models/model_reviews_ReadMe.txt,sha256=Q9uLJwudMmsTKfd11l1tOcIP8lwsemIwnAVJG_3SYjU,11433
  PgsFile/models/norwegian.pickle,sha256=5Kl_j5oDoDON10a8yJoK4PVK5DuDX6N9g-J54cp5T68,1259779
  PgsFile/models/polish.pickle,sha256=FhJ7bRCTNCej6Q-yDpvlPh-zcf95pzDBAwc07YC5DJI,2042451
  PgsFile/models/portuguese.pickle,sha256=uwG_fHmk6twheLvSCWZROaDks48tHET-8Jfek5VRQOA,649051
@@ -2619,8 +2663,14 @@ PgsFile/models/slovene.pickle,sha256=faxlAhKzeHs5mWwBvSCEEVST5vbsOQurYfdnUlsIuOo
  PgsFile/models/spanish.pickle,sha256=Jx3GAnxKrgVvcqm_q1ZFz2fhmL9PlyiVhE5A9ZiczcM,597831
  PgsFile/models/swedish.pickle,sha256=QNUOva1sqodxXy4wCxIX7JLELeIFpUPMSlaQO9LJrPo,1034496
  PgsFile/models/turkish.pickle,sha256=065H12UB0CdpiAnRLnUpLJw5KRBIhUM0KAL5Xbl2XMw,1225013
- PgsFile-0.2.3.dist-info/LICENSE,sha256=cE5c-QToSkG1KTUsU8drQXz1vG0EbJWuU4ybHTRb5SE,1138
- PgsFile-0.2.3.dist-info/METADATA,sha256=a9KMN6LpC2raZYhWwrFhWCXKl7nWneiXT7KtvA74ruY,5070
- PgsFile-0.2.3.dist-info/WHEEL,sha256=eOLhNAGa2EW3wWl_TU484h7q1UNgy0JXjjoqKoxAAQc,92
- PgsFile-0.2.3.dist-info/top_level.txt,sha256=028hCfwhF3UpfD6X0rwtWpXI1RKSTeZ1ALwagWaSmX8,8
- PgsFile-0.2.3.dist-info/RECORD,,
+ PgsFile/models/fonts/DejaVuSans.ttf,sha256=faGVp0xVvvmI0NSPlQi9XYSUJcF3Dbpde_xs6e2EiVQ,757076
+ PgsFile/models/fonts/书体坊赵九江钢笔行书体.ttf,sha256=fTOv4FFMnYtN1zCZghJ6-P1pzznA5qqoujwpDFY63Ek,3140656
+ PgsFile/models/fonts/全新硬笔楷书简.ttf,sha256=mPemGYMpgQxvFL1pFjjnyUMIprHzcoOaw8oeZQ4k1x0,2397296
+ PgsFile/models/fonts/全新硬笔行书简.ttf,sha256=bUtbl71eK_ellp1z0tCmmR_P-JhqVFIpzeuRlrEBo9g,2611516
+ PgsFile/models/fonts/博洋行书3500.TTF,sha256=VrgeHr8cgOL6JD05QyuD9ZSyw4J2aIVxKxW8zSajq6Q,4410732
+ PgsFile/models/fonts/陆柬之行书字体.ttf,sha256=Zpd4Z7E9w-Qy74yklXHk4vM7HOtHuQgllvygxZZ1Hvs,1247288
+ PgsFile-0.2.5.dist-info/LICENSE,sha256=cE5c-QToSkG1KTUsU8drQXz1vG0EbJWuU4ybHTRb5SE,1138
+ PgsFile-0.2.5.dist-info/METADATA,sha256=v1GYkJVW4R4MqIl9DYkg0zjNgH-oU5qoKH-S5-qubok,2711
+ PgsFile-0.2.5.dist-info/WHEEL,sha256=eOLhNAGa2EW3wWl_TU484h7q1UNgy0JXjjoqKoxAAQc,92
+ PgsFile-0.2.5.dist-info/top_level.txt,sha256=028hCfwhF3UpfD6X0rwtWpXI1RKSTeZ1ALwagWaSmX8,8
+ PgsFile-0.2.5.dist-info/RECORD,,
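Each RECORD line above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed, and the size is in bytes. As a minimal sketch of how such an entry could be recomputed and checked (the helper names `record_entry` and `matches_record` are hypothetical and not part of PgsFile or any packaging tool, and the sketch assumes the path needs no CSV quoting):

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style line for `data` stored at `path` (hypothetical helper)."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64, with the trailing "=" padding stripped.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


def matches_record(line: str, data: bytes) -> bool:
    """Return True if `data` has the hash and size stated in a RECORD line."""
    # Split from the right so only the last two fields are separated from the path.
    path, hash_field, size_field = line.rsplit(",", 2)
    if not hash_field or not size_field:
        # The RECORD file's own entry ends in ",," and carries no hash or size.
        return False
    return line == record_entry(path, data)
```

Under this scheme, an unchanged digest and size between the two versions (as with the `LICENSE` entries) indicates byte-identical file content, while the `METADATA` entries differ in both fields.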
@@ -1,79 +0,0 @@
- Metadata-Version: 2.1
- Name: PgsFile
- Version: 0.2.3
- Summary: This module aims to simplify Python package management, script execution, file handling, web scraping, multimedia download, data cleaning, NLP tasks like Chinese word tokenization and POS tagging, and word list generation for literary students, making it more accessible and convenient to use.
- Home-page: https://mp.weixin.qq.com/s/12-KVLfaPszoZkCxuRd-nQ?token=1589547443&lang=zh_CN
- Author: Pan Guisheng
- Author-email: 895284504@qq.com
- License: Educational free
- Classifier: Programming Language :: Python :: 3
- Classifier: License :: Free For Educational Use
- Classifier: Operating System :: OS Independent
- Requires-Python: >=3.8
- Description-Content-Type: text/markdown
- License-File: LICENSE
- Requires-Dist: chardet
- Requires-Dist: pandas
- Requires-Dist: python-docx
- Requires-Dist: pip
- Requires-Dist: requests
- Requires-Dist: fake-useragent
- Requires-Dist: lxml
- Requires-Dist: pimht
- Requires-Dist: pysbd
- Requires-Dist: nlpir-python
-
- Purpose: This module aims to assist Python beginners, particularly instructors and students of foreign languages and literature, by providing a convenient way to manage Python packages, run Python scripts, and perform operations on various file types such as txt, xlsx, json, tsv, html, mhtml, and docx. It also includes functionality for data scraping, cleaning and generating word lists.
-
-
- Function 1: Enables efficient data retrieval and storage in files with a single line of code.
-
- Function 2: Facilitates retrieval of all absolute file paths and file names in any folder (including sub-folders) with a single line of code using "FilePath" and "FileName" functions.
-
- Function 3: Simplifies creation of word lists and frequency sorting from a file or batch of files using "word_list" and "batch_word_list" functions in PgsFile.
-
- Function 4: Pgs-Corpora is a comprehensive language resource included in this library, featuring a monolingual corpus of native and translational Chinese and native and non-native English, as well as a bi-directional parallel corpus of Chinese and English texts covering financial, legal, political, academic, and sports news topics. Additionally, the library includes a collection of 8774 English idioms, stopwords for 28 languages, and a termbank of Chinese thought and culture.
-
- Function 5: This library provides support for common text cleaning tasks, such as removing empty text, empty lines, and folders containing empty text. It also offers functions for converting full-width characters to half-width characters and vice versa, as well as standardizing the format of Chinese and English punctuation. These features can help improve the quality and consistency of text data used in natural language processing tasks.
-
- Function 6: It also manages Python package installations and uninstallations, and allows running scripts and commands in Python interactive command lines instead of Windows command prompt.
-
- Function 7: Download audiovisual files like videos, images, and audio using audiovisual_downloader, which is extremely useful and efficient. Additionally, scrape newspaper data with PGScraper, a highly efficient tool for this purpose.
-
- Table 1: The directory and size of Pgs-Corpora
- ├── Idioms (1, 171.78 KB)
- ├── Monolingual (2197, 63.65 MB)
- │   ├── Chinese (456, 15.27 MB)
- │   │   ├── People's Daily 20130605 (396, 1.38 MB)
- │   │   │   ├── Raw (132, 261.73 KB)
- │   │   │   ├── Seg_only (132, 471.47 KB)
- │   │   │   └── Tagged (132, 675.30 KB)
- │   │   └── Translational Fictions (60, 13.89 MB)
- │   └── English (1741, 48.38 MB)
- │       ├── Native (65, 44.14 MB)
- │       │   ├── A Short Collection of British Fiction (27, 33.90 MB)
- │       │   └── Preschoolers- and Teenagers-oriented Texts in English (36, 10.24 MB)
- │       ├── Non-native (1675, 3.63 MB)
- │       │   └── Shanghai Daily (1675, 3.63 MB)
- │       │       └── Business_2019 (1675, 3.63 MB)
- │       │           ├── 2019-01-01 (1, 3.35 KB)
- │       │           ├── 2019-01-02 (1, 3.65 KB)
- │       │           ├── 2019-01-03 (7, 10.90 KB)
- │       │           ├── 2019-01-04 (5, 9.63 KB)
- │       │           └── 2019-01-07 (4, 9.50 KB)
- │       │           └── ... (and 245 more directories)
- │       └── Translational (1, 622.57 KB)
- ├── Parallel (371, 24.67 MB)
- │   ├── HK Financial and Legal EC Parallel Corpora (5, 19.17 MB)
- │   ├── New Year Address_CE_2006-2021 (15, 147.49 KB)
- │   ├── Sports News_CE_2010 (20, 66.42 KB)
- │   ├── TED_EC_2017-2020 (330, 5.24 MB)
- │   └── Xi's Speech_CE_2021 (1, 53.01 KB)
- ├── Stopwords (28, 88.09 KB)
- └── Terminology (1, 2.20 MB)
-
- ...
-
-
- Author: Pan Guisheng, a PhD student at the Graduate Institute of Interpretation and Translation of Shanghai International Studies University
- E-mail: 895284504@qq.com