greenmining 0.1.11__py3-none-any.whl → 1.0.1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- greenmining/__init__.py +42 -1
- greenmining/__version__.py +1 -1
- greenmining/analyzers/__init__.py +17 -0
- greenmining/analyzers/code_diff_analyzer.py +238 -0
- greenmining/analyzers/ml_feature_extractor.py +512 -0
- greenmining/analyzers/nlp_analyzer.py +365 -0
- greenmining/analyzers/qualitative_analyzer.py +460 -0
- greenmining/analyzers/statistical_analyzer.py +245 -0
- greenmining/analyzers/temporal_analyzer.py +434 -0
- greenmining/cli.py +126 -25
- greenmining/config.py +21 -0
- greenmining/controllers/repository_controller.py +58 -3
- greenmining/gsf_patterns.py +10 -5
- greenmining/models/aggregated_stats.py +3 -1
- greenmining/models/commit.py +3 -0
- greenmining/models/repository.py +3 -1
- greenmining/presenters/console_presenter.py +3 -1
- greenmining/services/commit_extractor.py +27 -1
- greenmining/services/data_aggregator.py +163 -5
- greenmining/services/data_analyzer.py +111 -8
- greenmining/services/github_fetcher.py +62 -5
- greenmining/services/reports.py +123 -2
- greenmining-1.0.1.dist-info/METADATA +699 -0
- greenmining-1.0.1.dist-info/RECORD +36 -0
- greenmining-0.1.11.dist-info/METADATA +0 -335
- greenmining-0.1.11.dist-info/RECORD +0 -29
- {greenmining-0.1.11.dist-info → greenmining-1.0.1.dist-info}/WHEEL +0 -0
- {greenmining-0.1.11.dist-info → greenmining-1.0.1.dist-info}/entry_points.txt +0 -0
- {greenmining-0.1.11.dist-info → greenmining-1.0.1.dist-info}/licenses/LICENSE +0 -0
- {greenmining-0.1.11.dist-info → greenmining-1.0.1.dist-info}/top_level.txt +0 -0
greenmining-1.0.1.dist-info/METADATA

@@ -0,0 +1,699 @@

Metadata-Version: 2.4
Name: greenmining
Version: 1.0.1
Summary: Analyze GitHub repositories to identify green software engineering patterns and energy-efficient practices
Author-email: Adam Bouafia <a.bouafia@student.vu.nl>
License: MIT
Project-URL: Homepage, https://github.com/adam-bouafia/greenmining
Project-URL: Documentation, https://github.com/adam-bouafia/greenmining#readme
Project-URL: Repository, https://github.com/adam-bouafia/greenmining
Project-URL: Issues, https://github.com/adam-bouafia/greenmining/issues
Project-URL: Changelog, https://github.com/adam-bouafia/greenmining/blob/main/CHANGELOG.md
Keywords: green-software,gsf,sustainability,carbon-footprint,microservices,mining,repository-analysis,energy-efficiency,github-analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Environment :: Console
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyGithub>=2.1.1
Requires-Dist: PyDriller>=2.5
Requires-Dist: pandas>=2.2.0
Requires-Dist: click>=8.1.7
Requires-Dist: colorama>=0.4.6
Requires-Dist: tabulate>=0.9.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: matplotlib>=3.8.0
Requires-Dist: plotly>=5.18.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock>=3.12.0; extra == "dev"
Requires-Dist: black>=23.12.0; extra == "dev"
Requires-Dist: ruff>=0.1.9; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Requires-Dist: build>=1.0.3; extra == "dev"
Requires-Dist: twine>=4.0.2; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.2.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=2.0.0; extra == "docs"
Requires-Dist: myst-parser>=2.0.0; extra == "docs"
Dynamic: license-file

# greenmining

Green mining for microservices repositories.

[PyPI version](https://pypi.org/project/greenmining/)
[Python versions](https://pypi.org/project/greenmining/)
[License](LICENSE)

## Overview

`greenmining` is a Python library and CLI tool for analyzing GitHub repositories to identify green software engineering practices and energy-efficient patterns. It detects sustainable software patterns across cloud, web, AI, database, networking, and general categories.

## Installation

### Via pip

```bash
pip install greenmining
```

### From source

```bash
git clone https://github.com/adam-bouafia/greenmining.git
cd greenmining
pip install -e .
```

### With Docker

```bash
docker pull adambouafia/greenmining:latest
```

## Quick Start

### CLI Usage

```bash
# Set your GitHub token
export GITHUB_TOKEN="your_github_token"

# Run the full analysis pipeline
greenmining pipeline --max-repos 100

# Fetch repositories with custom keywords
greenmining fetch --max-repos 100 --min-stars 100 --keywords "kubernetes docker cloud-native"

# Fetch with the default keyword (microservices)
greenmining fetch --max-repos 100 --min-stars 100

# Extract commits
greenmining extract --max-commits 50

# Analyze for green patterns
greenmining analyze

# Analyze with advanced features
greenmining analyze --enable-nlp --enable-ml-features --enable-diff-analysis

# Aggregate results with temporal analysis
greenmining aggregate --enable-temporal --temporal-granularity quarter --enable-enhanced-stats

# Generate report
greenmining report
```

### Python API

#### Basic Pattern Detection

```python
from greenmining import GSF_PATTERNS, is_green_aware, get_pattern_by_keywords

# Check available patterns
print(f"Total patterns: {len(GSF_PATTERNS)}")  # 122 patterns across 15 categories

# Detect green awareness in commit messages
commit_msg = "Optimize Redis caching to reduce energy consumption"
if is_green_aware(commit_msg):
    patterns = get_pattern_by_keywords(commit_msg)
    print(f"Matched patterns: {patterns}")
    # Output: ['Cache Static Data', 'Use Efficient Cache Strategies']
```

#### Fetch Repositories with Custom Keywords (NEW)

```python
from greenmining import fetch_repositories

# Fetch repositories with custom search keywords
repos = fetch_repositories(
    github_token="your_github_token",    # Required: GitHub personal access token
    max_repos=50,                        # Maximum number of repositories to fetch
    min_stars=500,                       # Minimum star count filter
    keywords="kubernetes cloud-native",  # Search keywords (space-separated)
    languages=["Python", "Go"],          # Programming language filters
    created_after="2020-01-01",          # Filter by creation date (YYYY-MM-DD)
    created_before="2024-12-31",         # Filter by creation date (YYYY-MM-DD)
    pushed_after="2023-01-01",           # Filter by last push date (YYYY-MM-DD)
    pushed_before="2024-12-31",          # Filter by last push date (YYYY-MM-DD)
)

print(f"Found {len(repos)} repositories")
for repo in repos[:5]:
    print(f"- {repo.full_name} ({repo.stars} stars)")
```

**Parameters:**
- `github_token` (str, required): GitHub personal access token for API authentication
- `max_repos` (int, default=100): Maximum number of repositories to fetch
- `min_stars` (int, default=100): Minimum GitHub stars filter
- `keywords` (str, default="microservices"): Space-separated search keywords
- `languages` (list[str], optional): Programming language filters (e.g., ["Python", "Go", "Java"])
- `created_after` (str, optional): Only include repos created after this date ("YYYY-MM-DD")
- `created_before` (str, optional): Only include repos created before this date ("YYYY-MM-DD")
- `pushed_after` (str, optional): Only include repos pushed after this date ("YYYY-MM-DD")
- `pushed_before` (str, optional): Only include repos pushed before this date ("YYYY-MM-DD")

#### Analyze Repository Commits

```python
from greenmining.services.commit_extractor import CommitExtractor
from greenmining.services.data_analyzer import DataAnalyzer
from greenmining.analyzers.nlp_analyzer import NLPAnalyzer
from greenmining.analyzers.ml_feature_extractor import MLFeatureExtractor
from greenmining import fetch_repositories

# Fetch repositories with custom keywords
repos = fetch_repositories(
    github_token="your_token",
    max_repos=10,
    keywords="serverless edge-computing",
)

# Initialize the commit extractor
extractor = CommitExtractor(
    exclude_merge_commits=True,  # Skip merge commits (default: True)
    exclude_bot_commits=True,    # Skip bot commits (default: True)
    min_message_length=10,       # Minimum commit message length (default: 10)
)

# Initialize the analyzer with advanced features
analyzer = DataAnalyzer(
    enable_diff_analysis=False,  # Code diff analysis (slower but more accurate)
    enable_nlp=True,             # NLP-enhanced pattern detection
    enable_ml_features=True,     # ML feature extraction
    patterns=None,               # Custom pattern dict (default: GSF_PATTERNS)
    batch_size=10,               # Batch processing size (default: 10)
)

# Optional: configure the NLP analyzer separately
nlp_analyzer = NLPAnalyzer(
    enable_stemming=True,  # Morphological analysis (optimize → optimizing)
    enable_synonyms=True,  # Semantic synonym matching (cache → buffer)
)

# Optional: configure the ML feature extractor
ml_extractor = MLFeatureExtractor(
    green_keywords=None,  # Custom keyword list (default: built-in 19 keywords)
)

# Extract commits from the first repo
commits = extractor.extract_commits(
    repository=repos[0],  # PyGithub Repository object
    max_commits=50,       # Maximum commits to extract per repository
    since=None,           # Start date filter (datetime object, optional)
    until=None,           # End date filter (datetime object, optional)
)

# Analyze commits for green patterns
results = []
for commit in commits:
    result = analyzer.analyze_commit(commit)
    if result['green_aware']:
        results.append(result)
        print(f"Green commit found: {commit.message[:50]}...")
        print(f"  Patterns: {result['known_pattern']}")

        # Access NLP analysis results (NEW)
        if 'nlp_analysis' in result:
            nlp = result['nlp_analysis']
            print(f"  NLP: {nlp['morphological_count']} morphological matches, "
                  f"{nlp['semantic_count']} semantic matches")

        # Access ML features (NEW)
        if 'ml_features' in result:
            ml = result['ml_features']['text']
            print(f"  ML features: {ml['word_count']} words, "
                  f"keyword density: {ml['keyword_density']:.2f}")
```

**CommitExtractor Parameters:**
- `exclude_merge_commits` (bool, default=True): Skip merge commits during extraction
- `exclude_bot_commits` (bool, default=True): Skip commits from bot accounts
- `min_message_length` (int, default=10): Minimum commit message length for inclusion

**DataAnalyzer Parameters:**
- `enable_diff_analysis` (bool, default=False): Enable code diff analysis (slower)
- `enable_nlp` (bool, default=False): Enable NLP-enhanced pattern detection
- `enable_ml_features` (bool, default=False): Enable ML feature extraction
- `patterns` (dict, optional): Custom pattern dictionary (default: GSF_PATTERNS)
- `batch_size` (int, default=10): Number of commits to process in each batch

**NLPAnalyzer Parameters:**
- `enable_stemming` (bool, default=True): Enable morphological variant matching
- `enable_synonyms` (bool, default=True): Enable semantic synonym expansion

**MLFeatureExtractor Parameters:**
- `green_keywords` (list[str], optional): Custom green keywords list

#### Access Sustainability Patterns Data

```python
from greenmining import GSF_PATTERNS

# Get all patterns by category
cloud_patterns = {
    pid: pattern for pid, pattern in GSF_PATTERNS.items()
    if pattern['category'] == 'cloud'
}
print(f"Cloud patterns: {len(cloud_patterns)}")  # 40 patterns

ai_patterns = {
    pid: pattern for pid, pattern in GSF_PATTERNS.items()
    if pattern['category'] == 'ai'
}
print(f"AI/ML patterns: {len(ai_patterns)}")  # 19 patterns

# Get pattern details
cache_pattern = GSF_PATTERNS['gsf_001']
print(f"Pattern: {cache_pattern['name']}")
print(f"Category: {cache_pattern['category']}")
print(f"Keywords: {cache_pattern['keywords']}")
print(f"Impact: {cache_pattern['sci_impact']}")

# List all available categories
categories = set(p['category'] for p in GSF_PATTERNS.values())
print(f"Available categories: {sorted(categories)}")
# Output: ['ai', 'async', 'caching', 'cloud', 'code', 'data',
#          'database', 'general', 'infrastructure', 'microservices',
#          'monitoring', 'network', 'networking', 'resource', 'web']
```

#### Advanced Analysis: Temporal Trends (NEW)

```python
from greenmining.services.data_aggregator import DataAggregator
from greenmining.analyzers.temporal_analyzer import TemporalAnalyzer
from greenmining.analyzers.qualitative_analyzer import QualitativeAnalyzer

# Initialize the aggregator with all advanced features
aggregator = DataAggregator(
    config=None,                     # Config object (optional)
    enable_enhanced_stats=True,      # Statistical analysis (correlations, trends)
    enable_temporal=True,            # Temporal trend analysis
    temporal_granularity="quarter",  # Granularity: day/week/month/quarter/year
)

# Optional: configure the temporal analyzer separately
temporal_analyzer = TemporalAnalyzer(
    granularity="quarter",  # Time period granularity for grouping commits
)

# Optional: configure the qualitative analyzer for validation sampling
qualitative_analyzer = QualitativeAnalyzer(
    sample_size=30,         # Number of samples for manual validation
    stratify_by="pattern",  # Stratification: pattern/repository/time/random
)

# Aggregate results with temporal insights
aggregated = aggregator.aggregate(
    analysis_results=analysis_results,  # List of analysis result dictionaries
    repositories=repositories,          # List of PyGithub repository objects
)

# Access temporal analysis results
temporal = aggregated['temporal_analysis']
print(f"Time periods analyzed: {len(temporal['periods'])}")

# View pattern adoption trends over time
for period_data in temporal['periods']:
    print(f"{period_data['period']}: {period_data['commit_count']} commits, "
          f"{period_data['green_awareness_rate']:.1%} green awareness")

# Access pattern evolution insights
evolution = temporal.get('pattern_evolution', {})
print(f"Emerging patterns: {evolution.get('emerging', [])}")
print(f"Stable patterns: {evolution.get('stable', [])}")
```

**DataAggregator Parameters:**
- `config` (Config, optional): Configuration object
- `enable_enhanced_stats` (bool, default=False): Enable pattern correlations and effect size analysis
- `enable_temporal` (bool, default=False): Enable temporal trend analysis over time
- `temporal_granularity` (str, default="quarter"): Time granularity (day/week/month/quarter/year)

**TemporalAnalyzer Parameters:**
- `granularity` (str, default="quarter"): Time period for grouping (day/week/month/quarter/year)

**QualitativeAnalyzer Parameters:**
- `sample_size` (int, default=30): Number of commits to sample for validation
- `stratify_by` (str, default="pattern"): Stratification method (pattern/repository/time/random)

#### Generate Custom Reports

```python
from greenmining.services.data_aggregator import DataAggregator
from greenmining.config import Config

config = Config()
aggregator = DataAggregator(config)

# Load analysis results
results = aggregator.load_analysis_results()

# Generate statistics
stats = aggregator.calculate_statistics(results)
print(f"Total commits analyzed: {stats['total_commits']}")
print(f"Green-aware commits: {stats['green_aware_count']}")
print(f"Top patterns: {stats['top_patterns'][:5]}")

# Export to CSV
aggregator.export_to_csv(results, "output.csv")
```

#### Batch Analysis

```python
from greenmining.controllers.repository_controller import RepositoryController
from greenmining.config import Config

config = Config()
controller = RepositoryController(config)

# Run the full pipeline programmatically
controller.fetch_repositories(max_repos=50)
controller.extract_commits(max_commits=100)
controller.analyze_commits()
controller.aggregate_results()
controller.generate_report()

print("Analysis complete! Check the data/ directory for results.")
```

### Docker Usage

```bash
# Run the analysis pipeline
docker run -v $(pwd)/data:/app/data \
  adambouafia/greenmining:latest --help

# With custom configuration
docker run -v $(pwd)/.env:/app/.env:ro \
  -v $(pwd)/data:/app/data \
  adambouafia/greenmining:latest pipeline --max-repos 50

# Interactive shell
docker run -it adambouafia/greenmining:latest /bin/bash
```

## Configuration

### Environment Variables

Create a `.env` file or set environment variables:

```bash
# Required
GITHUB_TOKEN=your_github_personal_access_token

# Optional - Repository fetching
MAX_REPOS=100
MIN_STARS=100
SUPPORTED_LANGUAGES=Python,Java,Go,JavaScript,TypeScript
SEARCH_KEYWORDS=microservices

# Optional - Commit extraction
COMMITS_PER_REPO=50
EXCLUDE_MERGE_COMMITS=true
EXCLUDE_BOT_COMMITS=true

# Optional - Analysis features
ENABLE_DIFF_ANALYSIS=false
ENABLE_NLP=true
ENABLE_ML_FEATURES=true
BATCH_SIZE=10

# Optional - Temporal analysis
ENABLE_TEMPORAL=true
TEMPORAL_GRANULARITY=quarter
ENABLE_ENHANCED_STATS=true

# Optional - Output
OUTPUT_DIR=./data
REPORT_FORMAT=markdown
```
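
These variables follow the usual environment-first, default-second convention. As a rough stdlib-only sketch of that resolution order (the actual `Config` class, which loads `.env` files via python-dotenv, may resolve settings differently):

```python
import os

def env_setting(name: str, default: str) -> str:
    """Return the environment value for `name`, or the default if unset."""
    return os.environ.get(name, default)

# Simulate one variable being set (normally done in the shell or a .env file)
os.environ["MAX_REPOS"] = "250"

max_repos = int(env_setting("MAX_REPOS", "100"))  # set above, so 250
min_stars = int(env_setting("MIN_STARS", "100"))  # unset, so default 100
languages = env_setting(
    "SUPPORTED_LANGUAGES", "Python,Java,Go,JavaScript,TypeScript"
).split(",")

print(max_repos, min_stars, languages)
```

Values set in the environment win over the documented defaults, which is why an exported `GITHUB_TOKEN` is enough without any `.env` file.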

### Config Object Parameters

```python
from greenmining.config import Config

config = Config(
    # GitHub API
    github_token="your_token",  # GitHub personal access token (required)

    # Repository fetching
    max_repos=100,                         # Maximum repositories to fetch
    min_stars=100,                         # Minimum star threshold
    supported_languages=["Python", "Go"],  # Language filters
    search_keywords="microservices",       # Default search keywords

    # Commit extraction
    max_commits=50,              # Commits per repository
    exclude_merge_commits=True,  # Skip merge commits
    exclude_bot_commits=True,    # Skip bot commits
    min_message_length=10,       # Minimum commit message length

    # Analysis options
    enable_diff_analysis=False,  # Enable code diff analysis
    enable_nlp=True,             # Enable NLP features
    enable_ml_features=True,     # Enable ML feature extraction
    batch_size=10,               # Batch processing size

    # Temporal analysis
    enable_temporal=True,            # Enable temporal trend analysis
    temporal_granularity="quarter",  # day/week/month/quarter/year
    enable_enhanced_stats=True,      # Enable statistical analysis

    # Output configuration
    output_dir="./data",                      # Output directory path
    repos_file="repositories.json",           # Repositories filename
    commits_file="commits.json",              # Commits filename
    analysis_file="analysis_results.json",    # Analysis results filename
    stats_file="aggregated_statistics.json",  # Statistics filename
    report_file="green_analysis.md",          # Report filename
)
```

## Features

### Core Capabilities

- **Pattern Detection**: Automatically identifies 122 sustainability patterns across 15 categories
- **Keyword Analysis**: Scans commit messages using 321 green software keywords
- **Custom Repository Fetching**: Fetches repositories with custom search keywords (not limited to microservices)
- **Repository Analysis**: Analyzes repositories from GitHub with flexible filtering
- **Batch Processing**: Analyzes hundreds of repositories and thousands of commits
- **Multi-format Output**: Generates Markdown reports, CSV exports, and JSON data
- **Statistical Analysis**: Calculates green-awareness metrics, pattern distribution, and trends
- **Docker Support**: Pre-built images for containerized analysis
- **Programmatic API**: Full Python API for custom workflows and integrations
- **Clean Architecture**: Modular design with a services layer (Fetcher, Extractor, Analyzer, Aggregator, Reports)

### Pattern Database

**122 green software patterns based on:**

- the Green Software Foundation (GSF) Patterns Catalog
- VU Amsterdam 2024 research on ML system sustainability
- ICSE 2024 conference papers on sustainable software

### Detection Performance

- **Coverage**: 67% of patterns actively detect in real-world commits
- **Accuracy**: 100% true positive rate for green-aware commits
- **Categories**: 15 distinct sustainability domains covered
- **Keywords**: 321 detection terms across all patterns
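
Conceptually, keyword-based detection reduces to matching a commit message against each pattern's keyword list. The sketch below illustrates that idea with a made-up two-pattern mini-catalog; it is not the library's actual implementation or keyword set:

```python
# Illustrative mini-catalog; the real GSF_PATTERNS holds 122 patterns
# with 321 keywords in total.
PATTERNS = {
    "gsf_cache": {"name": "Cache Static Data", "keywords": ["cache", "caching"]},
    "gsf_scale": {"name": "Auto-scaling", "keywords": ["autoscale", "scale down"]},
}

def matched_patterns(message: str) -> list[str]:
    """Return names of patterns whose keywords appear in the message."""
    text = message.lower()
    return [
        p["name"] for p in PATTERNS.values()
        if any(kw in text for kw in p["keywords"])
    ]

print(matched_patterns("Optimize Redis caching to reduce energy consumption"))
# ['Cache Static Data']
```

A commit is "green-aware" when at least one pattern matches, which is what `is_green_aware` reports and `get_pattern_by_keywords` enumerates.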

## GSF Pattern Categories

**122 patterns across 15 categories:**

### 1. Cloud (40 patterns)
Auto-scaling, serverless computing, right-sizing instances, region selection for renewable energy, spot instances, idle resource detection, cloud-native architectures

### 2. Web (17 patterns)
CDN usage, caching strategies, lazy loading, asset compression, image optimization, minification, code splitting, tree shaking, prefetching

### 3. AI/ML (19 patterns)
Model optimization, pruning, quantization, edge inference, batch optimization, efficient training, model compression, hardware acceleration, green ML pipelines

### 4. Database (5 patterns)
Indexing strategies, query optimization, connection pooling, prepared statements, database views, denormalization for efficiency

### 5. Networking (8 patterns)
Protocol optimization, connection reuse, HTTP/2, gRPC, efficient serialization, compression, persistent connections

### 6. Network (6 patterns)
Request batching, GraphQL optimization, API gateway patterns, circuit breakers, rate limiting, request deduplication

### 7. Caching (2 patterns)
Multi-level caching, cache invalidation strategies, data deduplication, distributed caching

### 8. Resource (2 patterns)
Resource limits, dynamic allocation, memory management, CPU throttling

### 9. Data (3 patterns)
Efficient serialization formats, pagination, streaming, data compression

### 10. Async (3 patterns)
Event-driven architecture, reactive streams, polling elimination, non-blocking I/O

### 11. Code (4 patterns)
Algorithm optimization, code efficiency, garbage collection tuning, memory profiling

### 12. Monitoring (3 patterns)
Energy monitoring, performance profiling, APM tools, observability patterns

### 13. Microservices (4 patterns)
Service decomposition, colocation strategies, graceful shutdown, service mesh optimization

### 14. Infrastructure (4 patterns)
Alpine containers, Infrastructure as Code, renewable energy regions, container optimization

### 15. General (8 patterns)
Feature flags, incremental processing, precomputation, background jobs, workflow optimization

## CLI Commands

| Command | Description | Key Options |
|---------|-------------|-------------|
| `fetch` | Fetch repositories from GitHub with custom keywords | `--max-repos`, `--min-stars`, `--languages`, `--keywords` |
| `extract` | Extract commit history from repositories | `--max-commits` per repository |
| `analyze` | Analyze commits for green patterns | `--enable-nlp`, `--enable-ml-features`, `--enable-diff-analysis` |
| `aggregate` | Aggregate analysis results | `--enable-temporal`, `--temporal-granularity`, `--enable-enhanced-stats` |
| `report` | Generate a comprehensive report | Creates Markdown and CSV outputs |
| `pipeline` | Run the complete analysis pipeline | `--max-repos`, `--max-commits` (all-in-one) |
| `status` | Show current analysis status | Displays progress and file statistics |

### Command Details

#### Fetch Repositories
```bash
# Fetch with custom search keywords
greenmining fetch --max-repos 100 --min-stars 50 --languages Python --keywords "kubernetes docker"

# Fetch microservices (default)
greenmining fetch --max-repos 100 --min-stars 50 --languages Python
```
Options:
- `--max-repos`: Maximum repositories to fetch (default: 100)
- `--min-stars`: Minimum GitHub stars (default: 100)
- `--languages`: Filter by programming languages (default: "Python,Java,Go,JavaScript,TypeScript")
- `--keywords`: Custom search keywords (default: "microservices")

#### Extract Commits
```bash
greenmining extract --max-commits 50
```
Options:
- `--max-commits`: Maximum commits per repository (default: 50)

#### Analyze Commits (with Advanced Features)
```bash
# Basic analysis
greenmining analyze

# Advanced analysis with all features
greenmining analyze --enable-nlp --enable-ml-features --enable-diff-analysis --batch-size 20
```
Options:
- `--batch-size`: Batch size for processing (default: 10)
- `--enable-diff-analysis`: Enable code diff analysis (slower but more accurate)
- `--enable-nlp`: Enable NLP-enhanced pattern detection with morphological variants and synonyms
- `--enable-ml-features`: Enable ML feature extraction for model training

#### Aggregate Results (with Temporal Analysis)
```bash
# Basic aggregation
greenmining aggregate

# Advanced aggregation with temporal trends
greenmining aggregate --enable-temporal --temporal-granularity quarter --enable-enhanced-stats
```
Options:
- `--enable-enhanced-stats`: Enable enhanced statistical analysis (correlations, effect sizes)
- `--enable-temporal`: Enable temporal trend analysis
- `--temporal-granularity`: Time period granularity (choices: day, week, month, quarter, year)

#### Run Pipeline
```bash
greenmining pipeline --max-repos 50 --max-commits 100
```
Options:
- `--max-repos`: Repositories to analyze
- `--max-commits`: Commits per repository
- Executes: fetch → extract → analyze → aggregate → report

## Output Files

All outputs are saved to the `data/` directory:

- `repositories.json` - Repository metadata
- `commits.json` - Extracted commit data
- `analysis_results.json` - Pattern analysis results
- `aggregated_statistics.json` - Summary statistics
- `green_analysis_results.csv` - CSV export for spreadsheets
- `green_microservices_analysis.md` - Final report
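
Because the outputs are plain JSON, they can be inspected with the standard library alone. A minimal sketch using a synthetic statistics file; the field names mirror the `calculate_statistics` example above and may differ across versions:

```python
import json
from pathlib import Path

# Write a tiny synthetic statistics file purely for illustration
stats_path = Path("aggregated_statistics.json")
stats_path.write_text(json.dumps({
    "total_commits": 1200,
    "green_aware_count": 84,
    "top_patterns": [["Cache Static Data", 31], ["Auto-scaling", 17]],
}))

# Load and summarize
stats = json.loads(stats_path.read_text())
rate = stats["green_aware_count"] / stats["total_commits"]
print(f"Green-aware rate: {rate:.1%}")  # Green-aware rate: 7.0%
for name, count in stats["top_patterns"]:
    print(f"  {name}: {count} commits")

stats_path.unlink()  # Clean up the demo file
```

The same approach works for `analysis_results.json` and `commits.json`, or they can be loaded into pandas (already a dependency) for heavier analysis.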

## Development

```bash
# Clone the repository
git clone https://github.com/adam-bouafia/greenmining.git
cd greenmining

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run with coverage
pytest --cov=greenmining tests/

# Format and lint code
black greenmining/ tests/
ruff check greenmining/ tests/
```

## Requirements

- Python 3.9+
- PyGithub >= 2.1.1
- PyDriller >= 2.5
- pandas >= 2.2.0
- click >= 8.1.7

## License

MIT License - see [LICENSE](LICENSE) for details.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request.

## Links

- **GitHub**: https://github.com/adam-bouafia/greenmining
- **PyPI**: https://pypi.org/project/greenmining/
- **Docker Hub**: https://hub.docker.com/r/adambouafia/greenmining
- **Documentation**: https://github.com/adam-bouafia/greenmining#readme