greenmining 1.1.8__tar.gz → 1.2.0__tar.gz

Files changed (53)
  1. greenmining-1.2.0/CHANGELOG.md +96 -0
  2. greenmining-1.2.0/PKG-INFO +311 -0
  3. greenmining-1.2.0/README.md +254 -0
  4. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/__init__.py +29 -10
  5. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/analyzers/__init__.py +0 -8
  6. greenmining-1.2.0/greenmining/controllers/repository_controller.py +156 -0
  7. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/services/local_repo_analyzer.py +1 -1
  8. greenmining-1.2.0/greenmining.egg-info/PKG-INFO +311 -0
  9. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining.egg-info/SOURCES.txt +0 -12
  10. {greenmining-1.1.8 → greenmining-1.2.0}/pyproject.toml +1 -1
  11. greenmining-1.1.8/CHANGELOG.md +0 -154
  12. greenmining-1.1.8/PKG-INFO +0 -865
  13. greenmining-1.1.8/README.md +0 -808
  14. greenmining-1.1.8/greenmining/analyzers/power_regression.py +0 -211
  15. greenmining-1.1.8/greenmining/analyzers/qualitative_analyzer.py +0 -394
  16. greenmining-1.1.8/greenmining/analyzers/version_power_analyzer.py +0 -246
  17. greenmining-1.1.8/greenmining/config.py +0 -91
  18. greenmining-1.1.8/greenmining/controllers/repository_controller.py +0 -98
  19. greenmining-1.1.8/greenmining/presenters/__init__.py +0 -7
  20. greenmining-1.1.8/greenmining/presenters/console_presenter.py +0 -143
  21. greenmining-1.1.8/greenmining.egg-info/PKG-INFO +0 -865
  22. {greenmining-1.1.8 → greenmining-1.2.0}/LICENSE +0 -0
  23. {greenmining-1.1.8 → greenmining-1.2.0}/MANIFEST.in +0 -0
  24. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/__main__.py +0 -0
  25. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/analyzers/code_diff_analyzer.py +0 -0
  26. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/analyzers/metrics_power_correlator.py +0 -0
  27. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/analyzers/statistical_analyzer.py +0 -0
  28. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/analyzers/temporal_analyzer.py +0 -0
  29. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/controllers/__init__.py +0 -0
  30. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/energy/__init__.py +0 -0
  31. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/energy/base.py +0 -0
  32. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/energy/carbon_reporter.py +0 -0
  33. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/energy/codecarbon_meter.py +0 -0
  34. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/energy/cpu_meter.py +0 -0
  35. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/energy/rapl.py +0 -0
  36. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/gsf_patterns.py +0 -0
  37. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/models/__init__.py +0 -0
  38. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/models/aggregated_stats.py +0 -0
  39. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/models/analysis_result.py +0 -0
  40. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/models/commit.py +0 -0
  41. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/models/repository.py +0 -0
  42. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/services/__init__.py +0 -0
  43. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/services/commit_extractor.py +0 -0
  44. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/services/data_aggregator.py +0 -0
  45. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/services/data_analyzer.py +0 -0
  46. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/services/github_graphql_fetcher.py +0 -0
  47. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/services/reports.py +0 -0
  48. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining/utils.py +0 -0
  49. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining.egg-info/dependency_links.txt +0 -0
  50. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining.egg-info/requires.txt +0 -0
  51. {greenmining-1.1.8 → greenmining-1.2.0}/greenmining.egg-info/top_level.txt +0 -0
  52. {greenmining-1.1.8 → greenmining-1.2.0}/setup.cfg +0 -0
  53. {greenmining-1.1.8 → greenmining-1.2.0}/setup.py +0 -0
@@ -0,0 +1,96 @@
1
+ # Changelog
2
+
3
+ ## [1.2.0] - 2026-01-31
4
+
5
+ ### Added
6
+ - `clone_repositories()` top-level function for cloning repos into `./greenmining_repos/` with sanitized directory names
7
+ - Repository name sanitization (`_sanitize_repo_name`) to prevent filesystem issues from special characters
8
+ - 2 missing official GSF patterns: "Match Utilization Requirements with Pre-configured Servers", "Optimize Impact on Customer Devices and Equipment"
9
+ - 11 new green keywords (energy proportionality, backward compatible, customer device, device lifetime, etc.)
10
+ - GSF pattern database now covers 100% of the official Green Software Foundation catalog (61/61)
11
+
12
+ ### Changed
13
+ - Repositories now clone to `./greenmining_repos/` instead of `/tmp` (fixes OS cleanup and permission issues)
14
+ - `fetch_repositories()` takes direct parameters -- no Config intermediary
15
+ - All function defaults are explicit parameters instead of config file values
16
+ - Default supported languages updated from 7 to 20 (matches experiment scope)
17
+ - Library reference documentation added to mkdocs navigation
18
+
19
+ ### Removed
20
+ - **`config.py`** module entirely (Config class, get_config singleton, .env/YAML loading layer)
21
+ - **`__version__.py`** (stale orphaned file with wrong version 1.0.5)
22
+ - **`services/github_fetcher.py`** (empty deprecated REST API stub)
23
+ - **`analyzers/power_regression.py`** (PowerRegressionDetector -- requires running repo code, not feasible in current pipeline)
24
+ - **`analyzers/version_power_analyzer.py`** (VersionPowerAnalyzer -- same reason)
25
+ - **`analyzers/qualitative_analyzer.py`** (QualitativeAnalyzer -- unused)
26
+ - **`presenters/`** module (ConsolePresenter -- never used by any code)
27
+ - **`docs/reference/config-options.md`** (obsolete config reference page)
28
+ - 10 dead utility functions (estimate_tokens, estimate_cost, print_banner, print_section, load_csv_file, handle_github_rate_limit, format_duration, truncate_text, create_checkpoint, load_checkpoint)
29
+ - 35+ unused Config attributes that were set but never read
30
+ - Dead imports across 14 files
31
+ - Dead methods: DataAnalyzer._check_green_awareness, DataAnalyzer._detect_known_pattern, CommitExtractor._extract_commit_metadata, StatisticalAnalyzer.pattern_adoption_rate_analysis, CodeCarbonMeter.get_carbon_intensity, Config.validate
32
+
33
+ ## [1.1.9] - 2026-01-31
34
+
35
+ ### Removed
36
+ - Web dashboard module (`greenmining/dashboard/`) and Flask dependency
37
+ - Dashboard documentation page and all dashboard references
38
+
39
+ ### Fixed
40
+ - ReadTheDocs experiment page not rendering (trailing whitespace in mkdocs nav)
41
+ - Plotly rendering in notebook (nbformat dependency)
42
+
43
+ ## [1.1.6] - 2026-01-31
44
+
45
+ ### Fixed
46
+ - EnergyMetrics property aliases (`energy_joules`, `average_power_watts`)
47
+ - Parallel energy measurement conflict with shared meter instance
48
+ - StatisticalAnalyzer timezone-aware date handling
49
+ - DataFrame column collision in pattern correlation analysis
50
+
51
+ ### Added
52
+ - `since_date` / `to_date` parameters for date-bounded commit analysis
53
+ - `created_before` / `pushed_after` search filters
54
+ - GraphQL API and experiment documentation pages
55
+ - Full process metrics and method-level metrics documentation
56
+
57
+ ### Changed
58
+ - Energy measurement demonstrates all 4 backends: RAPL, CPU Meter, CodeCarbon, tracemalloc
59
+ - Removed all PyDriller references (replaced with gitpython + lizard)
60
+
61
+ ### Removed
62
+ - Qualitative Validation and Carbon Footprint Reporting steps from experiment
63
+
64
+ ## [0.1.12] - 2025-12-03
65
+
66
+ ### Added
67
+ - Custom search keywords for repository fetching (`--keywords` option)
68
+ - `fetch_repositories()` function exposed in public API
69
+
70
+ ### Changed
71
+ - README updated to reflect 122 patterns (was showing 76 in PyPI description)
72
+
73
+ ## [0.1.11] - 2025-12-03
74
+
75
+ ### Added
76
+ - Expanded pattern database from 76 to 122 patterns
77
+ - Added 9 new categories
78
+ - Expanded keywords from 190 to 321
79
+ - VU Amsterdam 2024 research patterns for ML systems
80
+
81
+ ## [0.1.0] - 2025-12-02
82
+
83
+ ### Added
84
+ - Initial release
85
+ - Core functionality for GSF pattern mining
86
+ - Support for 100 microservices repositories
87
+ - Pattern matching with 76 GSF patterns
88
+ - Green awareness analysis
89
+ - Docker containerization
90
+
91
+ [1.2.0]: https://github.com/adam-bouafia/greenmining/compare/v1.1.9...v1.2.0
92
+ [1.1.9]: https://github.com/adam-bouafia/greenmining/compare/v1.1.6...v1.1.9
93
+ [1.1.6]: https://github.com/adam-bouafia/greenmining/compare/v0.1.12...v1.1.6
94
+ [0.1.12]: https://github.com/adam-bouafia/greenmining/compare/v0.1.11...v0.1.12
95
+ [0.1.11]: https://github.com/adam-bouafia/greenmining/compare/v0.1.0...v0.1.11
96
+ [0.1.0]: https://github.com/adam-bouafia/greenmining/releases/tag/v0.1.0
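The 1.2.0 changelog entry mentions `_sanitize_repo_name` but this diff does not show its body. The sketch below is one plausible way to turn an `owner/repo` name into a directory-safe string. The function name `sanitize_repo_name`, the allow-list of characters, the `max_length` parameter, and the `"repo"` fallback are all assumptions made for illustration, not the library's actual rules.

```python
import re
from pathlib import Path


def sanitize_repo_name(full_name: str, max_length: int = 100) -> str:
    """Turn 'owner/repo' into a directory-safe name (hypothetical rules)."""
    # Replace the owner/repo separator first so both parts stay visible.
    name = full_name.strip().replace("/", "_")
    # Replace anything outside a conservative allow-list with an underscore.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Collapse runs of underscores, then trim characters that are awkward
    # at the start or end of a directory name (this also defuses "..").
    name = re.sub(r"_+", "_", name).strip("._-")
    # Fall back to a placeholder if nothing usable is left.
    return name[:max_length] or "repo"


# Build a clone target under ./greenmining_repos/ (directory named in the changelog).
target = Path("greenmining_repos") / sanitize_repo_name("kubernetes/kubernetes")
print(target)
```

An allow-list is used here instead of a block-list because it stays safe across filesystems without enumerating every reserved character.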
@@ -0,0 +1,311 @@
1
+ Metadata-Version: 2.4
2
+ Name: greenmining
3
+ Version: 1.2.0
4
+ Summary: An empirical Python library for Mining Software Repositories (MSR) in Green IT research
5
+ Author-email: Adam Bouafia <a.bouafia@student.vu.nl>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/adam-bouafia/greenmining
8
+ Project-URL: Documentation, https://github.com/adam-bouafia/greenmining#readme
9
+ Project-URL: Linkedin, https://www.linkedin.com/in/adam-bouafia/
10
+ Project-URL: Repository, https://github.com/adam-bouafia/greenmining
11
+ Project-URL: Issues, https://github.com/adam-bouafia/greenmining/issues
12
+ Project-URL: Changelog, https://github.com/adam-bouafia/greenmining/blob/main/CHANGELOG.md
13
+ Keywords: green-software,gsf,msr,mining-software-repositories,green-it,sustainability,carbon-footprint,energy-efficiency,repository-analysis,github-analysis,pydriller,empirical-software-engineering
14
+ Classifier: Development Status :: 3 - Alpha
15
+ Classifier: Intended Audience :: Developers
16
+ Classifier: Intended Audience :: Science/Research
17
+ Classifier: Topic :: Software Development :: Quality Assurance
18
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
19
+ Classifier: License :: OSI Approved :: MIT License
20
+ Classifier: Programming Language :: Python :: 3
21
+ Classifier: Programming Language :: Python :: 3.9
22
+ Classifier: Programming Language :: Python :: 3.10
23
+ Classifier: Programming Language :: Python :: 3.11
24
+ Classifier: Programming Language :: Python :: 3.12
25
+ Classifier: Programming Language :: Python :: 3.13
26
+ Classifier: Operating System :: OS Independent
27
+ Requires-Python: >=3.9
28
+ Description-Content-Type: text/markdown
29
+ License-File: LICENSE
30
+ Requires-Dist: PyGithub
31
+ Requires-Dist: PyDriller
32
+ Requires-Dist: pandas
33
+ Requires-Dist: colorama
34
+ Requires-Dist: tabulate
35
+ Requires-Dist: tqdm
36
+ Requires-Dist: matplotlib
37
+ Requires-Dist: plotly
38
+ Requires-Dist: python-dotenv
39
+ Requires-Dist: requests
40
+ Provides-Extra: dev
41
+ Requires-Dist: pytest; extra == "dev"
42
+ Requires-Dist: pytest-cov; extra == "dev"
43
+ Requires-Dist: pytest-mock; extra == "dev"
44
+ Requires-Dist: black; extra == "dev"
45
+ Requires-Dist: ruff; extra == "dev"
46
+ Requires-Dist: mypy; extra == "dev"
47
+ Requires-Dist: build; extra == "dev"
48
+ Requires-Dist: twine; extra == "dev"
49
+ Provides-Extra: energy
50
+ Requires-Dist: psutil; extra == "energy"
51
+ Requires-Dist: codecarbon; extra == "energy"
52
+ Provides-Extra: docs
53
+ Requires-Dist: sphinx; extra == "docs"
54
+ Requires-Dist: sphinx-rtd-theme; extra == "docs"
55
+ Requires-Dist: myst-parser; extra == "docs"
56
+ Dynamic: license-file
57
+
58
+ # greenmining
59
+
60
+ An empirical Python library for Mining Software Repositories (MSR) in Green IT research.
61
+
62
+ [![PyPI](https://img.shields.io/pypi/v/greenmining)](https://pypi.org/project/greenmining/)
63
+ [![Python](https://img.shields.io/pypi/pyversions/greenmining)](https://pypi.org/project/greenmining/)
64
+ [![License](https://img.shields.io/github/license/adam-bouafia/greenmining)](LICENSE)
65
+ [![Documentation](https://img.shields.io/badge/docs-readthedocs-blue)](https://greenmining.readthedocs.io/)
66
+
67
+ ## Overview
68
+
69
+ `greenmining` is a research-grade Python library designed for **empirical Mining Software Repositories (MSR)** studies in **Green IT**. It enables researchers and practitioners to:
70
+
71
+ - **Mine repositories at scale** - Search, fetch, and clone GitHub repositories via GraphQL API with configurable filters
72
+ - **Classify green commits** - Detect 124 sustainability patterns from the Green Software Foundation (GSF) catalog using 332 keywords
73
+ - **Analyze any repository by URL** - Direct Git-based analysis with support for private repositories
74
+ - **Measure energy consumption** - RAPL, CodeCarbon, and CPU Energy Meter backends for power profiling
75
+ - **Carbon footprint reporting** - CO2 emissions calculation with 20+ country profiles and cloud region support
76
+ - **Method-level analysis** - Per-method complexity and metrics via Lizard integration
77
+ - **Generate research datasets** - Statistical analysis, temporal trends, and publication-ready reports
78
+
79
+ ## Installation
80
+
81
+ ### Via pip
82
+
83
+ ```bash
84
+ pip install greenmining
85
+ ```
86
+
87
+ ### With energy measurement
88
+
89
+ ```bash
90
+ pip install greenmining[energy]
91
+ ```
92
+
93
+ ### From source
94
+
95
+ ```bash
96
+ git clone https://github.com/adam-bouafia/greenmining.git
97
+ cd greenmining
98
+ pip install -e .
99
+ ```
100
+
101
+ ## Quick Start
102
+
103
+ ### Pattern Detection
104
+
105
+ ```python
106
+ from greenmining import GSF_PATTERNS, is_green_aware, get_pattern_by_keywords
107
+
108
+ print(f"Total patterns: {len(GSF_PATTERNS)}") # 124 patterns across 15 categories
109
+
110
+ commit_msg = "Optimize Redis caching to reduce energy consumption"
111
+ if is_green_aware(commit_msg):
112
+     patterns = get_pattern_by_keywords(commit_msg)
113
+     print(f"Matched patterns: {patterns}")
114
+ ```
115
+
116
+ ### Fetch Repositories
117
+
118
+ ```python
119
+ from greenmining import fetch_repositories
120
+
121
+ repos = fetch_repositories(
122
+     github_token="your_token",
123
+     max_repos=50,
124
+     min_stars=500,
125
+     keywords="kubernetes cloud-native",
126
+     languages=["Python", "Go"],
127
+     created_after="2020-01-01",
128
+     pushed_after="2023-01-01",
129
+ )
130
+
131
+ for repo in repos[:5]:
132
+     print(f"- {repo.full_name} ({repo.stars} stars)")
133
+ ```
134
+
135
+ ### Clone Repositories
136
+
137
+ ```python
138
+ from greenmining import fetch_repositories, clone_repositories
139
+
140
+ repos = fetch_repositories(github_token="your_token", max_repos=10, keywords="android")
141
+
142
+ # Clone into ./greenmining_repos/ with sanitized directory names
143
+ paths = clone_repositories(repos)
144
+ print(f"Cloned {len(paths)} repositories")
145
+ ```
146
+
147
+ ### Analyze Repositories by URL
148
+
149
+ ```python
150
+ from greenmining import analyze_repositories
151
+
152
+ results = analyze_repositories(
153
+     urls=[
154
+         "https://github.com/kubernetes/kubernetes",
155
+         "https://github.com/istio/istio",
156
+     ],
157
+     max_commits=100,
158
+     parallel_workers=2,
159
+     energy_tracking=True,
160
+     energy_backend="auto",
161
+     method_level_analysis=True,
162
+     include_source_code=True,
163
+     github_token="your_token",
164
+     since_date="2020-01-01",
165
+     to_date="2025-12-31",
166
+ )
167
+
168
+ for result in results:
169
+     print(f"{result.name}: {result.green_commit_rate:.1%} green")
170
+ ```
171
+
172
+ ### Access Pattern Data
173
+
174
+ ```python
175
+ from greenmining import GSF_PATTERNS
176
+
177
+ # Get patterns by category
178
+ cloud = {k: v for k, v in GSF_PATTERNS.items() if v['category'] == 'cloud'}
179
+ print(f"Cloud patterns: {len(cloud)}")
180
+
181
+ # All categories
182
+ categories = set(p['category'] for p in GSF_PATTERNS.values())
183
+ print(f"Categories: {sorted(categories)}")
184
+ ```
185
+
186
+ ### Energy Measurement
187
+
188
+ ```python
189
+ from greenmining.energy import get_energy_meter, CPUEnergyMeter
190
+
191
+ # Auto-detect best backend
192
+ meter = get_energy_meter("auto")
193
+ meter.start()
194
+ # ... your workload ...
195
+ result = meter.stop()
196
+ print(f"Energy: {result.joules:.2f} J, Power: {result.watts_avg:.2f} W")
197
+ ```
198
+
199
+ ### Statistical Analysis
200
+
201
+ ```python
202
+ from greenmining.analyzers import StatisticalAnalyzer, TemporalAnalyzer
203
+
204
+ stat = StatisticalAnalyzer()
205
+ temporal = TemporalAnalyzer(granularity="quarter")
206
+
207
+ # Pattern correlations, effect sizes, temporal trends
208
+ # See experiment notebook for full usage
209
+ ```
210
+
211
+ ### Metrics-to-Power Correlation
212
+
213
+ ```python
214
+ from greenmining.analyzers import MetricsPowerCorrelator
215
+
216
+ correlator = MetricsPowerCorrelator()
217
+ correlator.fit(
218
+     metrics=["complexity", "nloc", "code_churn"],
219
+     metrics_values={
220
+         "complexity": [10, 20, 30, 40],
221
+         "nloc": [100, 200, 300, 400],
222
+         "code_churn": [50, 100, 150, 200],
223
+     },
224
+     power_measurements=[5.0, 8.0, 12.0, 15.0],
225
+ )
226
+ print(f"Feature importance: {correlator.feature_importance}")
227
+ ```
228
+
229
+ ## Features
230
+
231
+ ### Core Capabilities
232
+
233
+ - **Pattern Detection**: 124 sustainability patterns across 15 categories from the GSF catalog
234
+ - **Keyword Analysis**: 332 green software detection keywords
235
+ - **Repository Fetching**: GraphQL API with date, star, and language filters
236
+ - **Repository Cloning**: Sanitized directory names in `./greenmining_repos/`
237
+ - **URL-Based Analysis**: Direct Git-based analysis from GitHub URLs (HTTPS and SSH)
238
+ - **Batch Processing**: Parallel analysis of multiple repositories
239
+ - **Private Repository Support**: Authentication via SSH keys or GitHub tokens
240
+
241
+ ### Analysis & Measurement
242
+
243
+ - **Energy Measurement**: RAPL, CodeCarbon, and CPU Energy Meter backends
244
+ - **Carbon Footprint Reporting**: CO2 emissions with 20+ country profiles (AWS, GCP, Azure)
245
+ - **Metrics-to-Power Correlation**: Pearson and Spearman analysis between code metrics and power
246
+ - **Method-Level Analysis**: Per-method complexity metrics via Lizard integration
247
+ - **Source Code Access**: Before/after source code for refactoring detection
248
+ - **Process Metrics**: DMM size, complexity, interfacing via PyDriller
249
+ - **Statistical Analysis**: Correlations, effect sizes, and temporal trends
250
+ - **Multi-format Output**: JSON, CSV, pandas DataFrame
251
+
252
+ ### Energy Backends
253
+
254
+ | Backend | Platform | Metrics | Requirements |
255
+ |---------|----------|---------|--------------|
256
+ | **RAPL** | Linux (Intel/AMD) | CPU/RAM energy (Joules) | `/sys/class/powercap/` access |
257
+ | **CodeCarbon** | Cross-platform | Energy + Carbon emissions (gCO2) | `pip install codecarbon` |
258
+ | **CPU Meter** | All platforms | Estimated CPU energy (Joules) | Optional: `pip install psutil` |
259
+ | **Auto** | All platforms | Best available backend | Automatic detection |
260
+
261
+ ### GSF Pattern Categories
262
+
263
+ **124 patterns across 15 categories:**
264
+
265
+ | Category | Patterns | Examples |
266
+ |----------|----------|----------|
267
+ | Cloud | 42 | Auto-scaling, serverless, right-sizing, region selection |
268
+ | Web | 17 | CDN, caching, lazy loading, compression |
269
+ | AI/ML | 19 | Model pruning, quantization, edge inference |
270
+ | Database | 5 | Indexing, query optimization, connection pooling |
271
+ | Networking | 8 | Protocol optimization, HTTP/2, gRPC |
272
+ | Network | 6 | Request batching, GraphQL, circuit breakers |
273
+ | Microservices | 4 | Service decomposition, graceful shutdown |
274
+ | Infrastructure | 4 | Alpine containers, IaC, renewable regions |
275
+ | General | 8 | Feature flags, precomputation, background jobs |
276
+ | Others | 11 | Caching, resource, data, async, code, monitoring |
277
+
278
+ ## Development
279
+
280
+ ```bash
281
+ git clone https://github.com/adam-bouafia/greenmining.git
282
+ cd greenmining
283
+ pip install -e ".[dev]"
284
+
285
+ pytest tests/
286
+ black greenmining/ tests/
287
+ ruff check greenmining/ tests/
288
+ ```
289
+
290
+ ## Requirements
291
+
292
+ - Python 3.9+
293
+ - PyGithub, PyDriller, pandas, colorama, tqdm
294
+
295
+ **Optional:**
296
+
297
+ ```bash
298
+ pip install greenmining[energy] # psutil, codecarbon
299
+ pip install greenmining[dev] # pytest, black, ruff, mypy
300
+ ```
301
+
302
+ ## License
303
+
304
+ MIT License - See [LICENSE](LICENSE) for details.
305
+
306
+ ## Links
307
+
308
+ - **GitHub**: https://github.com/adam-bouafia/greenmining
309
+ - **PyPI**: https://pypi.org/project/greenmining/
310
+ - **Documentation**: https://greenmining.readthedocs.io/
311
+ - **Docker Hub**: https://hub.docker.com/r/adambouafia/greenmining
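The embedded README lists "Pearson and Spearman analysis between code metrics and power" without showing what those coefficients compute. The sketch below implements both in plain Python and applies them to the `complexity` and `power_measurements` sample values from the README's `MetricsPowerCorrelator` example. The helper names `pearson`, `average_ranks`, and `spearman` are hypothetical and are not part of greenmining; how the library computes these internally is not shown in this diff.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either series is constant.
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


complexity = [10, 20, 30, 40]
power = [5.0, 8.0, 12.0, 15.0]
print(f"Pearson:  {pearson(complexity, power):.4f}")
print(f"Spearman: {spearman(complexity, power):.4f}")
```

On this sample the two coefficients differ slightly: power rises monotonically with complexity, so the rank-based Spearman value is exactly 1, while Pearson is a little lower because the increase is not perfectly linear.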
@@ -0,0 +1,254 @@
1
+ # greenmining
2
+
3
+ An empirical Python library for Mining Software Repositories (MSR) in Green IT research.
4
+
5
+ [![PyPI](https://img.shields.io/pypi/v/greenmining)](https://pypi.org/project/greenmining/)
6
+ [![Python](https://img.shields.io/pypi/pyversions/greenmining)](https://pypi.org/project/greenmining/)
7
+ [![License](https://img.shields.io/github/license/adam-bouafia/greenmining)](LICENSE)
8
+ [![Documentation](https://img.shields.io/badge/docs-readthedocs-blue)](https://greenmining.readthedocs.io/)
9
+
10
+ ## Overview
11
+
12
+ `greenmining` is a research-grade Python library designed for **empirical Mining Software Repositories (MSR)** studies in **Green IT**. It enables researchers and practitioners to:
13
+
14
+ - **Mine repositories at scale** - Search, fetch, and clone GitHub repositories via GraphQL API with configurable filters
15
+ - **Classify green commits** - Detect 124 sustainability patterns from the Green Software Foundation (GSF) catalog using 332 keywords
16
+ - **Analyze any repository by URL** - Direct Git-based analysis with support for private repositories
17
+ - **Measure energy consumption** - RAPL, CodeCarbon, and CPU Energy Meter backends for power profiling
18
+ - **Carbon footprint reporting** - CO2 emissions calculation with 20+ country profiles and cloud region support
19
+ - **Method-level analysis** - Per-method complexity and metrics via Lizard integration
20
+ - **Generate research datasets** - Statistical analysis, temporal trends, and publication-ready reports
21
+
22
+ ## Installation
23
+
24
+ ### Via pip
25
+
26
+ ```bash
27
+ pip install greenmining
28
+ ```
29
+
30
+ ### With energy measurement
31
+
32
+ ```bash
33
+ pip install greenmining[energy]
34
+ ```
35
+
36
+ ### From source
37
+
38
+ ```bash
39
+ git clone https://github.com/adam-bouafia/greenmining.git
40
+ cd greenmining
41
+ pip install -e .
42
+ ```
43
+
44
+ ## Quick Start
45
+
46
+ ### Pattern Detection
47
+
48
+ ```python
49
+ from greenmining import GSF_PATTERNS, is_green_aware, get_pattern_by_keywords
50
+
51
+ print(f"Total patterns: {len(GSF_PATTERNS)}") # 124 patterns across 15 categories
52
+
53
+ commit_msg = "Optimize Redis caching to reduce energy consumption"
54
+ if is_green_aware(commit_msg):
55
+     patterns = get_pattern_by_keywords(commit_msg)
56
+     print(f"Matched patterns: {patterns}")
57
+ ```
58
+
59
+ ### Fetch Repositories
60
+
61
+ ```python
62
+ from greenmining import fetch_repositories
63
+
64
+ repos = fetch_repositories(
65
+     github_token="your_token",
66
+     max_repos=50,
67
+     min_stars=500,
68
+     keywords="kubernetes cloud-native",
69
+     languages=["Python", "Go"],
70
+     created_after="2020-01-01",
71
+     pushed_after="2023-01-01",
72
+ )
73
+
74
+ for repo in repos[:5]:
75
+     print(f"- {repo.full_name} ({repo.stars} stars)")
76
+ ```
77
+
78
+ ### Clone Repositories
79
+
80
+ ```python
81
+ from greenmining import fetch_repositories, clone_repositories
82
+
83
+ repos = fetch_repositories(github_token="your_token", max_repos=10, keywords="android")
84
+
85
+ # Clone into ./greenmining_repos/ with sanitized directory names
86
+ paths = clone_repositories(repos)
87
+ print(f"Cloned {len(paths)} repositories")
88
+ ```
89
+
90
+ ### Analyze Repositories by URL
91
+
92
+ ```python
93
+ from greenmining import analyze_repositories
94
+
95
+ results = analyze_repositories(
96
+     urls=[
97
+         "https://github.com/kubernetes/kubernetes",
98
+         "https://github.com/istio/istio",
99
+     ],
100
+     max_commits=100,
101
+     parallel_workers=2,
102
+     energy_tracking=True,
103
+     energy_backend="auto",
104
+     method_level_analysis=True,
105
+     include_source_code=True,
106
+     github_token="your_token",
107
+     since_date="2020-01-01",
108
+     to_date="2025-12-31",
109
+ )
110
+
111
+ for result in results:
112
+     print(f"{result.name}: {result.green_commit_rate:.1%} green")
113
+ ```
114
+
115
+ ### Access Pattern Data
116
+
117
+ ```python
118
+ from greenmining import GSF_PATTERNS
119
+
120
+ # Get patterns by category
121
+ cloud = {k: v for k, v in GSF_PATTERNS.items() if v['category'] == 'cloud'}
122
+ print(f"Cloud patterns: {len(cloud)}")
123
+
124
+ # All categories
125
+ categories = set(p['category'] for p in GSF_PATTERNS.values())
126
+ print(f"Categories: {sorted(categories)}")
127
+ ```
128
+
129
+ ### Energy Measurement
130
+
131
+ ```python
132
+ from greenmining.energy import get_energy_meter, CPUEnergyMeter
133
+
134
+ # Auto-detect best backend
135
+ meter = get_energy_meter("auto")
136
+ meter.start()
137
+ # ... your workload ...
138
+ result = meter.stop()
139
+ print(f"Energy: {result.joules:.2f} J, Power: {result.watts_avg:.2f} W")
140
+ ```
141
+
142
+ ### Statistical Analysis
143
+
144
+ ```python
145
+ from greenmining.analyzers import StatisticalAnalyzer, TemporalAnalyzer
146
+
147
+ stat = StatisticalAnalyzer()
148
+ temporal = TemporalAnalyzer(granularity="quarter")
149
+
150
+ # Pattern correlations, effect sizes, temporal trends
151
+ # See experiment notebook for full usage
152
+ ```
153
+
154
+ ### Metrics-to-Power Correlation
155
+
156
+ ```python
157
+ from greenmining.analyzers import MetricsPowerCorrelator
158
+
159
+ correlator = MetricsPowerCorrelator()
160
+ correlator.fit(
161
+     metrics=["complexity", "nloc", "code_churn"],
162
+     metrics_values={
163
+         "complexity": [10, 20, 30, 40],
164
+         "nloc": [100, 200, 300, 400],
165
+         "code_churn": [50, 100, 150, 200],
166
+     },
167
+     power_measurements=[5.0, 8.0, 12.0, 15.0],
168
+ )
169
+ print(f"Feature importance: {correlator.feature_importance}")
170
+ ```
171
+
172
+ ## Features
173
+
174
+ ### Core Capabilities
175
+
176
+ - **Pattern Detection**: 124 sustainability patterns across 15 categories from the GSF catalog
177
+ - **Keyword Analysis**: 332 green software detection keywords
178
+ - **Repository Fetching**: GraphQL API with date, star, and language filters
179
+ - **Repository Cloning**: Sanitized directory names in `./greenmining_repos/`
180
+ - **URL-Based Analysis**: Direct Git-based analysis from GitHub URLs (HTTPS and SSH)
181
+ - **Batch Processing**: Parallel analysis of multiple repositories
182
+ - **Private Repository Support**: Authentication via SSH keys or GitHub tokens
183
+
184
+ ### Analysis & Measurement
185
+
186
+ - **Energy Measurement**: RAPL, CodeCarbon, and CPU Energy Meter backends
187
+ - **Carbon Footprint Reporting**: CO2 emissions with 20+ country profiles (AWS, GCP, Azure)
188
+ - **Metrics-to-Power Correlation**: Pearson and Spearman analysis between code metrics and power
189
+ - **Method-Level Analysis**: Per-method complexity metrics via Lizard integration
190
+ - **Source Code Access**: Before/after source code for refactoring detection
191
+ - **Process Metrics**: DMM size, complexity, interfacing via PyDriller
192
+ - **Statistical Analysis**: Correlations, effect sizes, and temporal trends
193
+ - **Multi-format Output**: JSON, CSV, pandas DataFrame
194
+
195
+ ### Energy Backends
196
+
197
+ | Backend | Platform | Metrics | Requirements |
198
+ |---------|----------|---------|--------------|
199
+ | **RAPL** | Linux (Intel/AMD) | CPU/RAM energy (Joules) | `/sys/class/powercap/` access |
200
+ | **CodeCarbon** | Cross-platform | Energy + Carbon emissions (gCO2) | `pip install codecarbon` |
201
+ | **CPU Meter** | All platforms | Estimated CPU energy (Joules) | Optional: `pip install psutil` |
202
+ | **Auto** | All platforms | Best available backend | Automatic detection |
203
+
204
+ ### GSF Pattern Categories
205
+
206
+ **124 patterns across 15 categories:**
207
+
208
+ | Category | Patterns | Examples |
209
+ |----------|----------|----------|
210
+ | Cloud | 42 | Auto-scaling, serverless, right-sizing, region selection |
211
+ | Web | 17 | CDN, caching, lazy loading, compression |
212
+ | AI/ML | 19 | Model pruning, quantization, edge inference |
213
+ | Database | 5 | Indexing, query optimization, connection pooling |
214
+ | Networking | 8 | Protocol optimization, HTTP/2, gRPC |
215
+ | Network | 6 | Request batching, GraphQL, circuit breakers |
216
+ | Microservices | 4 | Service decomposition, graceful shutdown |
217
+ | Infrastructure | 4 | Alpine containers, IaC, renewable regions |
218
+ | General | 8 | Feature flags, precomputation, background jobs |
219
+ | Others | 11 | Caching, resource, data, async, code, monitoring |
220
+
221
+ ## Development
222
+
223
+ ```bash
224
+ git clone https://github.com/adam-bouafia/greenmining.git
225
+ cd greenmining
226
+ pip install -e ".[dev]"
227
+
228
+ pytest tests/
229
+ black greenmining/ tests/
230
+ ruff check greenmining/ tests/
231
+ ```
232
+
233
+ ## Requirements
234
+
235
+ - Python 3.9+
236
+ - PyGithub, PyDriller, pandas, colorama, tqdm
237
+
238
+ **Optional:**
239
+
240
+ ```bash
241
+ pip install greenmining[energy] # psutil, codecarbon
242
+ pip install greenmining[dev] # pytest, black, ruff, mypy
243
+ ```
244
+
245
+ ## License
246
+
247
+ MIT License - See [LICENSE](LICENSE) for details.
248
+
249
+ ## Links
250
+
251
+ - **GitHub**: https://github.com/adam-bouafia/greenmining
252
+ - **PyPI**: https://pypi.org/project/greenmining/
253
+ - **Documentation**: https://greenmining.readthedocs.io/
254
+ - **Docker Hub**: https://hub.docker.com/r/adambouafia/greenmining
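The README above describes classifying commits by matching keywords against a pattern catalog, but the matching logic itself is not part of this diff. The sketch below shows one simple way such keyword-based classification could work. The three-entry `EXAMPLE_PATTERNS` dictionary, its `keywords` field, and the functions `match_patterns` and `is_green` are hypothetical stand-ins; they are not greenmining's `GSF_PATTERNS`, `get_pattern_by_keywords`, or `is_green_aware`. Only the `category` field is taken from the README's own example.

```python
import re

# Hypothetical mini-catalog; the real GSF_PATTERNS data is not shown in this diff.
EXAMPLE_PATTERNS = {
    "cache_static_data": {"category": "cloud", "keywords": ["cache", "caching"]},
    "energy_efficiency": {"category": "general", "keywords": ["energy consumption"]},
    "lazy_loading": {"category": "web", "keywords": ["lazy load", "lazy loading"]},
}


def match_patterns(message, patterns=EXAMPLE_PATTERNS):
    """Return the sorted IDs of patterns with a keyword in the message."""
    text = message.lower()
    matched = []
    for pattern_id, pattern in patterns.items():
        # Word boundaries avoid hits inside longer words (e.g. "cache" in "cachet").
        if any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in pattern["keywords"]):
            matched.append(pattern_id)
    return sorted(matched)


def is_green(message):
    """A commit counts as green-aware here if at least one pattern matches."""
    return bool(match_patterns(message))


commit_msg = "Optimize Redis caching to reduce energy consumption"
print(match_patterns(commit_msg))
```

Whole-word matching trades recall for precision: inflected forms such as "cached" are missed unless they are listed as keywords, which is one reason a keyword list tends to grow much larger than the pattern list it serves.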