coreLearn 0.1.0__tar.gz

@@ -0,0 +1,482 @@
1
+ Metadata-Version: 2.4
2
+ Name: coreLearn
3
+ Version: 0.1.0
4
+ Summary: Basic ML algorithms library built from scratch (KNN + Linear Regression)
5
+ Requires-Python: >=3.9
6
+ Description-Content-Type: text/markdown
7
+ Requires-Dist: numpy>=1.21
8
+ Provides-Extra: dev
9
+ Requires-Dist: pytest>=7.0; extra == "dev"
10
+ Requires-Dist: scikit-learn; extra == "dev"
11
+ Requires-Dist: jupyter; extra == "dev"
12
+
13
+ # CoreLearn
14
+
15
+ A lightweight Python machine learning library built from scratch using only **NumPy**.
16
+ Implements KNN classification and Linear Regression with a focus on **software design**, not just accuracy.
17
+
18
+ ---
19
+
20
+ ## Installation
21
+
22
+ ```bash
23
+ # Clone or download the project, then from the coreLearn/ directory:
24
+ pip install -e .
25
+
26
+ # Install all dependencies (including dev tools):
27
+ pip install -r requirements.txt
28
+ ```
29
+
30
+ After installation, import from anywhere:
31
+
32
+ ```python
33
+ from coreLearn import KNNClassifier, LinearRegression, Evaluator
34
+ ```
35
+
36
+ ---
37
+
38
+ ## Quick Start
39
+
40
+ ```python
41
+ from coreLearn import KNNClassifier, LinearRegression, Evaluator, accuracy, mae
42
+
43
+ # --- KNN Classification ---
44
+ knn = KNNClassifier(k=5, distance="euclidean", n_jobs=2)
45
+ knn.fit(X_train, y_train)
46
+ knn_preds = knn.predict(X_test)
47
+ print(accuracy(y_test, knn_preds))
48
+
49
+ # --- Linear Regression ---
50
+ lr = LinearRegression(strategy="normal")
51
+ lr.fit(X_train, y_train)
52
+ lr_preds = lr.predict(X_test)
53
+ print(mae(y_test, lr_preds))
54
+
55
+ # --- Evaluator ---
56
+ print(Evaluator.evaluate_regression(y_test, lr_preds))
57
+ # {'mae': ..., 'mse': ..., 'rmse': ...}
58
+
59
+ print(Evaluator.evaluate_classification(y_test, knn_preds))
60
+ # {'accuracy': ..., 'precision': ..., 'recall': ..., 'f1': ...}
61
+ ```
62
+
63
+ ---
64
+
65
+ ## Package Structure
66
+
67
+ ```
68
+ coreLearn/
69
+ ├── __init__.py ← Public API
70
+ ├── base.py ← Abstract base class — Template Method Pattern
71
+ ├── distances.py ← Distance metrics — Factory Pattern
72
+ ├── knn.py ← KNN Classifier — Recursion + Concurrency + OOP
73
+ ├── linear_regression.py ← Linear Regression — Strategy Pattern + OOP
74
+ ├── evaluator.py ← Metric engine — Functional Programming
75
+ ├── examples/
76
+ │   ├── demo_notebook.ipynb
77
+ │   ├── housing.csv
78
+ │   └── penguin.csv
79
+ └── tests/
80
+     ├── test_knn.py
81
+     ├── test_linear_regression.py
82
+     ├── test_distances.py
83
+     └── test_evaluator.py
84
+ ```
85
+
86
+ ---
87
+
88
+ ## Running Tests
89
+
90
+ ```bash
91
+ cd coreLearn/
92
+ pytest coreLearn/tests/ -v
93
+ ```
94
+
95
+ ---
96
+
97
+ ## Learning Outcomes
98
+
99
+ ### 1 — Object-Oriented Programming (OOP)
100
+
101
+ **Files:** `base.py`, `knn.py`, `linear_regression.py`, `distances.py`
102
+
103
+ #### Abstract Base Class & Inheritance
104
+
105
+ `BaseModel` is an abstract class that defines the contract every model must follow.
106
+ `KNNClassifier` and `LinearRegression` both inherit from it:
107
+
108
+ ```python
109
+ # base.py
110
+ class BaseModel(ABC):
111
+     @abstractmethod
112
+     def fit(self, X, y) -> "BaseModel": ...
113
+
114
+     @abstractmethod
115
+     def predict(self, X) -> list: ...
116
+
117
+ # knn.py
118
+ class KNNClassifier(BaseModel):   # ← inheritance
119
+     def fit(self, X, y): ...
120
+     def predict(self, X): ...
121
+
122
+ # linear_regression.py
123
+ class LinearRegression(BaseModel):   # ← inheritance
124
+     def fit(self, X, y): ...
125
+     def predict(self, X): ...
126
+ ```
127
+
128
+ #### Polymorphism
129
+
130
+ Both models expose the same `fit`/`predict` interface, so the same calling code can drive either one:
131
+
132
+ ```python
133
+ for model in [KNNClassifier(k=3), LinearRegression()]:
134
+     model.fit(X_train, y_train)   # same call, different behaviour
135
+     model.predict(X_test)         # same call, different behaviour
136
+ ```
137
+
138
+ #### Encapsulation
139
+
140
+ Internal state is marked private with a leading `_` (a Python convention, not an enforced restriction). Users interact only through the public API:
141
+
142
+ ```python
143
+ # knn.py
144
+ self._metric = DistanceMetricFactory.create(distance) # private
145
+ self._tree = None # private
146
+
147
+ # linear_regression.py — controlled read access via properties
148
+ @property
149
+ def coef_(self) -> np.ndarray:
150
+     return self._weights[1:]
151
+
152
+ @property
153
+ def intercept_(self) -> float:
154
+     return float(self._weights[0])
155
+ ```
156
+
157
+ `OptimizationStrategy`, `NormalEquationStrategy`, and `GradientDescentStrategy` inside
158
+ `linear_regression.py` form an additional hierarchy demonstrating inheritance within the library.
159
+
160
+ ---
161
+
162
+ ### 2 — Functional Programming
163
+
164
+ **File:** `evaluator.py`
165
+
166
+ #### Functions as First-Class Objects
167
+
168
+ Metric functions are stored in dictionaries as values and called dynamically:
169
+
170
+ ```python
171
+ # evaluator.py
172
+ _regression_metrics: dict[str, Callable] = {  # typing.Callable, not the builtin callable()
173
+     "mae": mae,
174
+     "mse": mse,
175
+     "rmse": rmse,
176
+ }
177
+
178
+ @classmethod
179
+ def evaluate_regression(cls, y_true, y_pred) -> dict:
180
+     # applies every registered function; no if/elif chain
181
+     return {name: fn(y_true, y_pred) for name, fn in cls._regression_metrics.items()}
182
+ ```
183
+
184
+ #### Higher-Order Function — `register()`
185
+
186
+ `Evaluator.register()` accepts any callable and plugs it in at runtime.
187
+ This is the classic higher-order function pattern: a function (or method) that takes another function as an argument.
188
+
189
+ ```python
190
+ # Add a custom metric without modifying the Evaluator class
191
+ Evaluator.register(
192
+     "max_error",
193
+     lambda y_true, y_pred: max(abs(a - b) for a, b in zip(y_true, y_pred)),
194
+     kind="regression",
195
+ )
196
+ result = Evaluator.evaluate_regression(y_test, y_pred)
197
+ print(result["max_error"]) # available immediately
198
+ ```
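The registry idea behind `register()` can be sketched in isolation. The block below is a minimal stand-in, not the library's code: `MiniEvaluator`, its `_metrics` layout, and the `evaluate` method are hypothetical names chosen for illustration, and the validation in `register` is an assumption about what a careful implementation might do.

```python
from typing import Callable, Dict, Sequence

Metric = Callable[[Sequence[float], Sequence[float]], float]


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error; a pure function with no side effects."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


class MiniEvaluator:
    """Hypothetical stand-in for Evaluator: one metric registry per kind."""

    _metrics: Dict[str, Dict[str, Metric]] = {
        "regression": {"mae": mae},
        "classification": {},
    }

    @classmethod
    def register(cls, name: str, fn: Metric, kind: str = "regression") -> None:
        # Higher-order step: the metric arrives as an argument and is stored as data.
        if kind not in cls._metrics:
            raise ValueError(f"unknown kind: {kind!r}")
        if not callable(fn):
            raise TypeError("fn must be callable")
        cls._metrics[kind][name] = fn

    @classmethod
    def evaluate(cls, kind: str, y_true, y_pred) -> Dict[str, float]:
        # Apply every registered function without an if/elif chain.
        return {name: fn(y_true, y_pred) for name, fn in cls._metrics[kind].items()}


MiniEvaluator.register(
    "max_error",
    lambda y_true, y_pred: max(abs(t - p) for t, p in zip(y_true, y_pred)),
)
scores = MiniEvaluator.evaluate("regression", [1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
print(scores)  # {'mae': 0.5, 'max_error': 1.0}
```

Because the registry is a class attribute, a registered metric is visible to every later call in the same process, which is worth keeping in mind when tests register metrics.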
199
+
200
+ #### Pure Functions
201
+
202
+ `mae`, `mse`, `rmse`, `accuracy`, `precision`, `recall`, `f1_score` are all pure functions:
203
+ - No side effects
204
+ - No mutation of inputs
205
+ - Same inputs always produce the same output
206
+
207
+ ```python
208
+ from coreLearn import mae, accuracy
209
+ mae([1.0, 2.0, 3.0], [1.5, 2.5, 3.5]) # → 0.5 (always)
210
+ accuracy([0, 1, 1], [0, 1, 0]) # → 0.6667 (2 of 3 correct, always)
211
+ ```
212
+
213
+ ---
214
+
215
+ ### 3 — Concurrency
216
+
217
+ **File:** `knn.py` — `KNNClassifier.predict()`
218
+
219
+ `KNNClassifier` uses `ProcessPoolExecutor` to classify test samples in parallel across
220
+ multiple CPU processes. Unlike threads, each worker runs in its own process with its
221
+ own GIL — enabling true CPU-bound parallelism.
222
+
223
+ ```python
224
+ # knn.py
225
+ def predict(self, X) -> list:
226
+     ...
227
+     if self.n_jobs == 1:
228
+         # sequential: no overhead for small datasets
229
+         return [self._predict_one(x) for x in samples]
230
+
231
+     # parallel: distribute samples across n_jobs worker processes
232
+     args = [(self._tree, x, self.k, self._metric) for x in samples]
233
+     with ProcessPoolExecutor(max_workers=self.n_jobs) as executor:
234
+         return list(executor.map(_predict_worker, args))
235
+ ```
236
+
237
+ **Why no race conditions?**
238
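+ Each task carries its own pickled copy of the KD-Tree and metric to a worker process, so workers share no mutable state
239
+ and no synchronization primitives are needed. The trade-off is that the tree is serialized once per sample, which can outweigh the speed-up on small test sets.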
240
+
241
+ ```python
242
+ # n_jobs=1 → sequential (default, safe for notebooks)
243
+ knn = KNNClassifier(k=5, n_jobs=1)
244
+
245
+ # n_jobs=4 → 4 parallel worker processes
246
+ knn = KNNClassifier(k=5, n_jobs=4)
247
+ knn.fit(X_train, y_train)
248
+ preds = knn.predict(X_test)
249
+ ```
250
+
251
+ > **Note:** `ProcessPoolExecutor` requires the `if __name__ == "__main__":` guard on
252
+ > Windows/macOS when used in scripts. The `n_jobs=1` default is safe everywhere.
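The guard mentioned in the note can be illustrated with a small script. This is a sketch under assumptions: `_square_worker` and `parallel_map` are hypothetical names, and the toy workload stands in for the per-sample prediction that the library performs.

```python
from concurrent.futures import ProcessPoolExecutor


def _square_worker(args):
    """Module-level worker so it can be pickled and sent to child processes."""
    offset, value = args
    return offset + value * value


def parallel_map(values, offset=0, n_jobs=1):
    """Map _square_worker over values, sequentially or across processes."""
    tasks = [(offset, v) for v in values]
    if n_jobs == 1:
        # Sequential path: no process start-up or pickling cost.
        return [_square_worker(t) for t in tasks]
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(_square_worker, tasks))


if __name__ == "__main__":
    # Under the "spawn" start method (the default on Windows and macOS),
    # child processes re-import this module; the guard keeps them from
    # re-running the pool creation below.
    print(parallel_map([1, 2, 3], offset=10, n_jobs=2))  # [11, 14, 19]
```

The worker is defined at module level on purpose: lambdas and nested functions cannot be pickled by the standard library, so they cannot be submitted to a process pool.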
253
+
254
+ ---
255
+
256
+ ### 4 — Recursion
257
+
258
+ **File:** `knn.py` — `KDTree`
259
+
260
+ The KD-Tree data structure is built and searched using **direct recursion**.
261
+ Both `_build` and `_search` call themselves with a strictly smaller subproblem each time.
262
+
263
+ #### `_build` — Recursive Tree Construction
264
+
265
+ **Base case:** empty data → return `None`.
266
+ **Recursive case:** split on the median, call `_build` on each half with `depth + 1`.
267
+
268
+ ```python
269
+ # knn.py
270
+ def _build(self, data: list, depth: int):
271
+     if not data:                                          # ← base case
272
+         return None
273
+     axis = depth % len(data[0][0])
274
+     data.sort(key=lambda item: item[0][axis])
275
+     mid = len(data) // 2
276
+     return KDNode(
277
+         point=data[mid][0],
278
+         label=data[mid][1],
279
+         left=self._build(data[:mid], depth + 1),          # ← recursion
280
+         right=self._build(data[mid + 1:], depth + 1),     # ← recursion
281
+     )
282
+ ```
283
+
284
+ #### `_search` — Recursive Nearest-Neighbour Search
285
+
286
+ **Base case:** node is `None` → return.
287
+ **Recursive case:** visit the near branch, then prune and optionally visit the far branch.
288
+
289
+ ```python
290
+ # knn.py
291
+ def _search(self, node, target, k, metric, depth, best):
292
+     if node is None:                                      # ← base case
293
+         return
294
+     dist = metric(target, node.point)
295
+     # update best list ...
296
+     self._search(near, target, k, metric, depth + 1, best)    # ← recursion
297
+     if len(best) < k or abs(diff) < best[-1][0]:
298
+         self._search(far, target, k, metric, depth + 1, best)  # ← recursion (pruned)
299
+ ```
300
+
301
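+ **Pruning:** the `abs(diff) < best[-1][0]` condition (with `diff` taken to be the target's offset from the node along the split axis) skips the far branch when it cannot
302
+ contain a closer neighbour, giving roughly O(log n) average search time on balanced trees in low dimensions. Pruning becomes less effective as dimensionality grows.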
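Since the excerpt above elides how `near`, `far`, and `diff` are derived, here is a self-contained sketch of the same build-and-search recursion. It is an assumption about the usual KD-tree formulation, not the library's code: it uses free functions `build` and `search`, hard-codes Euclidean distance via `math.dist`, and keeps `best` as a sorted list of `(distance, label)` pairs.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


class KDNode:
    def __init__(self, point, label, left=None, right=None):
        self.point, self.label = point, label
        self.left, self.right = left, right


def build(data: List[Tuple[Point, object]], depth: int = 0) -> Optional[KDNode]:
    if not data:  # base case
        return None
    axis = depth % len(data[0][0])
    data = sorted(data, key=lambda item: item[0][axis])  # copy; caller's list untouched
    mid = len(data) // 2
    return KDNode(data[mid][0], data[mid][1],
                  build(data[:mid], depth + 1),
                  build(data[mid + 1:], depth + 1))


def search(node, target, k, depth=0, best=None):
    """Return up to k (distance, label) pairs, nearest first."""
    if best is None:
        best = []
    if node is None:  # base case
        return best
    best.append((math.dist(target, node.point), node.label))
    best.sort(key=lambda pair: pair[0])
    del best[k:]  # keep only the k closest seen so far
    axis = depth % len(target)
    diff = target[axis] - node.point[axis]  # signed offset from the splitting plane
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    search(near, target, k, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the
    # current k-th best distance, or if fewer than k neighbours are known.
    if len(best) < k or abs(diff) < best[-1][0]:
        search(far, target, k, depth + 1, best)
    return best


points = [((2, 3), "a"), ((5, 4), "b"), ((9, 6), "c"),
          ((4, 7), "a"), ((8, 1), "b"), ((7, 2), "c")]
tree = build(points)
print([label for _, label in search(tree, (9, 2), k=3)])  # ['b', 'c', 'c']
```

The pruning test is safe for any metric in which the offset along one axis never exceeds the full distance, which holds for Euclidean, Manhattan, and Chebyshev distances.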
303
+
304
+ ---
305
+
306
+ ### 5 — SOLID Principles
307
+
308
+ **Files:** all modules
309
+
310
+ #### S — Single Responsibility
311
+
312
+ Every class has exactly one reason to change:
313
+
314
+ | Class | Sole Responsibility |
315
+ |-------|-------------------|
316
+ | `BaseModel` | Define the common model contract |
317
+ | `KDTree` | Spatial nearest-neighbour search |
318
+ | `KNNClassifier` | KNN classification logic |
319
+ | `LinearRegression` | Linear regression logic |
320
+ | `NormalEquationStrategy` | Closed-form weight computation |
321
+ | `GradientDescentStrategy` | Iterative gradient-based weight computation |
322
+ | `DistanceMetricFactory` | Instantiate distance metric objects by name |
323
+ | `Evaluator` | Compute and manage evaluation metrics |
324
+
325
+ #### O — Open/Closed
326
+
327
+ Classes are open for extension, closed for modification.
328
+ New metrics and distance functions can be added **without editing any existing class**:
329
+
330
+ ```python
331
+ # Add a new metric — Evaluator source code untouched
332
+ Evaluator.register("r2", lambda t, p: ..., kind="regression")
333
+
334
+ # Add a new distance — KNNClassifier source code untouched
335
+ DistanceMetricFactory.register("chebyshev", ChebyshevDistance)
336
+ knn = KNNClassifier(k=3, distance="chebyshev")
337
+ ```
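The snippet above refers to a `ChebyshevDistance` class without showing it. One possible definition is sketched below, together with a cut-down factory so the block runs on its own. The factory here mirrors the interface described in this README but is a re-implementation for illustration, and the `ValueError` on unknown names is an added assumption.

```python
from abc import ABC, abstractmethod
from typing import Dict, Sequence, Type


class DistanceMetric(ABC):
    @abstractmethod
    def compute(self, a: Sequence[float], b: Sequence[float]) -> float: ...


class ChebyshevDistance(DistanceMetric):
    """Largest absolute difference along any single axis."""

    def compute(self, a, b):
        return max(abs(x - y) for x, y in zip(a, b))


class DistanceMetricFactory:
    """Illustrative stand-in: maps names to metric classes."""

    _registry: Dict[str, Type[DistanceMetric]] = {}

    @classmethod
    def register(cls, name: str, metric_class: Type[DistanceMetric]) -> None:
        cls._registry[name] = metric_class

    @classmethod
    def create(cls, name: str) -> DistanceMetric:
        try:
            return cls._registry[name]()
        except KeyError:
            known = ", ".join(sorted(cls._registry)) or "none"
            raise ValueError(f"unknown distance {name!r}; registered: {known}") from None


DistanceMetricFactory.register("chebyshev", ChebyshevDistance)
metric = DistanceMetricFactory.create("chebyshev")
print(metric.compute([1, 5, 2], [4, 1, 2]))  # max(3, 4, 0) = 4
```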
338
+
339
+ #### L — Liskov Substitution
340
+
341
+ Any `BaseModel` subclass can replace `BaseModel` without breaking callers:
342
+
343
+ ```python
344
+ def train_and_score(model: BaseModel, metric, X_train, y_train, X_test, y_test):
345
+     preds = model.fit_predict(X_train, y_train, X_test)
346
+     return metric(y_test, preds)
347
+
348
+ train_and_score(KNNClassifier(k=3), accuracy, ...) # works
349
+ train_and_score(LinearRegression(), mae, ...) # works
350
+ ```
351
+
352
+ #### I — Interface Segregation
353
+
354
+ `DistanceMetric` exposes only what is needed — a single `compute()` method.
355
+ Implementors are not forced to implement anything they do not use:
356
+
357
+ ```python
358
+ # distances.py
359
+ class DistanceMetric(ABC):
360
+     @abstractmethod
361
+     def compute(self, a: list, b: list) -> float: ...
362
+     # nothing else required
363
+ ```
364
+
365
+ #### D — Dependency Inversion
366
+
367
+ `LinearRegression` depends on the **abstraction** `OptimizationStrategy`,
368
+ not on any concrete strategy class:
369
+
370
+ ```python
371
+ # linear_regression.py
372
+ self._weights = self._strategy.fit(X_b, y)
373
+ # ↑ OptimizationStrategy interface — concrete class unknown here
374
+ ```
375
+
376
+ ---
377
+
378
+ ### 6 — Architectural & Design Patterns
379
+
380
+ **Architecture:** Layered
381
+ - **Core layer** (`base.py`, `distances.py`): abstractions and shared contracts
382
+ - **Algorithm layer** (`knn.py`, `linear_regression.py`): concrete ML algorithms
383
+ - **Evaluation layer** (`evaluator.py`): metric computation
384
+ - **Public API** (`__init__.py`): single entry point, re-exports everything
385
+
386
+ #### Pattern 1 — Template Method (`base.py`)
387
+
388
+ `fit_predict` defines the fixed skeleton (fit → predict).
389
+ Subclasses fill in each step without altering the sequence:
390
+
391
+ ```python
392
+ # base.py
393
+ def fit_predict(self, X_train, y_train, X_test) -> list:
394
+     self.fit(X_train, y_train)    # ← step 1: implemented by subclass
395
+     return self.predict(X_test)   # ← step 2: implemented by subclass
396
+ ```
397
+
398
+ Every model gets `fit_predict` for free through inheritance.
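A runnable miniature of this pattern is sketched below. `BaseModel` is re-declared so the block is self-contained, and `MeanRegressor` is a hypothetical toy subclass that does not exist in the library.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    @abstractmethod
    def fit(self, X, y) -> "BaseModel": ...

    @abstractmethod
    def predict(self, X) -> list: ...

    def fit_predict(self, X_train, y_train, X_test) -> list:
        # Template method: the order fit -> predict is fixed here.
        self.fit(X_train, y_train)
        return self.predict(X_test)


class MeanRegressor(BaseModel):
    """Hypothetical toy model: always predicts the training mean."""

    def fit(self, X, y):
        self._mean = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self._mean for _ in X]


preds = MeanRegressor().fit_predict([[1], [2], [3]], [2.0, 4.0, 6.0], [[10], [20]])
print(preds)  # [4.0, 4.0]

try:
    BaseModel()  # abstract classes cannot be instantiated
except TypeError as exc:
    print("TypeError:", exc)
```

The subclass only supplies the two steps; it inherits the sequencing, which is the point of the pattern.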
399
+
400
+ #### Pattern 2 — Strategy (`linear_regression.py`)
401
+
402
+ The optimisation algorithm is swapped at construction time.
403
+ `LinearRegression.fit()` never knows which concrete strategy it is using:
404
+
405
+ ```python
406
+ lr_ne = LinearRegression(strategy="normal") # uses NormalEquationStrategy
407
+ lr_gd = LinearRegression(strategy="gradient_descent") # uses GradientDescentStrategy
408
+
409
+ # Both models have the same interface — caller code is identical
410
+ lr_ne.fit(X_train, y_train)
411
+ lr_gd.fit(X_train, y_train)
412
+ ```
413
+
414
+ To add a third optimiser (e.g. Adam), only a new `OptimizationStrategy` subclass is needed.
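To make the swap concrete without depending on NumPy, the sketch below applies the same idea to single-feature regression in plain Python. Everything in it is a simplified assumption: `ToyNormalEquation`, `ToyGradientDescent`, and `SimpleLinearRegression` are illustrative names, not the library's classes, and the strategy here returns an `(intercept, slope)` pair rather than a weight vector.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class OptimizationStrategy(ABC):
    @abstractmethod
    def fit(self, x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
        """Return (intercept, slope)."""


class ToyNormalEquation(OptimizationStrategy):
    """Closed-form least squares for one feature."""

    def fit(self, x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        return mean_y - slope * mean_x, slope


class ToyGradientDescent(OptimizationStrategy):
    """Batch gradient descent on mean squared error."""

    def __init__(self, learning_rate=0.05, epochs=5000):
        self.learning_rate, self.epochs = learning_rate, epochs

    def fit(self, x, y):
        b, w, n = 0.0, 0.0, len(x)
        for _ in range(self.epochs):
            errors = [b + w * xi - yi for xi, yi in zip(x, y)]
            grad_b = 2 * sum(errors) / n
            grad_w = 2 * sum(e * xi for e, xi in zip(errors, x)) / n
            b -= self.learning_rate * grad_b
            w -= self.learning_rate * grad_w
        return b, w


class SimpleLinearRegression:
    def __init__(self, strategy: OptimizationStrategy):
        self._strategy = strategy  # depends only on the abstraction

    def fit(self, x, y):
        self.intercept_, self.coef_ = self._strategy.fit(x, y)
        return self

    def predict(self, x) -> List[float]:
        return [self.intercept_ + self.coef_ * xi for xi in x]


x, y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # y = 1 + 2x
for strategy in (ToyNormalEquation(), ToyGradientDescent()):
    model = SimpleLinearRegression(strategy).fit(x, y)
    print(type(strategy).__name__, round(model.intercept_, 3), round(model.coef_, 3))
```

Both strategies recover an intercept near 1 and a slope near 2 on this data; the learning rate and epoch count were picked for this toy input and would need tuning on other data.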
415
+
416
+ #### Pattern 3 — Factory (`distances.py`)
417
+
418
+ `DistanceMetricFactory` centralises object creation.
419
+ `KNNClassifier` never imports `EuclideanDistance` or `ManhattanDistance` directly:
420
+
421
+ ```python
422
+ # distances.py
423
+ class DistanceMetricFactory:
424
+     _registry = {"euclidean": EuclideanDistance, "manhattan": ManhattanDistance}
425
+
426
+     @classmethod
427
+     def create(cls, name: str) -> DistanceMetric:
428
+         return cls._registry[name]()   # create and return
429
+
430
+     @classmethod
431
+     def register(cls, name: str, metric_class: type) -> None:
432
+         cls._registry[name] = metric_class   # extend without modifying
433
+
434
+ # knn.py — only depends on the factory, not the concrete classes
435
+ self._metric = DistanceMetricFactory.create(distance)
436
+ ```
437
+
438
+ ---
439
+
440
+ ## API Reference
441
+
442
+ ### `KNNClassifier`
443
+
444
+ | Parameter | Type | Default | Description |
445
+ |-----------|------|---------|-------------|
446
+ | `k` | `int` | `5` | Number of neighbours |
447
+ | `distance` | `str` | `"euclidean"` | `"euclidean"` or `"manhattan"` (or any registered name) |
448
+ | `n_jobs` | `int` | `1` | Worker processes for prediction (`1` = sequential) |
449
+
450
+ ### `LinearRegression`
451
+
452
+ | Parameter | Type | Default | Description |
453
+ |-----------|------|---------|-------------|
454
+ | `strategy` | `str` | `"normal"` | `"normal"` (closed-form) or `"gradient_descent"` |
455
+ | `learning_rate` | `float` | `0.01` | Learning rate — gradient descent only |
456
+ | `epochs` | `int` | `1000` | Iterations — gradient descent only |
457
+
458
+ ### `Evaluator`
459
+
460
+ | Method | Description |
461
+ |--------|-------------|
462
+ | `evaluate_regression(y_true, y_pred)` | Returns a dict with keys `"mae"`, `"mse"`, `"rmse"` |
463
+ | `evaluate_classification(y_true, y_pred)` | Returns a dict with keys `"accuracy"`, `"precision"`, `"recall"`, `"f1"` |
464
+ | `register(name, fn, kind)` | Add a custom metric at runtime |
465
+
466
+ ### Standalone metric functions
467
+
468
+ ```python
469
+ from coreLearn import accuracy, mae, mse, rmse, precision, recall, f1_score
470
+ ```
471
+
472
+ ---
473
+
474
+ ## Dependencies
475
+
476
+ | Package | Purpose |
477
+ |---------|---------|
478
+ | `numpy` | Matrix operations, vectorised arithmetic |
479
+ | `pytest` | Unit testing |
480
+ | `scikit-learn` | Datasets and preprocessing in examples only |
481
+ | `pandas` | Data loading in examples only |
482
+ | `matplotlib` | Visualisation in examples only |