@dniskav/neuron (npm): version diff 0.3.0 → 0.3.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6)

package/README.md CHANGED

@@ -3,7 +3,7 @@
 A minimal, dependency-free neural network library built from scratch in TypeScript. Designed for learning and experimentation — every line of math is readable.
-Each class is a building block for the next: from a single neuron to a full Transformer with causal attention. v0.3.0 adds classical ML, unsupervised learning, generative models, autograd, and training utilities — all in pure TypeScript, zero dependencies.
+Each class is a building block for the next: from a single neuron to a full Transformer with causal attention. Includes classical ML, unsupervised learning, generative models, embeddings, and autograd — all in pure TypeScript, zero dependencies.
 ```mermaid
 graph TD
@@ -126,6 +126,17 @@ graph TD
 |--------|-------------|
 | `Value` | Scalar autograd node. Builds a computational graph and propagates gradients with `.backward()`. Inspired by micrograd. |
+### Embeddings
+| Export | Description |
+|--------|-------------|
+| `Word2Vec` | Learns word embeddings via Skip-gram or CBOW. Full-softmax, cosine similarity, analogies (`king - man + woman ≈ queen`). |
+| `TSNE` | t-SNE dimensionality reduction. Binary-search perplexity, Student-t kernel, KL gradient, early exaggeration. |
+| `PositionalEncoding` | Sinusoidal positional encoding (Vaswani et al.). Static — no parameters, generalizes to unseen lengths. |
+| `LearnedPositionalEncoding` | Trainable positional encoding. Xavier-initialized, learnable up to a fixed `maxSeqLen`. |
+| `ContrastiveLearning` | SimCLR-style self-supervised learning. NT-Xent loss, encoder + projection head, temperature τ. |
+| `Augmenter` | Data augmentation helpers for contrastive pairs: Gaussian noise, feature dropout, `makePair()`. |
 ### Activations & math
 | Export | Description |
@@ -396,6 +407,79 @@ const { mu, logVar } = vae.encode(image);  // encode → distribution params
 const z = vae.reparametrize(mu, logVar);   // sample z ~ N(μ, σ²)
 ```
+### Word2Vec — learn word embeddings
+```ts
+import { Word2Vec } from "@dniskav/neuron";
+const w2v = new Word2Vec(64, { model: 'skipgram', windowSize: 2 });
+const corpus = [
+  ["the", "king", "rules", "the", "kingdom"],
+  ["the", "queen", "rules", "the", "land"],
+  ["man", "and", "woman", "are", "human"],
+];
+w2v.buildVocab(corpus);
+w2v.train(corpus, 0.05, 200);
+console.log(w2v.similarity("king", "queen")); // high
+console.log(w2v.mostSimilar("king", 3));
+// [{ word: 'queen', score: 0.91 }, ...]
+// Vector arithmetic: king - man + woman ≈ queen
+console.log(w2v.analogy("king", "man", "woman", 1));
+// [{ word: 'queen', score: 0.87 }]
+```
+### t-SNE — visualize embeddings in 2D
+```ts
+import { TSNE } from "@dniskav/neuron";
+// Reduce 128-dim embeddings → 2D for plotting
+const tsne = new TSNE({ perplexity: 30, nIter: 1000, seed: 42 });
+const points2D = tsne.fitTransform(embeddings128D); // [n][2]
+console.log(tsne.kl()); // KL divergence — lower is better
+// Plot points2D with any charting library
+```
+### PositionalEncoding — order without parameters
+```ts
+import { PositionalEncoding, LearnedPositionalEncoding } from "@dniskav/neuron";
+// Sinusoidal — deterministic, no training needed
+const pe = PositionalEncoding.encodeSequence(512, 128); // [512][128]
+const withPos = PositionalEncoding.apply(tokenEmbeddings); // add PE to embeddings
+// Learned — trainable, fixed maxSeqLen
+const lpe = new LearnedPositionalEncoding(512, 128);
+const withLearnedPos = lpe.apply(tokenEmbeddings);
+lpe.update(gradients, 0.001); // update during backprop
+```
+### ContrastiveLearning — representations without labels
+```ts
+import { ContrastiveLearning, Augmenter } from "@dniskav/neuron";
+// Encoder: 128 → [256, 128] → 64 latent, projection head: 64 → 32
+const cl = new ContrastiveLearning(128, [256, 128], 64, { temperature: 0.5 });
+// Create positive pairs from unlabeled data (two augmented views per sample)
+const pairs = unlabeledData.map(x => Augmenter.makePair(x));
+for (let step = 0; step < 1000; step++) {
+  const loss = cl.trainStep(pairs, 0.001);
+  if (step % 100 === 0) console.log(`step ${step}: ${loss.toFixed(4)}`);
+}
+// Use encoder for downstream tasks (classification, clustering, etc.)
+const representation = cl.encode(newSample); // 64-dim vector
+```
 ### Value / Tape — automatic differentiation
 ```ts
@@ -594,8 +678,45 @@ npm test        # run test suite
 If you are an AI agent or LLM working with this codebase, read [AGENTS.md](AGENTS.md) first. It contains the full class hierarchy, design constraints, and what this library does not do.
+## Roadmap (nice to have)
+These features are intentionally out of scope for the current didactic focus but are documented here for reference.
+### ONNX export
+Export trained models to the [ONNX](https://onnx.ai/) interchange format so they can be run in Python (onnxruntime), browsers (onnxruntime-web), mobile, or production inference servers.
+**What it would require:**
+- Serialize each layer's weights + op type to the protobuf ONNX schema (`onnx.proto`).
+- Map neuron layer types to standard ONNX operators (`Gemm`, `MatMul`, `LSTM`, `Conv`, `Relu`, `Softmax`, …).
+- Handle dynamic batch dimensions in the graph IR.
+- Ship a build step that compiles the `.proto` definitions (adds a dev dependency on `protobufjs` or `onnx-proto`).
+**Why it's skipped:** It adds a non-trivial build pipeline and a dependency. The library has zero runtime dependencies by design. ONNX export makes sense once you outgrow the library for training — at that point PyTorch/TF are the right tools.
+### WebGL / WASM backend
+Replace the current pure-JS number arrays with a GPU-accelerated or WASM-compiled backend so larger models (e.g. ViT, GPT-2 scale) become feasible in the browser.
+**What it would require:**
+- Abstract `Tensor` type that backends implement (JS arrays, WebGL textures, WASM memory).
+- WebGL backend: encode matrix ops as fragment-shader programs (similar to `gpu.js` or `tfjs-backend-webgl`).
+- WASM backend: compile a BLAS-like C/Rust core (e.g. `wasm-bindgen` + `ndarray`) and bind it to TypeScript.
+- Every layer's `forward` / `backward` rewritten against the `Tensor` API.
+**Why it's skipped:** The goal is to make the math readable. GPU shader code and WASM bindings are implementation details that obscure the algorithms. The library intentionally trades performance for pedagogical clarity.
+---
 ## Changelog
+### v0.3.2
+- **New — NLP:** `Tokenizer` (char / word / whitespace modes, special tokens PAD/UNK/BOS/EOS, one-hot encoding, `fit` / `encode` / `decode` / `encodeBatch`, JSON serialization)
+- **New — Data:** `DatasetLoader` (parse CSV and JSON into `DataPair`; auto one-hot encoding for string columns; returns `categoricalMaps` for decoding predictions)
+### v0.3.1
+- **New — Embeddings:** `Word2Vec` (Skip-gram + CBOW, full-softmax, cosine similarity, analogies), `TSNE` (binary-search perplexity, Student-t kernel, KL gradient, early exaggeration, seeded PRNG), `PositionalEncoding` (sinusoidal, Vaswani et al.), `LearnedPositionalEncoding` (trainable), `ContrastiveLearning` (NT-Xent, SimCLR encoder + projection head), `Augmenter` (noise, feature dropout, `makePair`)
 ### v0.3.0
 - **New — Classical ML:** `Perceptron`, `LinearRegression` (normal equation + GD), `LogisticRegression`, `SoftmaxRegression`, `GaussianNaiveBayes`, `DecisionTree` (CART, Gini/MSE)
 - **New — Unsupervised:** `KMeans` (K-Means++ init), `PCA` (power iteration + Hotelling deflation), `SOM` (Kohonen map), `HopfieldNetwork` (Hebbian storage + energy), `Autoencoder`
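
Note on the `Word2Vec` analogy example in the README hunk above: the `king - man + woman ≈ queen` claim combines vector arithmetic with cosine similarity. The block below is a minimal sketch of that idea on a hand-made table, not the package's implementation. The `toy` vectors, `cosine`, and `analogySketch` are hypothetical names introduced here, and excluding the three query words from the candidates is an assumption.

```typescript
// Hypothetical toy embeddings, hand-picked for illustration (not trained output).
const toy: Record<string, number[]> = {
  king: [0.9, 0.8, 0.1],
  queen: [0.9, 0.1, 0.8],
  man: [0.1, 0.9, 0.1],
  woman: [0.1, 0.1, 0.9],
  apple: [0.0, 0.2, 0.2],
};

// Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 for a zero vector.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Builds target = positive1 - negative + positive2, then ranks the remaining
// words by cosine similarity to that target.
function analogySketch(
  table: Record<string, number[]>,
  positive1: string,
  negative: string,
  positive2: string,
  topK = 1,
): { word: string; score: number }[] {
  const target = table[positive1].map(
    (v, i) => v - table[negative][i] + table[positive2][i],
  );
  return Object.keys(table)
    .filter((w) => w !== positive1 && w !== negative && w !== positive2)
    .map((word) => ({ word, score: cosine(table[word], target) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// For this toy table, "queen" ranks first.
console.log(analogySketch(toy, "king", "man", "woman", 1));
```

The argument order mirrors the `analogy(positive1, negative, positive2, topK)` signature declared in `index.d.mts`; the scores printed here come from the toy table and say nothing about what a trained model would return.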

package/dist/index.d.mts CHANGED

@@ -570,6 +570,63 @@ declare class Trainer {
     private _computeMetricsArray;
 }
+interface DatasetLoaderOptions {
+    /** Column names to use as input features. */
+    featureCols: string[];
+    /** Column names to use as targets / labels. */
+    targetCols: string[];
+    /**
+     * When true, string values in feature/target columns are one-hot encoded.
+     * When false, non-numeric values throw an error. Default: true.
+     */
+    encodeStrings?: boolean;
+}
+/**
+ * Maps a column name to its {value → one-hot index} dictionary.
+ * Useful for decoding model predictions back to class names.
+ */
+type CategoricalMap = Record<string, Record<string, number>>;
+interface DatasetLoaderResult extends DataPair {
+    /**
+     * For each string column that was one-hot encoded, maps the column name to
+     * the {category → index} dictionary used during encoding.
+     */
+    categoricalMaps: CategoricalMap;
+    /** Column names in the order they appear in each input vector. */
+    featureNames: string[];
+    /** Column names (or expanded one-hot names) in the order they appear in each target vector. */
+    targetNames: string[];
+    /** Total number of rows parsed. */
+    numRows: number;
+}
+declare class DatasetLoader {
+    /**
+     * Parse a CSV string into a DataPair.
+     *
+     * - The first non-empty row is treated as a header.
+     * - Numeric values are parsed with parseFloat.
+     * - String values are one-hot encoded (one column → N binary columns).
+     * - Empty rows and comment lines (starting with #) are skipped.
+     *
+     * @param csv     - raw CSV text
+     * @param options - which columns to use as features / targets
+     */
+    static fromCSV(csv: string, options: DatasetLoaderOptions): DatasetLoaderResult;
+    /**
+     * Parse a JSON string (array of objects) into a DataPair.
+     *
+     * Expected format:
+     *   [{ "col1": 1.0, "col2": "cat", "label": "dog" }, ...]
+     *
+     * @param json    - raw JSON text or a pre-parsed array of objects
+     * @param options - which columns to use as features / targets
+     */
+    static fromJSON(json: string | Record<string, unknown>[], options: DatasetLoaderOptions): DatasetLoaderResult;
+    private static _buildDataPair;
+    private static _parseCSV;
+    private static _parseCSVRow;
+}
 declare class LRScheduler {
     stepDecay(lr: number, epoch: number, dropRate: number, epochsDrop: number): number;
     exponentialDecay(lr: number, epoch: number, decayRate: number): number;
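
Note on the `DatasetLoader` declarations in the hunk above: the doc comments say string columns are one-hot encoded (one column becomes N binary columns) and that `categoricalMaps` records the category-to-index dictionary for decoding predictions. The following is a minimal sketch of that round trip, not the package's code. The helper names are hypothetical, and assigning indices in order of first appearance is an assumption; the package may order categories differently.

```typescript
type CategoryIndex = Record<string, number>;

// Own-property check, so category names such as "toString" are handled safely.
function hasCategory(index: CategoryIndex, value: string): boolean {
  return Object.prototype.hasOwnProperty.call(index, value);
}

// Assigns each distinct value an index in order of first appearance (assumed ordering).
function buildCategoryIndex(values: string[]): CategoryIndex {
  const index: CategoryIndex = {};
  let next = 0;
  for (const v of values) {
    if (!hasCategory(index, v)) {
      index[v] = next;
      next += 1;
    }
  }
  return index;
}

// Encodes one value as a vector with a single 1 at the category's index.
function oneHotEncode(value: string, index: CategoryIndex): number[] {
  if (!hasCategory(index, value)) {
    throw new Error(`Unknown category: ${value}`);
  }
  const vec = new Array<number>(Object.keys(index).length).fill(0);
  vec[index[value]] = 1;
  return vec;
}

// Decodes a prediction vector back to a category name via argmax.
function decodeArgmax(prediction: number[], index: CategoryIndex): string {
  let best = 0;
  for (let i = 1; i < prediction.length; i++) {
    if (prediction[i] > prediction[best]) best = i;
  }
  const name = Object.keys(index).find((k) => index[k] === best);
  if (name === undefined) throw new Error(`No category at index ${best}`);
  return name;
}

const labelIndex = buildCategoryIndex(["dog", "cat", "dog", "bird"]);
console.log(oneHotEncode("cat", labelIndex)); // [0, 1, 0]
console.log(decodeArgmax([0.1, 0.2, 0.7], labelIndex)); // "bird"
```

Keeping the dictionary next to the encoded data is what makes the decode step possible, which is presumably why `DatasetLoaderResult` returns `categoricalMaps` alongside the inputs and targets.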
@@ -893,6 +950,123 @@ declare class TCN {
     train(sequence: number[][], targets: number[][], lr: number): number;
 }
+type Word2VecModel = 'skipgram' | 'cbow';
+interface Word2VecOptions {
+    /** Size of the sliding context window on each side of the center word. Default 2. */
+    windowSize?: number;
+    /** Training architecture. Default 'skipgram'. */
+    model?: Word2VecModel;
+    /** Ignore words with corpus frequency below this threshold. Default 1. */
+    minCount?: number;
+}
+declare class Word2Vec {
+    /** Learned word vectors, shape [vocabSize][embeddingDim]. */
+    embeddings: number[][];
+    /** Maps each vocabulary word to its integer index. */
+    vocab: Map<string, number>;
+    vocabSize: number;
+    embeddingDim: number;
+    private _indexToWord;
+    private _W2;
+    private _windowSize;
+    private _model;
+    private _minCount;
+    private _trained;
+    constructor(embeddingDim?: number, options?: Word2VecOptions);
+    buildVocab(sentences: string[][]): void;
+    static tokenize(text: string): string[];
+    train(sentences: string[][], lr?: number, epochs?: number): number[];
+    getEmbedding(word: string): number[];
+    similarity(word1: string, word2: string): number;
+    mostSimilar(word: string, topK?: number): {
+        word: string;
+        score: number;
+    }[];
+    analogy(positive1: string, negative: string, positive2: string, topK?: number): {
+        word: string;
+        score: number;
+    }[];
+    private _skipgramStep;
+    private _cbowStep;
+    private _hiddenToScores;
+    private _nearestByVector;
+    private _cosine;
+}
+interface TSNEOptions {
+    /** Dimensionality of the output embedding. Default 2. */
+    nComponents?: number;
+    /**
+     * Perplexity — loosely controls the effective number of neighbors considered
+     * for each point. Typical values: 5–50. Default 30.
+     * Must be less than the number of data points.
+     */
+    perplexity?: number;
+    /** Learning rate for gradient descent. Default 200. */
+    lr?: number;
+    /** Number of gradient-descent iterations. Default 1000. */
+    nIter?: number;
+    /**
+     * Seed for the pseudo-random number generator.
+     * Set to any integer for reproducible results. Default uses Math.random.
+     */
+    seed?: number;
+}
+declare class TSNE {
+    /** Result of the embedding, shape [n][nComponents]. Available after fit(). */
+    embedding: number[][];
+    private readonly _nComponents;
+    private readonly _perplexity;
+    private readonly _lr;
+    private readonly _nIter;
+    private readonly _seed;
+    private _klDivergence;
+    private _P;
+    constructor(options?: TSNEOptions);
+    fit(X: number[][]): void;
+    fitTransform(X: number[][]): number[][];
+    kl(): number;
+    private _computePcond;
+}
+declare class PositionalEncoding {
+    static encode(pos: number, dModel: number): number[];
+    static encodeSequence(seqLen: number, dModel: number): number[][];
+    static apply(embeddings: number[][], seqLen?: number): number[][];
+}
+declare class LearnedPositionalEncoding {
+    readonly maxSeqLen: number;
+    readonly dModel: number;
+    weights: number[][];
+    constructor(maxSeqLen: number, dModel: number);
+    getEncoding(pos: number): number[];
+    apply(embeddings: number[][], seqLen?: number): number[][];
+    update(dWeights: number[][], lr: number): void;
+}
+declare class Augmenter {
+    static addNoise(x: number[], sigma?: number): number[];
+    static dropoutFeatures(x: number[], rate?: number): number[];
+    static augment(x: number[], noiseStd?: number, dropRate?: number): number[];
+    static makePair(x: number[]): [number[], number[]];
+}
+declare class ContrastiveLearning {
+    encoder: NetworkN;
+    projectionHead: NetworkN;
+    temperature: number;
+    constructor(inputSize: number, encoderHidden: number[], projectionDim: number, options?: {
+        temperature?: number;
+        encoderOptions?: NetworkNOptions;
+    });
+    encode(x: number[]): number[];
+    project(x: number[]): number[];
+    static cosineSimilarity(a: number[], b: number[]): number;
+    computeLoss(pairs: [number[], number[]][]): number;
+    trainStep(pairs: [number[], number[]][], lr: number): number;
+    private _forwardProjections;
+    private _ntXentLoss;
+}
 declare class GAN {
     readonly generator: NetworkN;
     readonly discriminator: NetworkN;
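
Note on the `PositionalEncoding` declarations in the hunk above: the README describes this class as the sinusoidal encoding from Vaswani et al., where even dimensions use `sin(pos / 10000^(2i/dModel))` and odd dimensions use the matching `cos`. The sketch below implements that published formula under hypothetical function names. Whether the package uses exactly this base and indexing is an assumption.

```typescript
// Sinusoidal encoding for one position: dimensions 2i and 2i+1 share a frequency.
function sinusoidalEncode(pos: number, dModel: number): number[] {
  const out = new Array<number>(dModel);
  for (let j = 0; j < dModel; j++) {
    const pairIndex = Math.floor(j / 2);
    const angle = pos / Math.pow(10000, (2 * pairIndex) / dModel);
    out[j] = j % 2 === 0 ? Math.sin(angle) : Math.cos(angle);
  }
  return out;
}

// Encodings for positions 0..seqLen-1, shape [seqLen][dModel].
function sinusoidalSequence(seqLen: number, dModel: number): number[][] {
  return Array.from({ length: seqLen }, (_, pos) => sinusoidalEncode(pos, dModel));
}

// Adds the encoding for each row's position to that row, element by element.
function addPositions(embeddings: number[][]): number[][] {
  return embeddings.map((row, pos) => {
    const pe = sinusoidalEncode(pos, row.length);
    return row.map((v, j) => v + pe[j]);
  });
}

console.log(sinusoidalEncode(0, 4)); // [0, 1, 0, 1], since sin(0) = 0 and cos(0) = 1
console.log(sinusoidalSequence(3, 4).length); // 3
```

Because the values depend only on `pos` and `dModel`, nothing is learned or stored, which matches the README's description of the class as static and parameter-free. `LearnedPositionalEncoding` instead keeps a `weights` table of shape `[maxSeqLen][dModel]`.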
@@ -990,6 +1164,121 @@ declare function perplexity(yTrue: number[], probabilities: number[][]): number;
 declare function printConfusionMatrix(matrix: number[][], labels?: string[]): void;
 declare function classificationReport(yTrue: number[], yPred: number[], labels?: string[]): void;
+type TokenizerMode = 'char' | 'word' | 'whitespace';
+interface TokenizerOptions {
+    /** Tokenization strategy. Default: 'word' */
+    mode?: TokenizerMode;
+    /** Normalize text to lowercase before processing. Default: true */
+    lowercase?: boolean;
+    /** Maximum vocabulary size (most frequent tokens kept). 0 = unlimited. Default: 0 */
+    maxVocab?: number;
+    /** Additional special tokens to reserve (appended after PAD/UNK/BOS/EOS). */
+    specialTokens?: string[];
+}
+interface EncodeOptions {
+    /** Prepend <BOS> token. Default: false */
+    addBOS?: boolean;
+    /** Append <EOS> token. Default: false */
+    addEOS?: boolean;
+}
+interface EncodeBatchOptions extends EncodeOptions {
+    /**
+     * Pad or truncate all sequences to this length.
+     * Sequences shorter than padTo are right-padded with <PAD> (id 0).
+     * Sequences longer than padTo are truncated on the right.
+     * If omitted, sequences are left at their natural length.
+     */
+    padTo?: number;
+}
+interface TokenizerSnapshot {
+    mode: TokenizerMode;
+    lowercase: boolean;
+    maxVocab: number;
+    token2id: Record<string, number>;
+}
+declare class Tokenizer {
+    static readonly PAD = "<PAD>";
+    static readonly UNK = "<UNK>";
+    static readonly BOS = "<BOS>";
+    static readonly EOS = "<EOS>";
+    private readonly _mode;
+    private readonly _lowercase;
+    private readonly _maxVocab;
+    private readonly _extraSpecial;
+    private _token2id;
+    private _id2token;
+    private _fitted;
+    constructor(options?: TokenizerOptions);
+    /**
+     * Build vocabulary from an array of text strings.
+     * Calling fit() again resets and rebuilds the vocabulary from scratch.
+     *
+     * @param texts - corpus to build the vocabulary from
+     * @returns this (chainable)
+     */
+    fit(texts: string[]): this;
+    /**
+     * Split raw text into an array of string tokens (no ID conversion yet).
+     * Useful for inspecting what the tokenizer produces before encoding.
+     */
+    tokenize(text: string): string[];
+    /**
+     * Convert a text string to a sequence of token IDs.
+     * Unknown tokens map to <UNK> (id 1).
+     *
+     * @param text    - input text
+     * @param options - addBOS / addEOS flags
+     */
+    encode(text: string, options?: EncodeOptions): number[];
+    /**
+     * Encode an array of texts, optionally padding/truncating to a fixed length.
+     *
+     * @param texts   - array of input texts
+     * @param options - addBOS / addEOS / padTo
+     */
+    encodeBatch(texts: string[], options?: EncodeBatchOptions): number[][];
+    /**
+     * Convert a sequence of token IDs back to a human-readable string.
+     *
+     * @param ids          - array of token IDs
+     * @param stripSpecial - remove PAD/BOS/EOS tokens from output. Default: true
+     */
+    decode(ids: number[], stripSpecial?: boolean): string;
+    /**
+     * Convert a sequence of token IDs to one-hot vectors.
+     * Each vector has length `vocabSize` with a single 1 at the token's position.
+     * Useful when feeding tokens directly into a Network without an embedding layer.
+     *
+     * @param ids - array of token IDs (e.g. from encode())
+     * @returns   - 2D array of shape [seqLen, vocabSize]
+     */
+    oneHot(ids: number[]): number[][];
+    /** Number of tokens in the vocabulary (including special tokens). */
+    get vocabSize(): number;
+    /** True if fit() has been called at least once. */
+    get isFitted(): boolean;
+    /** Get the integer ID for a token string, or undefined if not in vocabulary. */
+    tokenToId(token: string): number | undefined;
+    /** Get the token string for an integer ID, or undefined if out of range. */
+    idToToken(id: number): string | undefined;
+    /**
+     * Return the full vocabulary as an array ordered by ID.
+     * Index i of the returned array is the token with ID i.
+     */
+    getVocabulary(): string[];
+    /**
+     * Serialize the fitted tokenizer to a plain JSON-compatible object.
+     * Store it with JSON.stringify(); reload with Tokenizer.fromJSON().
+     */
+    toJSON(): TokenizerSnapshot;
+    /**
+     * Restore a Tokenizer from a snapshot produced by toJSON().
+     */
+    static fromJSON(snapshot: TokenizerSnapshot): Tokenizer;
+    private _register;
+    private _assertFitted;
+}
 declare class EarlyStopping {
     bestValue: number;
     readonly patience: number;
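
Note on the `Tokenizer` declarations in the hunk above: the doc comments state that unknown tokens map to `<UNK>` (id 1), that short sequences are right-padded with `<PAD>` (id 0), and that long sequences are truncated on the right. The sketch below illustrates those rules only. The `<BOS>` and `<EOS>` ids of 2 and 3, the helper names, and the choice to add BOS/EOS before padding or truncating are assumptions not confirmed by the diff.

```typescript
const PAD_ID = 0; // stated in the EncodeBatchOptions doc comment
const UNK_ID = 1; // stated in the encode() doc comment
const BOS_ID = 2; // assumed: not stated in the diff
const EOS_ID = 3; // assumed: not stated in the diff

// Maps tokens to ids, falling back to UNK_ID for out-of-vocabulary tokens.
function encodeSketch(
  tokens: string[],
  token2id: Map<string, number>,
  addBOS = false,
  addEOS = false,
): number[] {
  const ids = tokens.map((t) => token2id.get(t) ?? UNK_ID);
  if (addBOS) ids.unshift(BOS_ID);
  if (addEOS) ids.push(EOS_ID);
  return ids;
}

// Right-pads with PAD_ID or truncates on the right so the result has length padTo.
function padOrTruncate(ids: number[], padTo: number): number[] {
  if (ids.length >= padTo) return ids.slice(0, padTo);
  return ids.concat(new Array<number>(padTo - ids.length).fill(PAD_ID));
}

// Hypothetical two-word vocabulary placed after the four special tokens.
const vocab = new Map<string, number>([
  ["the", 4],
  ["cat", 5],
]);

const ids = encodeSketch(["the", "dog"], vocab, true, true);
console.log(ids); // [2, 4, 1, 3]: "dog" is out of vocabulary, so it becomes UNK_ID
console.log(padOrTruncate(ids, 6)); // [2, 4, 1, 3, 0, 0]
console.log(padOrTruncate(ids, 3)); // [2, 4, 1]
```

Note that truncating after appending `<EOS>` can cut the end marker off, as the last line shows. Whether the package guards against that is not visible in the type declarations.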
@@ -1062,4 +1351,4 @@ declare class DataAugmentation {
     };
 }
-export { type Activation, Adam, AttentionHead, Autoencoder, BatchNorm, BiasVector, CausalConv1D, ClipOptimizer, ClippedOptimizerFactory, Conv1D, Conv2D, DataAugmentation, DataLoader, type DataPair, DecisionTree, Dropout, EarlyStopping, EmbeddingMatrix, Flatten, GAN, GRULayer, GaussianNaiveBayes, HopfieldNetwork, KMeans, type KMeansOptions, LRScheduler, LSTMLayer, Layer, LayerNorm, LinearRegression, LogisticRegression, LossPlotter, MaxPool2D, ModelSaver, Momentum, MultiHeadAttention, Network, NetworkLSTM, type NetworkLSTMOptions, NetworkN, type NetworkNOptions, NetworkTransformer, type NetworkTransformerOptions, NetworkTransformerRL, type NetworkTransformerRLOptions, Neuron, NeuronN, type Optimizer, type OptimizerFactory, PCA, Perceptron, RNN, SGD, SOM, type SOMOptions, Seq2Seq, type Serializable, SoftmaxRegression, TCN, type TrainDataset, type TrainMetrics, type TrainableNetwork, type TrainableNetworkWithWeights, Trainer, type TrainerOptions, TransformerBlock, type TransformerBlockOptions, VAE, Value, WeightInspector, WeightMatrix, type WeightStats, accuracy, auc, classificationReport, confusionMatrix, crossEntropy, crossEntropyDelta, crossEntropyDeltaRaw, defaultOptimizer, elu, f1Score, leakyRelu, linear, mae, makeElu, makeLeakyRelu, matMul, mse, mseDelta, perplexity, precision, printConfusionMatrix, r2Score, recall, relu, rmse, rocCurve, sigmoid, softmax, softmaxBackward, tanh, transpose, validate2DArray, validateArray, validateArrayMinLength, validateNumber };
+export { type Activation, Adam, AttentionHead, Augmenter, Autoencoder, BatchNorm, BiasVector, type CategoricalMap, CausalConv1D, ClipOptimizer, ClippedOptimizerFactory, ContrastiveLearning, Conv1D, Conv2D, DataAugmentation, DataLoader, type DataPair, DatasetLoader, type DatasetLoaderOptions, type DatasetLoaderResult, DecisionTree, Dropout, EarlyStopping, EmbeddingMatrix, type EncodeBatchOptions, type EncodeOptions, Flatten, GAN, GRULayer, GaussianNaiveBayes, HopfieldNetwork, KMeans, type KMeansOptions, LRScheduler, LSTMLayer, Layer, LayerNorm, LearnedPositionalEncoding, LinearRegression, LogisticRegression, LossPlotter, MaxPool2D, ModelSaver, Momentum, MultiHeadAttention, Network, NetworkLSTM, type NetworkLSTMOptions, NetworkN, type NetworkNOptions, NetworkTransformer, type NetworkTransformerOptions, NetworkTransformerRL, type NetworkTransformerRLOptions, Neuron, NeuronN, type Optimizer, type OptimizerFactory, PCA, Perceptron, PositionalEncoding, RNN, SGD, SOM, type SOMOptions, Seq2Seq, type Serializable, SoftmaxRegression, TCN, TSNE, type TSNEOptions, Tokenizer, type TokenizerMode, type TokenizerOptions, type TokenizerSnapshot, type TrainDataset, type TrainMetrics, type TrainableNetwork, type TrainableNetworkWithWeights, Trainer, type TrainerOptions, TransformerBlock, type TransformerBlockOptions, VAE, Value, WeightInspector, WeightMatrix, type WeightStats, Word2Vec, type Word2VecModel, type Word2VecOptions, accuracy, auc, classificationReport, confusionMatrix, crossEntropy, crossEntropyDelta, crossEntropyDeltaRaw, defaultOptimizer, elu, f1Score, leakyRelu, linear, mae, makeElu, makeLeakyRelu, matMul, mse, mseDelta, perplexity, precision, printConfusionMatrix, r2Score, recall, relu, rmse, rocCurve, sigmoid, softmax, softmaxBackward, tanh, transpose, validate2DArray, validateArray, validateArrayMinLength, validateNumber };