pybgpkitstream 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pybgpkitstream-0.1.0/PKG-INFO +65 -0
- pybgpkitstream-0.1.0/README.md +53 -0
- pybgpkitstream-0.1.0/pyproject.toml +28 -0
- pybgpkitstream-0.1.0/src/pybgpkitstream/__init__.py +4 -0
- pybgpkitstream-0.1.0/src/pybgpkitstream/bgpelement.py +44 -0
- pybgpkitstream-0.1.0/src/pybgpkitstream/bgpkitstream.py +228 -0
- pybgpkitstream-0.1.0/src/pybgpkitstream/bgpstreamconfig.py +62 -0
- pybgpkitstream-0.1.0/src/pybgpkitstream/cli.py +151 -0
- pybgpkitstream-0.1.0/src/pybgpkitstream/py.typed +0 -0
|
@@ -0,0 +1,65 @@
|
|
|
1
|
+
Metadata-Version: 2.3
|
|
2
|
+
Name: pybgpkitstream
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Drop-in replacement for PyBGPStream using BGPKIT
|
|
5
|
+
Author: JustinLoye
|
|
6
|
+
Author-email: JustinLoye <jloye@iij.ad.jp>
|
|
7
|
+
Requires-Dist: aiohttp>=3.12.15
|
|
8
|
+
Requires-Dist: pybgpkit>=0.6.2
|
|
9
|
+
Requires-Dist: pydantic>=2.11.9
|
|
10
|
+
Requires-Python: >=3.10
|
|
11
|
+
Description-Content-Type: text/markdown
|
|
12
|
+
|
|
13
|
+
# PyBGPKITStream
|
|
14
|
+
|
|
15
|
+
A drop-in replacement for PyBGPStream using BGPKIT
|
|
16
|
+
|
|
17
|
+
## Features
|
|
18
|
+
|
|
19
|
+
- Effortlessly switch to another BGP stream:
|
|
20
|
+
- Designed to be a seamless, drop-in replacement ([example](tests/test_stream.py#L38))
|
|
21
|
+
- Lazy message generation: generates time-ordered BGP messages on the fly, consuming minimal memory and making it suitable for large datasets
|
|
22
|
+
- Supports multiple route collectors
|
|
23
|
+
- Supports both ribs and updates
|
|
24
|
+
- Caching with concurrent downloading is enabled, and the caching mechanism is fully compatible with the BGPKIT parser's caching functionality
|
|
25
|
+
- [Good-enough performance](examples/perf.ipynb)
|
|
26
|
+
- A CLI tool
|
|
27
|
+
|
|
28
|
+
## Quick start
|
|
29
|
+
|
|
30
|
+
```python
|
|
31
|
+
import datetime
|
|
32
|
+
from pybgpkitstream import BGPStreamConfig, BGPKITStream
|
|
33
|
+
|
|
34
|
+
config = BGPStreamConfig(
|
|
35
|
+
start_time=datetime.datetime(2010, 9, 1, 0, 0),
|
|
36
|
+
end_time=datetime.datetime(2010, 9, 1, 1, 59),
|
|
37
|
+
collectors=["route-views.sydney", "route-views.wide"],
|
|
38
|
+
data_types=["ribs", "updates"],
|
|
39
|
+
)
|
|
40
|
+
|
|
41
|
+
stream = BGPKITStream.from_config(config)
|
|
42
|
+
|
|
43
|
+
n_elems = 0
|
|
44
|
+
for _ in stream:
|
|
45
|
+
n_elems += 1
|
|
46
|
+
|
|
47
|
+
print(f"Processed {n_elems} BGP elements")
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
## Motivation
|
|
51
|
+
|
|
52
|
+
PyBGPStream is great but the implementation is complex and stops working when UC San Diego experiences a power outage.
|
|
53
|
+
BGPKIT broker and parser are great, but cannot be used to create an ordered stream of BGP messages from multiple collectors and multiple data types.
|
|
54
|
+
|
|
55
|
+
## Missing features
|
|
56
|
+
|
|
57
|
+
- live mode
|
|
58
|
+
- `pybgpkitstream.BGPElement` is not fully compatible with `pybgpstream.BGPElem`: missing record_type (BGPKIT limitation), project (BGPKIT limitation), router (could be improved), router_ip (could be improved)
|
|
59
|
+
- CLI output is not yet compatible with `bgpdump -m` or `bgpreader` (right now a similar-looking output is produced)
|
|
60
|
+
|
|
61
|
+
## Issues
|
|
62
|
+
|
|
63
|
+
- Program will crash when working with many update files per collector (~ more than few hours of data), only when caching is disabled. This might be caused by [BGPKIT parser not being lazy](https://github.com/bgpkit/bgpkit-parser/pull/239). See [details and workaround fix](examples/many_updates.ipynb)
|
|
64
|
+
- Filters are designed with BGPKIT in mind, and can differ slightly from pybgpstream filters. See [this file](tests/pybgpstream_utils.py) for a conversion to PyBGPStream filter. Note that for now the filters have not been heavily tested...
|
|
65
|
+
- ... just like the rest of the project. Use at your own risk. The only tests I did are in /tests
|
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
# PyBGPKITStream
|
|
2
|
+
|
|
3
|
+
A drop-in replacement for PyBGPStream using BGPKIT
|
|
4
|
+
|
|
5
|
+
## Features
|
|
6
|
+
|
|
7
|
+
- Effortlessly switch to another BGP stream:
|
|
8
|
+
- Designed to be a seamless, drop-in replacement ([example](tests/test_stream.py#L38))
|
|
9
|
+
- Lazy message generation: generates time-ordered BGP messages on the fly, consuming minimal memory and making it suitable for large datasets
|
|
10
|
+
- Supports multiple route collectors
|
|
11
|
+
- Supports both ribs and updates
|
|
12
|
+
- Caching with concurrent downloading is enabled, and the caching mechanism is fully compatible with the BGPKIT parser's caching functionality
|
|
13
|
+
- [Good-enough performance](examples/perf.ipynb)
|
|
14
|
+
- A CLI tool
|
|
15
|
+
|
|
16
|
+
## Quick start
|
|
17
|
+
|
|
18
|
+
```python
|
|
19
|
+
import datetime
|
|
20
|
+
from pybgpkitstream import BGPStreamConfig, BGPKITStream
|
|
21
|
+
|
|
22
|
+
config = BGPStreamConfig(
|
|
23
|
+
start_time=datetime.datetime(2010, 9, 1, 0, 0),
|
|
24
|
+
end_time=datetime.datetime(2010, 9, 1, 1, 59),
|
|
25
|
+
collectors=["route-views.sydney", "route-views.wide"],
|
|
26
|
+
data_types=["ribs", "updates"],
|
|
27
|
+
)
|
|
28
|
+
|
|
29
|
+
stream = BGPKITStream.from_config(config)
|
|
30
|
+
|
|
31
|
+
n_elems = 0
|
|
32
|
+
for _ in stream:
|
|
33
|
+
n_elems += 1
|
|
34
|
+
|
|
35
|
+
print(f"Processed {n_elems} BGP elements")
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
## Motivation
|
|
39
|
+
|
|
40
|
+
PyBGPStream is great but the implementation is complex and stops working when UC San Diego experiences a power outage.
|
|
41
|
+
BGPKIT broker and parser are great, but cannot be used to create an ordered stream of BGP messages from multiple collectors and multiple data types.
|
|
42
|
+
|
|
43
|
+
## Missing features
|
|
44
|
+
|
|
45
|
+
- live mode
|
|
46
|
+
- `pybgpkitstream.BGPElement` is not fully compatible with `pybgpstream.BGPElem`: missing record_type (BGPKIT limitation), project (BGPKIT limitation), router (could be improved), router_ip (could be improved)
|
|
47
|
+
- CLI output is not yet compatible with `bgpdump -m` or `bgpreader` (right now a similar-looking output is produced)
|
|
48
|
+
|
|
49
|
+
## Issues
|
|
50
|
+
|
|
51
|
+
- Program will crash when working with many update files per collector (~ more than few hours of data), only when caching is disabled. This might be caused by [BGPKIT parser not being lazy](https://github.com/bgpkit/bgpkit-parser/pull/239). See [details and workaround fix](examples/many_updates.ipynb)
|
|
52
|
+
- Filters are designed with BGPKIT in mind, and can differ slightly from pybgpstream filters. See [this file](tests/pybgpstream_utils.py) for a conversion to PyBGPStream filter. Note that for now the filters have not been heavily tested...
|
|
53
|
+
- ... just like the rest of the project. Use at your own risk. The only tests I did are in /tests
|
|
@@ -0,0 +1,28 @@
|
|
|
1
|
+
# Package metadata for pybgpkitstream.
[project]
name = "pybgpkitstream"
version = "0.1.0"
description = "Drop-in replacement for PyBGPStream using BGPKIT"
readme = "README.md"
authors = [
    { name = "JustinLoye", email = "jloye@iij.ad.jp" }
]
requires-python = ">=3.10"
dependencies = [
    "aiohttp>=3.12.15",
    "pybgpkit>=0.6.2",
    "pydantic>=2.11.9",
]

# Built with uv's native build backend.
[build-system]
requires = ["uv_build>=0.8.12,<0.9.0"]
build-backend = "uv_build"

# Development-only dependency groups (not installed with the package).
[dependency-groups]
test = [
    "jupyter>=1.1.1",
    "pybgpstream>=2.0.2",
    "pytest>=8.4.2",
]

# Console entry point: the `pybgpkitstream` command runs the CLI.
[project.scripts]
pybgpkitstream = "pybgpkitstream.cli:main"
|
|
@@ -0,0 +1,44 @@
|
|
|
1
|
+
from typing import NamedTuple, TypedDict
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
# BGP element attribute fields, keyed like pybgpstream's `BGPElem.fields`.
# The functional TypedDict syntax is the documented way to declare keys such
# as "next-hop" and "as-path" that are not valid Python identifiers; the
# previous class-body `__annotations__ = {...}` assignment relied on an
# implementation detail of the TypedDict metaclass.
ElementFields = TypedDict(
    "ElementFields",
    {
        "next-hop": str,
        "as-path": str,
        "communities": list[str],
        "prefix": str,
    },
)
|
|
12
|
+
|
|
13
|
+
|
|
14
|
+
class BGPElement(NamedTuple):
    """A single BGP element, API-compatible with pybgpstream.BGPElem.

    `fields` holds the per-message attributes ("prefix", "as-path", ...);
    absent keys are rendered as ``None`` in the pipe-separated string form.
    """

    type: str
    collector: str
    time: float
    peer_asn: int
    peer_address: str
    fields: ElementFields

    def __str__(self):
        """Render in pybgpstream's pipe-separated format (credit to pybgpstream)."""
        if "communities" in self.fields:
            communities = " ".join(self.fields["communities"])
        else:
            communities = None
        values = (
            self.type,
            self.time,
            self.collector,
            self.peer_asn,
            self.peer_address,
            self._maybe_field("prefix"),
            self._maybe_field("next-hop"),
            self._maybe_field("as-path"),
            communities,
            self._maybe_field("old-state"),
            self._maybe_field("new-state"),
        )
        return "%s|%f|%s|%s|%s|%s|%s|%s|%s|%s|%s" % values

    def _maybe_field(self, field):
        """Return the field value, or None when absent (credit to pybgpstream)."""
        return self.fields.get(field)
|
@@ -0,0 +1,228 @@
|
|
|
1
|
+
import asyncio
|
|
2
|
+
import os
|
|
3
|
+
import re
|
|
4
|
+
import datetime
|
|
5
|
+
from typing import Iterator, Literal
|
|
6
|
+
from collections import defaultdict
|
|
7
|
+
from itertools import chain
|
|
8
|
+
from heapq import merge
|
|
9
|
+
from operator import itemgetter
|
|
10
|
+
import binascii
|
|
11
|
+
import logging
|
|
12
|
+
|
|
13
|
+
import aiohttp
|
|
14
|
+
import bgpkit
|
|
15
|
+
from bgpkit.bgpkit_broker import BrokerItem
|
|
16
|
+
|
|
17
|
+
from pybgpkitstream.bgpstreamconfig import BGPStreamConfig
|
|
18
|
+
from pybgpkitstream.bgpelement import BGPElement
|
|
19
|
+
|
|
20
|
+
|
|
21
|
+
logger = logging.getLogger(__name__)
|
|
22
|
+
|
|
23
|
+
|
|
24
|
+
def convert_bgpkit_elem(element, is_rib: bool, collector: str) -> BGPElement:
    """Convert a pybgpkit element into a pybgpstream-like BGPElement.

    RIB entries are typed "R" (BGPKIT does not distinguish them itself);
    the collector name is re-attached since BGPKIT elements do not carry it.
    """
    attributes = {
        "next-hop": element.next_hop,
        "as-path": element.as_path,
        # Normalize a missing/None communities attribute to an empty list.
        "communities": element.communities or [],
        "prefix": element.prefix,
    }
    return BGPElement(
        type="R" if is_rib else element.elem_type,
        collector=collector,
        time=element.timestamp,
        peer_asn=element.peer_asn,
        peer_address=element.peer_ip,
        fields=attributes,
    )
|
|
39
|
+
|
|
40
|
+
|
|
41
|
+
def crc32(input_str: str):
    """Return the CRC32 of *input_str* as an 8-char lowercase hex string."""
    checksum = binascii.crc32(input_str.encode("utf-8")) & 0xFFFFFFFF
    return format(checksum, "08x")
|
|
45
|
+
|
|
46
|
+
|
|
47
|
+
class BGPKITStream:
    """Time-ordered BGP element stream over one or more route collectors.

    Queries the BGPKIT broker for MRT archive files (ribs and/or updates),
    optionally prefetches them concurrently into a local cache, then lazily
    merges the per-collector parser iterators by element timestamp.
    """

    def __init__(
        self,
        ts_start: float,
        ts_end: float,
        collector_id: str,
        data_type: list[Literal["update", "rib"]],
        cache_dir: str | None,
        filters: dict | None = None,
    ):
        """
        Args:
            ts_start: Stream start as a Unix timestamp (seconds).
            ts_end: Stream end as a Unix timestamp (seconds).
            collector_id: Comma-separated collector names for the broker query.
            data_type: Archive kinds to stream ("update" and/or "rib").
            cache_dir: Directory for cached downloads, or None to disable caching.
            filters: BGPKIT parser filters; defaults to no filtering.
        """
        self.ts_start = ts_start
        self.ts_end = ts_end
        self.collector_id = collector_id
        self.data_type = data_type
        self.cache_dir = cache_dir
        # None sentinel instead of a mutable `{}` default shared across calls.
        self.filters = filters if filters is not None else {}

        self.broker = bgpkit.Broker()

    @staticmethod
    def _generate_cache_filename(url):
        """Generate a cache filename compatible with BGPKIT parser."""
        hash_suffix = crc32(url)

        if "updates." in url:
            data_type = "updates"
        elif "rib" in url or "view" in url:
            data_type = "rib"
        else:
            raise ValueError("Could not understand data type from url")

        # Look for patterns like rib.20100901.0200 or updates.20100831.2345
        timestamp_match = re.search(r"(\d{8})\.(\d{4})", url)
        if timestamp_match:
            timestamp = f"{timestamp_match.group(1)}.{timestamp_match.group(2)}"
        else:
            raise ValueError("Could not parse timestamp from url")

        if url.endswith(".bz2"):
            compression_ext = "bz2"
        elif url.endswith(".gz"):
            compression_ext = "gz"
        else:
            raise ValueError("Could not parse extension from url")

        return f"cache-{data_type}.{timestamp}.{hash_suffix}.{compression_ext}"

    def _set_urls(self):
        """Set archive files URL with bgpkit broker."""
        self.urls = {"rib": defaultdict(list), "update": defaultdict(list)}
        for data_type in self.data_type:
            items: list[BrokerItem] = self.broker.query(
                # Pad the start by one minute so the archive file covering
                # ts_start (stamped slightly earlier) is included.
                ts_start=datetime.datetime.fromtimestamp(self.ts_start)
                - datetime.timedelta(minutes=1),
                ts_end=datetime.datetime.fromtimestamp(self.ts_end),
                collector_id=self.collector_id,
                data_type=data_type,
            )
            for item in items:
                self.urls[data_type][item.collector_id].append(item.url)

    async def _download_file(self, semaphore, session, url, filepath, data_type, rc):
        """Helper coroutine to download a single file, controlled by a semaphore."""
        async with semaphore:
            logger.debug(f"{filepath} is a cache miss. Downloading {url}")
            try:
                async with session.get(url) as resp:
                    resp.raise_for_status()
                    with open(filepath, "wb") as fd:
                        async for chunk in resp.content.iter_chunked(8192):
                            fd.write(chunk)
                return data_type, rc, filepath
            except aiohttp.ClientError as e:
                logger.error(f"Failed to download {url}: {e}")
                # Return None on failure so asyncio.gather doesn't cancel everything.
                return None

    async def _prefetch_data(self):
        """Download archive files concurrently and cache to `self.cache_dir`."""
        self.paths = {"rib": defaultdict(list), "update": defaultdict(list)}
        tasks = []

        CONCURRENT_DOWNLOADS = 10
        semaphore = asyncio.Semaphore(CONCURRENT_DOWNLOADS)

        conn = aiohttp.TCPConnector()
        async with aiohttp.ClientSession(connector=conn) as session:
            # Create all the download tasks.
            for data_type in self.data_type:
                for rc, rc_urls in self.urls[data_type].items():
                    for url in rc_urls:
                        filename = self._generate_cache_filename(url)
                        filepath = os.path.join(self.cache_dir, filename)

                        if os.path.exists(filepath):
                            logger.debug(f"{filepath} is a cache hit")
                            self.paths[data_type][rc].append(filepath)
                        else:
                            task = asyncio.create_task(
                                self._download_file(
                                    semaphore, session, url, filepath, data_type, rc
                                )
                            )
                            tasks.append(task)

            if tasks:
                logger.info(
                    f"Starting download of {len(tasks)} files with a concurrency of {CONCURRENT_DOWNLOADS}..."
                )
                results = await asyncio.gather(*tasks)

                # Process the results, skipping any 'None' values from failed downloads.
                for result in results:
                    if result:
                        data_type, rc, filepath = result
                        self.paths[data_type][rc].append(filepath)
                logger.info("All downloads finished.")

    def _create_tagged_iterator(self, iterator, is_rib, collector):
        """Creates a generator that tags elements with metadata missing in bgpkit."""
        return ((elem.timestamp, elem, is_rib, collector) for elem in iterator)

    def __iter__(self) -> Iterator[BGPElement]:
        """Lazily yield BGPElements ordered by timestamp across all sources."""
        self._set_urls()

        if self.cache_dir:
            asyncio.run(self._prefetch_data())

        # One iterator for each data_type * collector combinations
        # To be merged according to the elements timestamp
        iterators_to_merge = []

        for data_type in self.data_type:
            is_rib = data_type == "rib"

            # Get rib or update files per collector
            if self.cache_dir:
                rc_to_urls = self.paths[data_type]
            else:
                rc_to_urls = self.urls[data_type]

            # Chain rib or update iterators to get one stream per collector / data_type
            for rc, urls in rc_to_urls.items():
                parsers = [bgpkit.Parser(url=url, filters=self.filters) for url in urls]

                chained_iterator = chain.from_iterable(parsers)

                # Add metadata lost by bgpkit for compatibility with pybgpstream
                iterators_to_merge.append((chained_iterator, is_rib, rc))

        # Make a generator to tag each bgpkit element with metadata
        # Benefit 1: full compat with pybgpstream
        # Benefit 2: we give a key easy to access for heapq to merge
        tagged_iterators = [
            self._create_tagged_iterator(it, is_rib, rc)
            for it, is_rib, rc in iterators_to_merge
        ]

        # Merge and convert to pybgpstream format
        for timestamp, bgpkit_elem, is_rib, rc in merge(
            *tagged_iterators, key=itemgetter(0)
        ):
            # The broker query was padded, so drop elements outside the window.
            if self.ts_start <= timestamp <= self.ts_end:
                yield convert_bgpkit_elem(bgpkit_elem, is_rib, rc)

    @classmethod
    def from_config(cls, config: BGPStreamConfig):
        """Alternate constructor building the stream from a BGPStreamConfig."""
        return cls(
            ts_start=config.start_time.timestamp(),
            ts_end=config.end_time.timestamp(),
            collector_id=",".join(config.collectors),
            data_type=[
                dtype[:-1] for dtype in config.data_types
            ],  # removes plural form
            cache_dir=str(config.cache_dir) if config.cache_dir else None,
            filters=config.filters.model_dump(exclude_unset=True)
            if config.filters
            else {},
        )
|
|
@@ -0,0 +1,62 @@
|
|
|
1
|
+
import datetime
|
|
2
|
+
from pydantic import BaseModel, Field, DirectoryPath
|
|
3
|
+
from typing import Literal
|
|
4
|
+
from ipaddress import IPv4Address, IPv6Address
|
|
5
|
+
|
|
6
|
+
|
|
7
|
+
class FilterOptions(BaseModel):
    """A unified model for the available filter options.

    Every field is optional; unset fields apply no filtering.  Field names
    appear to mirror BGPKIT parser filter keys (the dumped model is passed
    to `bgpkit.Parser(filters=...)`), whose semantics can differ slightly
    from pybgpstream filters — see tests/pybgpstream_utils.py.
    """

    origin_asn: int | None = Field(
        default=None, description="Filter by the origin AS number."
    )
    # The four prefix variants are mutually independent match modes:
    # exact, exact+super, exact+sub, exact+super+sub.
    prefix: str | None = Field(
        default=None, description="Filter by an exact prefix match."
    )
    prefix_super: str | None = Field(
        default=None,
        description="Filter by the exact prefix and its more general super-prefixes.",
    )
    prefix_sub: str | None = Field(
        default=None,
        description="Filter by the exact prefix and its more specific sub-prefixes.",
    )
    prefix_super_sub: str | None = Field(
        default=None,
        description="Filter by the exact prefix and both its super- and sub-prefixes.",
    )
    peer_ip: str | IPv4Address | IPv6Address | None = Field(
        default=None, description="Filter by the IP address of a single BGP peer."
    )
    peer_ips: list[str | IPv4Address | IPv6Address] | None = Field(
        default=None, description="Filter by a list of BGP peer IP addresses."
    )
    # NOTE(review): typed as str (not int), presumably matching the BGPKIT
    # filter interface — confirm before tightening the type.
    peer_asn: str | None = Field(
        default=None, description="Filter by the AS number of the BGP peer."
    )
    update_type: Literal["withdraw", "announce"] | None = Field(
        default=None, description="Filter by the BGP update message type."
    )
    as_path: str | None = Field(
        default=None, description="Filter by a regular expression matching the AS path."
    )
|
|
43
|
+
|
|
44
|
+
|
|
45
|
+
class BGPStreamConfig(BaseModel):
    """
    Unified BGPStream config.

    Filters are primarily written for BGPKit but utils to convert to pybgpstream
    are provided in tests/pybgpstream_utils.
    """

    # Stream boundaries; elements with timestamps outside this window are dropped.
    start_time: datetime.datetime = Field(description="Start of the stream")
    end_time: datetime.datetime = Field(description="End of the stream")
    collectors: list[str] = Field(description="List of collectors to get data from")
    data_types: list[Literal["ribs", "updates"]] = Field(
        description="List of archives files to consider (`ribs` or `updates`)"
    )
    # DirectoryPath validates at construction time that the directory exists.
    cache_dir: DirectoryPath | None = Field(
        default=None,
        description="Specifies the directory for caching downloaded files.",
    )
    filters: FilterOptions | None = Field(default=None, description="Optional filters")
|
|
@@ -0,0 +1,151 @@
|
|
|
1
|
+
import argparse
|
|
2
|
+
import sys
|
|
3
|
+
import datetime
|
|
4
|
+
|
|
5
|
+
from pybgpkitstream import BGPStreamConfig, FilterOptions
|
|
6
|
+
from pybgpkitstream import BGPKITStream
|
|
7
|
+
|
|
8
|
+
|
|
9
|
+
def _build_arg_parser() -> argparse.ArgumentParser:
    """Build the CLI argument parser (stream options plus filter options)."""
    parser = argparse.ArgumentParser(
        description="Stream and filter BGP data.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    # Arguments with default values for BGPStreamConfig
    parser.add_argument(
        "--start-time",
        type=datetime.datetime.fromisoformat,
        default=datetime.datetime(2010, 9, 1, 0, 0),
        help="Start of the stream in ISO format.",
    )
    parser.add_argument(
        "--end-time",
        type=datetime.datetime.fromisoformat,
        default=datetime.datetime(2010, 9, 1, 2, 0),
        help="End of the stream in ISO format.",
    )
    parser.add_argument(
        "--collectors",
        type=str,
        nargs="+",
        default=["route-views.sydney", "route-views.wide"],
        help="List of collectors to get data from.",
    )
    parser.add_argument(
        "--data-types",
        type=str,
        nargs="+",
        choices=["ribs", "updates"],
        default=["updates"],
        help="List of archives to consider ('ribs' or 'updates').",
    )
    parser.add_argument(
        "--cache-dir",
        type=str,
        default=None,
        help="Directory for caching downloaded files.",
    )

    # Arguments for FilterOptions
    parser.add_argument(
        "--origin-asn",
        type=int,
        default=None,
        help="Filter by the origin AS number.",
    )
    parser.add_argument(
        "--prefix",
        type=str,
        default=None,
        help="Filter by an exact prefix match.",
    )
    parser.add_argument(
        "--prefix-super",
        type=str,
        default=None,
        help="Filter by the exact prefix and its more general super-prefixes.",
    )
    parser.add_argument(
        "--prefix-sub",
        type=str,
        default=None,
        help="Filter by the exact prefix and its more specific sub-prefixes.",
    )
    parser.add_argument(
        "--prefix-super-sub",
        type=str,
        default=None,
        help="Filter by the exact prefix and both its super- and sub-prefixes.",
    )
    parser.add_argument(
        "--peer-ip",
        type=str,  # Note: argparse does not directly handle Union types, so we use string here.
        default=None,
        help="Filter by the IP address of a single BGP peer.",
    )
    parser.add_argument(
        "--peer-ips",
        type=str,
        nargs="+",
        default=None,
        help="Filter by a list of BGP peer IP addresses.",
    )
    parser.add_argument(
        "--peer-asn",
        type=str,
        default=None,
        help="Filter by the AS number of the BGP peer.",
    )
    parser.add_argument(
        "--update-type",
        type=str,
        choices=["withdraw", "announce"],
        default=None,
        help="Filter by the BGP update message type.",
    )
    parser.add_argument(
        "--as-path",
        type=str,
        default=None,
        help="Filter by a regular expression matching the AS path.",
    )
    return parser


def main():
    """CLI entry point: parse arguments, build the stream, print each element.

    Exits with status 1 if streaming raises.
    """
    args = _build_arg_parser().parse_args()

    # Only forward filters the user actually provided.  Setting every field
    # explicitly (even to None) would defeat `model_dump(exclude_unset=True)`
    # downstream and pass a dict full of None values to the parser.
    filter_names = (
        "origin_asn",
        "prefix",
        "prefix_super",
        "prefix_sub",
        "prefix_super_sub",
        "peer_ip",
        "peer_ips",
        "peer_asn",
        "update_type",
        "as_path",
    )
    filter_kwargs = {
        name: getattr(args, name)
        for name in filter_names
        if getattr(args, name) is not None
    }
    # No filters at all maps to None, as BGPStreamConfig expects.
    filter_options = FilterOptions(**filter_kwargs) if filter_kwargs else None

    config = BGPStreamConfig(
        start_time=args.start_time,
        end_time=args.end_time,
        collectors=args.collectors,
        data_types=args.data_types,
        cache_dir=args.cache_dir,
        filters=filter_options,
    )

    try:
        for element in BGPKITStream.from_config(config):
            print(element)
    except Exception as e:
        print(f"An error occurred during streaming: {e}", file=sys.stderr)
        sys.exit(1)
|
|
148
|
+
|
|
149
|
+
|
|
150
|
+
# Allow running this module directly (the packaged entry point also calls main()).
if __name__ == "__main__":
    main()
|
|
File without changes
|