imdb-sqlite 1.1.0__tar.gz → 2.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
1
- Metadata-Version: 2.1
1
+ Metadata-Version: 2.4
2
2
  Name: imdb-sqlite
3
- Version: 1.1.0
3
+ Version: 2.0.0
4
4
  Summary: Imports IMDB TSV files into a SQLite database
5
5
  Home-page: https://github.com/jojje/imdb-sqlite
6
6
  Author: Jonas Tingeborn
@@ -12,6 +12,17 @@ Classifier: License :: OSI Approved :: GNU General Public License v2 (GPLv2)
12
12
  Classifier: Operating System :: OS Independent
13
13
  Description-Content-Type: text/markdown
14
14
  License-File: LICENSE
15
+ Requires-Dist: tqdm>=4.4.1
16
+ Dynamic: author
17
+ Dynamic: author-email
18
+ Dynamic: classifier
19
+ Dynamic: description
20
+ Dynamic: description-content-type
21
+ Dynamic: home-page
22
+ Dynamic: license
23
+ Dynamic: license-file
24
+ Dynamic: requires-dist
25
+ Dynamic: summary
15
26
 
16
27
  # imdb-sqlite
17
28
  Imports IMDB TSV files into a SQLite database.
@@ -35,8 +46,8 @@ The program relies on the following IMDB tab separated files:
35
46
 
36
47
  usage: imdb-sqlite [OPTIONS]
37
48
 
38
- Imports imdb tsv interface files into a new sqlitedatabase. Fetches them from
39
- imdb if not present onthe machine.
49
+ Imports imdb tsv interface files into a new sqlite database. Fetches them from imdb
50
+ if not present on the machine.
40
51
 
41
52
  optional arguments:
42
53
  -h, --help show this help message and exit
@@ -44,15 +55,19 @@ The program relies on the following IMDB tab separated files:
44
55
  --cache-dir DIR Download cache dir where the tsv files from imdb will be stored
45
56
  before the import (default: downloads)
46
57
  --no-index Do not create any indices. Massively slower joins, but cuts the DB
47
- file size approximately in half (default: False)
48
- --verbose Show database interaction (default: False)
58
+ file size approximately in half
59
+ --only TABLES Import only some tables. The tables to import are specified using
60
+ a comma delimited list, such as "people,titles". Use it to save
61
+ storage space.
62
+ --verbose Show database interaction
49
63
 
50
64
  Just run the program with no arguments, and you'll get a file named `imdb.db`
51
65
  in the current working directory.
52
66
 
53
67
  ### Hints
54
68
  * Make sure the disk the database is written to has sufficient space.
55
- About 5 GiB is needed.
69
+ About 19 GiB is needed as of early 2026, or about 9.5 GB without indices
70
+ (for even less storage, see the Disk space tips below).
56
71
  * Use a SSD to speed up the import.
57
72
  * To check the best case import performance, use an in-memory database:
58
73
  `--db :memory:`.
@@ -60,50 +75,53 @@ in the current working directory.
60
75
  ## Example
61
76
 
62
77
  $ imdb-sqlite
78
+
79
+ 2026-02-04 16:30:31,311 Populating database: imdb.db
80
+ 2026-02-04 16:30:31,319 Applying schema
81
+
82
+ 2026-02-04 16:30:31,323 Importing file: downloads\name.basics.tsv.gz
83
+ 2026-02-04 16:30:31,324 Reading number of rows ...
84
+ 2026-02-04 16:30:34,373 Inserting rows into table: people
85
+ 100%|██████████████████| 15063390/15063390 [01:05<00:00, 228659.33 rows/s]
63
86
 
64
- 2018-07-08 16:00:00,000 Populating database: imdb.db
65
- 2018-07-08 16:00:00,001 Applying schema
66
-
67
- 2018-07-08 16:00:00,005 Importing file: downloads\name.basics.tsv.gz
68
- 2018-07-08 16:00:00,005 Reading number of rows ...
69
- 2018-07-08 16:00:11,521 Inserting rows into table: people
70
- 100%|█████████████████████████| 8699964/8699964 [01:23<00:00, 104387.75 rows/s]
71
-
72
- 2018-07-08 16:01:34,868 Importing file: downloads\title.basics.tsv.gz
73
- 2018-07-08 16:01:34,868 Reading number of rows ...
74
- 2018-07-08 16:01:41,873 Inserting rows into table: titles
75
- 100%|██████████████████████████| 5110779/5110779 [00:58<00:00, 87686.98 rows/s]
76
-
77
- 2018-07-08 16:02:40,161 Importing file: downloads\title.akas.tsv.gz
78
- 2018-07-08 16:02:40,161 Reading number of rows ...
79
- 2018-07-08 16:02:44,743 Inserting rows into table: akas
80
- 100%|██████████████████████████| 3625334/3625334 [00:37<00:00, 97412.94 rows/s]
81
-
82
- 2018-07-08 16:03:21,964 Importing file: downloads\title.principals.tsv.gz
83
- 2018-07-08 16:03:21,964 Reading number of rows ...
84
- 2018-07-08 16:03:55,922 Inserting rows into table: crew
85
- 100%|███████████████████████| 28914893/28914893 [03:45<00:00, 128037.21 rows/s]
86
-
87
- 2018-07-08 16:07:41,757 Importing file: downloads\title.episode.tsv.gz
88
- 2018-07-08 16:07:41,757 Reading number of rows ...
89
- 2018-07-08 16:07:45,370 Inserting rows into table: episodes
90
- 100%|█████████████████████████| 3449903/3449903 [00:21<00:00, 158265.16 rows/s]
91
-
92
- 2018-07-08 16:08:07,172 Importing file: downloads\title.ratings.tsv.gz
93
- 2018-07-08 16:08:07,172 Reading number of rows ...
94
- 2018-07-08 16:08:08,029 Inserting rows into table: ratings
95
- 100%|███████████████████████████| 846901/846901 [00:05<00:00, 152421.27 rows/s]
96
-
97
- 2018-07-08 16:08:13,589 Creating table indices ...
98
- 2018-07-08 16:09:16,451 Import successful
87
+ 2026-02-04 16:31:40,262 Importing file: downloads\title.basics.tsv.gz
88
+ 2026-02-04 16:31:40,262 Reading number of rows ...
89
+ 2026-02-04 16:31:42,777 Inserting rows into table: titles
90
+ 100%|██████████████████| 12265715/12265715 [01:06<00:00, 185564.42 rows/s]
91
+
92
+ 2026-02-04 16:32:48,879 Importing file: downloads\title.akas.tsv.gz
93
+ 2026-02-04 16:32:48,880 Reading number of rows ...
94
+ 2026-02-04 16:32:54,646 Inserting rows into table: akas
95
+ 100%|██████████████████| 54957563/54957563 [04:06<00:00, 222556.12 rows/s]
96
+
97
+ 2026-02-04 16:37:01,586 Importing file: downloads\title.principals.tsv.gz
98
+ 2026-02-04 16:37:01,587 Reading number of rows ...
99
+ 2026-02-04 16:37:11,294 Inserting rows into table: crew
100
+ 100%|██████████████████| 97617046/97617046 [06:27<00:00, 251790.20 rows/s]
101
+
102
+ 2026-02-04 16:43:38,990 Importing file: downloads\title.episode.tsv.gz
103
+ 2026-02-04 16:43:38,990 Reading number of rows ...
104
+ 2026-02-04 16:43:39,635 Inserting rows into table: episodes
105
+ 100%|████████████████████| 9462887/9462887 [00:29<00:00, 315650.53 rows/s]
106
+
107
+ 2026-02-04 16:44:09,618 Importing file: downloads\title.ratings.tsv.gz
108
+ 2026-02-04 16:44:09,618 Reading number of rows ...
109
+ 2026-02-04 16:44:09,706 Inserting rows into table: ratings
110
+ 100%|████████████████████| 1631810/1631810 [00:05<00:00, 304073.42 rows/s]
111
+
112
+ 2026-02-04 16:44:15,077 Creating table indices ...
113
+ 100%|██████████████████████████████████| 12/12 [03:19<00:00, 16.64s/index]
114
+
115
+ 2026-02-04 16:47:34,781 Analyzing DB to generate statistic for query planner ...
116
+ 2026-02-04 16:48:01,367 Import successful
99
117
 
100
118
 
101
119
  ### Note
102
120
  The import may take a long time, since there are millions of records to
103
121
  process.
104
122
 
105
- The above example used python 3.6.4 on windows 7, with the working directory
106
- being on a SSD.
123
+ The above example used Python 3.10.13 on Windows 10, with the working directory
124
+ being on a fast Kingston NVMe SSD.
107
125
 
108
126
  ## Data model
109
127
 
@@ -117,7 +135,8 @@ reference it is in order.
117
135
 
118
136
  A movie has a title, a TV show has one. An episode has one as well. Well two
119
137
  actually; the title of the show, and the title of the episode itself. That is
120
- why there are two links to the same `title_id` attribute in the `titles` table.
138
+ why there are two links from the `episodes` table to the same `title_id`
139
+ attribute in the `titles` table.
121
140
 
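As a sketch of that double relationship, here is a toy version using Python's standard `sqlite3` module. The schema is reduced to just the columns involved, and the rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE titles (title_id VARCHAR PRIMARY KEY, primary_title VARCHAR);
CREATE TABLE episodes (
    episode_title_id VARCHAR REFERENCES titles(title_id),  -- the episode's own title
    show_title_id    VARCHAR REFERENCES titles(title_id),  -- the parent show's title
    season_number INTEGER, episode_number INTEGER
);
""")
conn.execute("INSERT INTO titles VALUES ('tt1', 'Some Show')")
conn.execute("INSERT INTO titles VALUES ('tt2', 'Pilot')")
conn.execute("INSERT INTO episodes VALUES ('tt2', 'tt1', 1, 1)")

# Joining titles twice resolves the show name and the episode name independently.
row = conn.execute("""
    SELECT st.primary_title, et.primary_title
    FROM episodes e
    JOIN titles st ON e.show_title_id = st.title_id
    JOIN titles et ON e.episode_title_id = et.title_id
""").fetchone()
print(row)  # → ('Some Show', 'Pilot')
```

Both joins land in the same `titles` table; aliasing it twice is exactly what the query examples that follow do.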
122
141
  To make the relationships a bit clearer, following are a few query examples
123
142
 
@@ -141,7 +160,7 @@ the following:
141
160
  ```sql
142
161
  -- // table aliases: st = show-title, et = episode-title
143
162
  SELECT st.primary_title, st.premiered, st.genres, e.season_number,
144
- e.eposide_number, et.primary_title, r.rating, r.votes
163
+ e.episode_number, et.primary_title, r.rating, r.votes
145
164
  FROM titles AS st
146
165
  INNER JOIN episodes e ON ( e.show_title_id = st.title_id )
147
166
  INNER JOIN titles et ON ( e.episode_title_id = et.title_id )
@@ -151,7 +170,7 @@ AND st.type = 'tvSeries'
151
170
  ORDER BY r.rating DESC
152
171
  ```
153
172
 
154
- **Find which productions both Robert Deniro and Al Pacino acted together on**
173
+ **Find which productions both Robert De Niro and Al Pacino acted together on**
155
174
  ```sql
156
175
  SELECT t.title_id, t.type, t.primary_title, t.premiered, t.genres,
157
176
  c1.characters AS 'Pacino played', c2.characters AS 'Deniro played'
@@ -187,7 +206,7 @@ massive query speedup.
187
206
  For example `sqlite3 imdb.db "CREATE INDEX myindex ON <table-name> (<slow-column>)"`
188
207
 
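The same thing can be done from Python with the standard `sqlite3` module. The `crew`/`person_id` names below follow this README's data model but are otherwise an illustrative assumption; point the connection at your real `imdb.db` instead of `:memory:`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use "imdb.db" against a real import
conn.execute("CREATE TABLE crew (title_id VARCHAR, person_id VARCHAR, category VARCHAR)")
# Create only the index your slow query actually needs.
conn.execute("CREATE INDEX ix_crew_person ON crew (person_id)")

# Confirm the index exists in the schema catalog.
names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(names)  # → ['ix_crew_person']
```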
189
208
  ### Disk space tips
190
- The imported data as of 2023 produces a database file that is about 12 GiB.
209
+ The imported data as of 2026 produces a database file that is about 19 GiB.
191
210
  About half of that space is for indices used to speed up query lookups and
192
211
  joins. The default indices take up about as much as the data.
193
212
 
@@ -196,7 +215,7 @@ ETL-step, for refreshing the dataset every now and then, and then simply export
196
215
  the full tables (e.g. for data science using pandas/ML), a `--no-index` flag is
197
216
  available. When specifying this flag, no indices will be created, which not
198
217
  only saves about 50% disk space, but also speeds up the overall import process.
199
+ When this flag is provided, the DB file will be just 9.5 GiB as of the date of
218
+ When this flag is provided, the DB file will be just 9.5 GiB as of date of
200
219
  writing.
201
220
 
202
221
  If you know precisely which indices you need, omitting the default indices may
@@ -204,6 +223,43 @@ also be a good idea, since you'd then not waste disk space on indices you don't
204
223
  need. Simply create the indices you _do_ need manually, as illustrated in the
205
224
  performance tip above.
206
225
 
226
+ As an indicator, the following shows the current space consumption spread across the tables.
227
+
228
+ Full import
229
+
230
+ * default (includes indices): 19 GB
231
+ * without indices: 9.5 GB
232
+
233
+ Sizes of the respective tables when doing selective import of only a single
234
+ table without indices.
235
+
236
+ ```
237
+ * crew: 46% (4.4 GB)
238
+ * akas: 28% (2.7 GB)
239
+ * titles: 14% (1.3 GB)
240
+ * people: 8% (0.8 GB)
241
+ * episodes: 3% (0.3 GB)
242
+ * ratings: 1% (0.1 GB)
243
+ ```
244
+
245
+ Percentages are relative to the full index-free import
246
+ (~9.5 GB).
247
+
248
+ Fair to say, "who played what character", or "who fetched a doughnut for which
249
+ VIP", accounts for about half the storage. If you can live
250
+ without those details, then there's a massive storage saving to be made. Also, if
251
+ you don't need all the aliases for all the titles, like the Portuguese title of
252
+ some Bollywood flick, then the akas table can also be skipped. Dropping those
253
+ two tables shaves off 3/4 of the required space. That's significant.
254
+
255
+ If you don't care about characters, and just want to query movies or shows, their
256
+ ratings, and perhaps per-episode ratings as well, then 2 GiB of storage suffices,
257
+ as you only need the titles, episodes and ratings tables; just provide the command
258
+ line argument `--only titles,ratings,episodes`. However, if you actually want to
259
+ run queries against those tables, then you'd want to create indices, either manually
260
+ or using the defaults. This ups the space requirement by about 50% (to roughly 3 GiB).
261
+
262
+
207
263
  ## PyPI
208
264
  Current status of the project is:
209
265
  [![Build Status](https://github.com/jojje/imdb-sqlite/actions/workflows/python-publish.yml/badge.svg)](https://github.com/jojje/imdb-sqlite/actions/workflows/python-publish.yml)
@@ -20,8 +20,8 @@ The program relies on the following IMDB tab separated files:
20
20
 
21
21
  usage: imdb-sqlite [OPTIONS]
22
22
 
23
- Imports imdb tsv interface files into a new sqlitedatabase. Fetches them from
24
- imdb if not present onthe machine.
23
+ Imports imdb tsv interface files into a new sqlite database. Fetches them from imdb
24
+ if not present on the machine.
25
25
 
26
26
  optional arguments:
27
27
  -h, --help show this help message and exit
@@ -29,15 +29,19 @@ The program relies on the following IMDB tab separated files:
29
29
  --cache-dir DIR Download cache dir where the tsv files from imdb will be stored
30
30
  before the import (default: downloads)
31
31
  --no-index Do not create any indices. Massively slower joins, but cuts the DB
32
- file size approximately in half (default: False)
33
- --verbose Show database interaction (default: False)
32
+ file size approximately in half
33
+ --only TABLES Import only some tables. The tables to import are specified using
34
+ a comma delimited list, such as "people,titles". Use it to save
35
+ storage space.
36
+ --verbose Show database interaction
34
37
 
35
38
  Just run the program with no arguments, and you'll get a file named `imdb.db`
36
39
  in the current working directory.
37
40
 
38
41
  ### Hints
39
42
  * Make sure the disk the database is written to has sufficient space.
40
- About 5 GiB is needed.
43
+ About 19 GiB is needed as of early 2026, or about 9.5 GB without indices
44
+ (for even less storage, see the Disk space tips below).
41
45
  * Use a SSD to speed up the import.
42
46
  * To check the best case import performance, use an in-memory database:
43
47
  `--db :memory:`.
@@ -45,50 +49,53 @@ in the current working directory.
45
49
  ## Example
46
50
 
47
51
  $ imdb-sqlite
52
+
53
+ 2026-02-04 16:30:31,311 Populating database: imdb.db
54
+ 2026-02-04 16:30:31,319 Applying schema
55
+
56
+ 2026-02-04 16:30:31,323 Importing file: downloads\name.basics.tsv.gz
57
+ 2026-02-04 16:30:31,324 Reading number of rows ...
58
+ 2026-02-04 16:30:34,373 Inserting rows into table: people
59
+ 100%|██████████████████| 15063390/15063390 [01:05<00:00, 228659.33 rows/s]
48
60
 
49
- 2018-07-08 16:00:00,000 Populating database: imdb.db
50
- 2018-07-08 16:00:00,001 Applying schema
51
-
52
- 2018-07-08 16:00:00,005 Importing file: downloads\name.basics.tsv.gz
53
- 2018-07-08 16:00:00,005 Reading number of rows ...
54
- 2018-07-08 16:00:11,521 Inserting rows into table: people
55
- 100%|█████████████████████████| 8699964/8699964 [01:23<00:00, 104387.75 rows/s]
56
-
57
- 2018-07-08 16:01:34,868 Importing file: downloads\title.basics.tsv.gz
58
- 2018-07-08 16:01:34,868 Reading number of rows ...
59
- 2018-07-08 16:01:41,873 Inserting rows into table: titles
60
- 100%|██████████████████████████| 5110779/5110779 [00:58<00:00, 87686.98 rows/s]
61
-
62
- 2018-07-08 16:02:40,161 Importing file: downloads\title.akas.tsv.gz
63
- 2018-07-08 16:02:40,161 Reading number of rows ...
64
- 2018-07-08 16:02:44,743 Inserting rows into table: akas
65
- 100%|██████████████████████████| 3625334/3625334 [00:37<00:00, 97412.94 rows/s]
66
-
67
- 2018-07-08 16:03:21,964 Importing file: downloads\title.principals.tsv.gz
68
- 2018-07-08 16:03:21,964 Reading number of rows ...
69
- 2018-07-08 16:03:55,922 Inserting rows into table: crew
70
- 100%|███████████████████████| 28914893/28914893 [03:45<00:00, 128037.21 rows/s]
71
-
72
- 2018-07-08 16:07:41,757 Importing file: downloads\title.episode.tsv.gz
73
- 2018-07-08 16:07:41,757 Reading number of rows ...
74
- 2018-07-08 16:07:45,370 Inserting rows into table: episodes
75
- 100%|█████████████████████████| 3449903/3449903 [00:21<00:00, 158265.16 rows/s]
76
-
77
- 2018-07-08 16:08:07,172 Importing file: downloads\title.ratings.tsv.gz
78
- 2018-07-08 16:08:07,172 Reading number of rows ...
79
- 2018-07-08 16:08:08,029 Inserting rows into table: ratings
80
- 100%|███████████████████████████| 846901/846901 [00:05<00:00, 152421.27 rows/s]
81
-
82
- 2018-07-08 16:08:13,589 Creating table indices ...
83
- 2018-07-08 16:09:16,451 Import successful
61
+ 2026-02-04 16:31:40,262 Importing file: downloads\title.basics.tsv.gz
62
+ 2026-02-04 16:31:40,262 Reading number of rows ...
63
+ 2026-02-04 16:31:42,777 Inserting rows into table: titles
64
+ 100%|██████████████████| 12265715/12265715 [01:06<00:00, 185564.42 rows/s]
65
+
66
+ 2026-02-04 16:32:48,879 Importing file: downloads\title.akas.tsv.gz
67
+ 2026-02-04 16:32:48,880 Reading number of rows ...
68
+ 2026-02-04 16:32:54,646 Inserting rows into table: akas
69
+ 100%|██████████████████| 54957563/54957563 [04:06<00:00, 222556.12 rows/s]
70
+
71
+ 2026-02-04 16:37:01,586 Importing file: downloads\title.principals.tsv.gz
72
+ 2026-02-04 16:37:01,587 Reading number of rows ...
73
+ 2026-02-04 16:37:11,294 Inserting rows into table: crew
74
+ 100%|██████████████████| 97617046/97617046 [06:27<00:00, 251790.20 rows/s]
75
+
76
+ 2026-02-04 16:43:38,990 Importing file: downloads\title.episode.tsv.gz
77
+ 2026-02-04 16:43:38,990 Reading number of rows ...
78
+ 2026-02-04 16:43:39,635 Inserting rows into table: episodes
79
+ 100%|████████████████████| 9462887/9462887 [00:29<00:00, 315650.53 rows/s]
80
+
81
+ 2026-02-04 16:44:09,618 Importing file: downloads\title.ratings.tsv.gz
82
+ 2026-02-04 16:44:09,618 Reading number of rows ...
83
+ 2026-02-04 16:44:09,706 Inserting rows into table: ratings
84
+ 100%|████████████████████| 1631810/1631810 [00:05<00:00, 304073.42 rows/s]
85
+
86
+ 2026-02-04 16:44:15,077 Creating table indices ...
87
+ 100%|██████████████████████████████████| 12/12 [03:19<00:00, 16.64s/index]
88
+
89
+ 2026-02-04 16:47:34,781 Analyzing DB to generate statistic for query planner ...
90
+ 2026-02-04 16:48:01,367 Import successful
84
91
 
85
92
 
86
93
  ### Note
87
94
  The import may take a long time, since there are millions of records to
88
95
  process.
89
96
 
90
- The above example used python 3.6.4 on windows 7, with the working directory
91
- being on a SSD.
97
+ The above example used Python 3.10.13 on Windows 10, with the working directory
98
+ being on a fast Kingston NVMe SSD.
92
99
 
93
100
  ## Data model
94
101
 
@@ -102,7 +109,8 @@ reference it is in order.
102
109
 
103
110
  A movie has a title, a TV show has one. An episode has one as well. Well two
104
111
  actually; the title of the show, and the title of the episode itself. That is
105
- why there are two links to the same `title_id` attribute in the `titles` table.
112
+ why there are two links from the `episodes` table to the same `title_id`
113
+ attribute in the `titles` table.
106
114
 
107
115
  To make the relationships a bit clearer, following are a few query examples
108
116
 
@@ -126,7 +134,7 @@ the following:
126
134
  ```sql
127
135
  -- // table aliases: st = show-title, et = episode-title
128
136
  SELECT st.primary_title, st.premiered, st.genres, e.season_number,
129
- e.eposide_number, et.primary_title, r.rating, r.votes
137
+ e.episode_number, et.primary_title, r.rating, r.votes
130
138
  FROM titles AS st
131
139
  INNER JOIN episodes e ON ( e.show_title_id = st.title_id )
132
140
  INNER JOIN titles et ON ( e.episode_title_id = et.title_id )
@@ -136,7 +144,7 @@ AND st.type = 'tvSeries'
136
144
  ORDER BY r.rating DESC
137
145
  ```
138
146
 
139
- **Find which productions both Robert Deniro and Al Pacino acted together on**
147
+ **Find which productions both Robert De Niro and Al Pacino acted together on**
140
148
  ```sql
141
149
  SELECT t.title_id, t.type, t.primary_title, t.premiered, t.genres,
142
150
  c1.characters AS 'Pacino played', c2.characters AS 'Deniro played'
@@ -172,7 +180,7 @@ massive query speedup.
172
180
  For example `sqlite3 imdb.db "CREATE INDEX myindex ON <table-name> (<slow-column>)"`
173
181
 
174
182
  ### Disk space tips
175
- The imported data as of 2023 produces a database file that is about 12 GiB.
183
+ The imported data as of 2026 produces a database file that is about 19 GiB.
176
184
  About half of that space is for indices used to speed up query lookups and
177
185
  joins. The default indices take up about as much as the data.
178
186
 
@@ -181,7 +189,7 @@ ETL-step, for refreshing the dataset every now and then, and then simply export
181
189
  the full tables (e.g. for data science using pandas/ML), a `--no-index` flag is
182
190
  available. When specifying this flag, no indices will be created, which not
183
191
  only saves about 50% disk space, but also speeds up the overall import process.
184
- When this flag is provided, the DB file will be just shy of 6 GiB as of date of
192
+ When this flag is provided, the DB file will be just 9.5 GiB as of the date of
185
193
  writing.
186
194
 
187
195
  If you know precisely which indices you need, omitting the default indices may
@@ -189,6 +197,43 @@ also be a good idea, since you'd then not waste disk space on indices you don't
189
197
  need. Simply create the indices you _do_ need manually, as illustrated in the
190
198
  performance tip above.
191
199
 
200
+ As an indicator, the following shows the current space consumption spread across the tables.
201
+
202
+ Full import
203
+
204
+ * default (includes indices): 19 GB
205
+ * without indices: 9.5 GB
206
+
207
+ Sizes of the respective tables when doing selective import of only a single
208
+ table without indices.
209
+
210
+ ```
211
+ * crew: 46% (4.4 GB)
212
+ * akas: 28% (2.7 GB)
213
+ * titles: 14% (1.3 GB)
214
+ * people: 8% (0.8 GB)
215
+ * episodes: 3% (0.3 GB)
216
+ * ratings: 1% (0.1 GB)
217
+ ```
218
+
219
+ Percentages are relative to the full index-free import
220
+ (~9.5 GB).
221
+
222
+ Fair to say, "who played what character", or "who fetched a doughnut for which
223
+ VIP", accounts for about half the storage. If you can live
224
+ without those details, then there's a massive storage saving to be made. Also, if
225
+ you don't need all the aliases for all the titles, like the Portuguese title of
226
+ some Bollywood flick, then the akas table can also be skipped. Dropping those
227
+ two tables shaves off 3/4 of the required space. That's significant.
228
+
229
+ If you don't care about characters, and just want to query movies or shows, their
230
+ ratings, and perhaps per-episode ratings as well, then 2 GiB of storage suffices,
231
+ as you only need the titles, episodes and ratings tables; just provide the command
232
+ line argument `--only titles,ratings,episodes`. However, if you actually want to
233
+ run queries against those tables, then you'd want to create indices, either manually
234
+ or using the defaults. This ups the space requirement by about 50% (to roughly 3 GiB).
235
+
236
+
192
237
  ## PyPI
193
238
  Current status of the project is:
194
239
  [![Build Status](https://github.com/jojje/imdb-sqlite/actions/workflows/python-publish.yml/badge.svg)](https://github.com/jojje/imdb-sqlite/actions/workflows/python-publish.yml)
@@ -100,7 +100,7 @@ TSV_TABLE_MAP = OrderedDict([
100
100
  ('title.ratings.tsv.gz',
101
101
  ('ratings', OrderedDict([
102
102
  ('tconst', Column(name='title_id', type='VARCHAR PRIMARY KEY')),
103
- ('averageRating', Column(name='rating', type='INTEGER')),
103
+ ('averageRating', Column(name='rating', type='REAL')),
104
104
  ('numVotes', Column(name='votes', type='INTEGER')),
105
105
  ]))),
106
106
  ])
@@ -109,7 +109,8 @@ TSV_TABLE_MAP = OrderedDict([
109
109
  class Database:
110
110
  """ Shallow DB abstraction """
111
111
 
112
- def __init__(self, uri=':memory:'):
112
+ def __init__(self, table_map, uri=':memory:'):
113
+ self.table_map = table_map
113
114
  exists = os.path.exists(uri)
114
115
  self.connection = sqlite3.connect(uri, isolation_level=None)
115
116
  self.connection.executescript("""
@@ -128,14 +129,14 @@ class Database:
128
129
 
129
130
  def create_tables(self):
130
131
  sqls = [self._create_table_sql(table, mapping.values())
131
- for table, mapping in TSV_TABLE_MAP.values()]
132
+ for table, mapping in self.table_map.values()]
132
133
  sql = '\n'.join(sqls)
133
134
  logger.debug(sql)
134
135
  self.connection.executescript(sql)
135
136
 
136
137
  def create_indices(self):
137
138
  sqls = [self._create_index_sql(table, mapping.values())
138
- for table, mapping in TSV_TABLE_MAP.values()]
139
+ for table, mapping in self.table_map.values()]
139
140
  sql = '\n'.join([s for s in sqls if s])
140
141
  logger.debug(sql)
141
142
  for stmt in tqdm(sql.split('\n'), unit='index'):
@@ -206,7 +207,7 @@ def ensure_downloaded(files, cache_dir):
206
207
  ofn = os.path.join(cache_dir, filename)
207
208
 
208
209
  if os.path.exists(ofn):
209
- return
210
+ continue
210
211
 
211
212
  logger.info('GET %s -> %s', url, ofn)
212
213
  with urlopen(url) as response:
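One of the fixes in this release replaces a `return` with a `continue` in the download loop. The difference matters when some files are already cached: `return` aborted the whole loop at the first cache hit, while `continue` merely skips that file. A minimal stand-alone sketch (real file names, simplified logic):

```python
def to_download(files, cached):
    # With `return` in place of `continue`, the first cached file would
    # abort the loop and the remaining files would never be fetched.
    remaining = []
    for filename in files:
        if filename in cached:
            continue  # the 2.0.0 behavior: skip only the cached file
        remaining.append(filename)
    return remaining

files = ['name.basics.tsv.gz', 'title.basics.tsv.gz', 'title.ratings.tsv.gz']
print(to_download(files, cached={'name.basics.tsv.gz'}))
# → ['title.basics.tsv.gz', 'title.ratings.tsv.gz']
```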
@@ -286,11 +287,22 @@ def import_file(db, filename, table, column_mapping):
286
287
  raise
287
288
 
288
289
 
290
+ def filter_table_subset(table_map, wanted_tables):
291
+ def split_csv(s): return [v for v in (v.strip() for v in s.split(',')) if v]
292
+
293
+ wanted_tables = split_csv(wanted_tables)
294
+ out = OrderedDict()
295
+ for filename, (table_name, table_spec) in table_map.items():
296
+ if table_name in wanted_tables:
297
+ out[filename] = (table_name, table_spec)
298
+ return out
299
+
300
+
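The new `filter_table_subset` helper backs the `--only` flag. Reproduced below with a dummy table map (real file and table names, empty column specs) to show that it preserves the import order and tolerates stray whitespace in the comma list:

```python
from collections import OrderedDict

def filter_table_subset(table_map, wanted_tables):
    # As added in 2.0.0: keep only the entries whose table name was requested.
    def split_csv(s): return [v for v in (v.strip() for v in s.split(',')) if v]

    wanted_tables = split_csv(wanted_tables)
    out = OrderedDict()
    for filename, (table_name, table_spec) in table_map.items():
        if table_name in wanted_tables:
            out[filename] = (table_name, table_spec)
    return out

# Stand-in table map: real file/table names, dummy column specs.
table_map = OrderedDict([
    ('name.basics.tsv.gz',   ('people',  {})),
    ('title.basics.tsv.gz',  ('titles',  {})),
    ('title.ratings.tsv.gz', ('ratings', {})),
])
subset = filter_table_subset(table_map, ' titles, ratings ')
print(list(subset))  # → ['title.basics.tsv.gz', 'title.ratings.tsv.gz']
```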
289
301
  def main():
290
302
  parser = argparse.ArgumentParser(
291
303
  formatter_class=argparse.ArgumentDefaultsHelpFormatter,
292
- description='Imports imdb tsv interface files into a new sqlite'
293
- 'database. Fetches them from imdb if not present on'
304
+ description='Imports imdb tsv interface files into a new sqlite '
305
+ 'database. Fetches them from imdb if not present on '
294
306
  'the machine.'
295
307
  )
296
308
  parser.add_argument('--db', metavar='FILE', default='imdb.db',
@@ -300,6 +312,9 @@ def main():
300
312
  parser.add_argument('--no-index', action='store_true',
301
313
  help='Do not create any indices. Massively slower joins, but cuts the DB file size '
302
314
  'approximately in half')
315
+ parser.add_argument('--only', metavar='TABLES',
316
+ help='Import only some tables. The tables to import are specified using a comma delimited '
317
+ 'list, such as "people,titles". Use it to save storage space.')
303
318
  parser.add_argument('--verbose', action='store_true',
304
319
  help='Show database interaction')
305
320
  opts = parser.parse_args()
@@ -312,11 +327,13 @@ def main():
312
327
  logger.warning('DB already exists: ({db}). Refusing to modify. Exiting'.format(db=opts.db))
313
328
  return 1
314
329
 
315
- ensure_downloaded(TSV_TABLE_MAP.keys(), opts.cache_dir)
330
+ table_map = filter_table_subset(TSV_TABLE_MAP, opts.only) if opts.only else TSV_TABLE_MAP
331
+
332
+ ensure_downloaded(table_map.keys(), opts.cache_dir)
316
333
  logger.info('Populating database: {}'.format(opts.db))
317
- db = Database(uri=opts.db)
334
+ db = Database(table_map=table_map, uri=opts.db)
318
335
 
319
- for filename, table_mapping in TSV_TABLE_MAP.items():
336
+ for filename, table_mapping in table_map.items():
320
337
  table, column_mapping = table_mapping
321
338
  import_file(db, os.path.join(opts.cache_dir, filename),
322
339
  table, column_mapping)
116
+ 2026-02-04 16:48:01,367 Import successful
99
117
 
100
118
 
101
119
  ### Note
102
120
  The import may take a long time, since there are millions of records to
103
121
  process.
104
122
 
105
- The above example used python 3.6.4 on windows 7, with the working directory
106
- being on a SSD.
123
+ The above example used Python 3.10.13 on Windows 10, with the working directory
124
+ being on a fast Kingston NVMe SSD.
107
125
 
108
126
  ## Data model
109
127
 
@@ -117,7 +135,8 @@ reference it is in order.
117
135
 
118
136
  A movie has a title, a TV show has one. An episode has one as well. Well two
119
137
  actually; the title of the show, and the title of the episode itself. That is
120
- why there are two links to the same `title_id` attribute in the `titles` table.
138
+ why there are two links to the same `title_id` attribute in the `titles` table,
139
+ from the `episodes` table.
121
140
 
122
141
  To make the relationships a bit clearer, following are a few query examples
123
142
 
@@ -141,7 +160,7 @@ the following:
141
160
  ```sql
142
161
  -- // table aliases: st = show-title, et = episode-title
143
162
  SELECT st.primary_title, st.premiered, st.genres, e.season_number,
144
- e.eposide_number, et.primary_title, r.rating, r.votes
163
+ e.episode_number, et.primary_title, r.rating, r.votes
145
164
  FROM titles AS st
146
165
  INNER JOIN episodes e ON ( e.show_title_id = st.title_id )
147
166
  INNER JOIN titles et ON ( e.episode_title_id = et.title_id )
@@ -151,7 +170,7 @@ AND st.type = 'tvSeries'
151
170
  ORDER BY r.rating DESC
152
171
  ```
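The join shape used above can be sanity-checked against a tiny in-memory database with Python's built-in `sqlite3` module. The sketch below is a condensed stand-in: it creates only the columns the query touches, the sample rows are made up, and the `ratings` join condition is an assumption based on the data model described here, not copied from the real schema file.

```python
import sqlite3

# Stand-in schema containing just the columns the example query uses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE titles   (title_id TEXT PRIMARY KEY, type TEXT, primary_title TEXT,
                       premiered INTEGER, genres TEXT);
CREATE TABLE episodes (show_title_id TEXT, episode_title_id TEXT,
                       season_number INTEGER, episode_number INTEGER);
CREATE TABLE ratings  (title_id TEXT PRIMARY KEY, rating REAL, votes INTEGER);
""")
conn.executemany("INSERT INTO titles VALUES (?,?,?,?,?)", [
    ("tt1", "tvSeries",  "Example Show", 2020, "Drama"),  # the show itself
    ("tt2", "tvEpisode", "Pilot",        2020, "Drama"),  # one of its episodes
])
conn.execute("INSERT INTO episodes VALUES ('tt1', 'tt2', 1, 1)")
conn.execute("INSERT INTO ratings  VALUES ('tt2', 8.5, 1234)")

# Same double link into titles: st = show-title, et = episode-title.
rows = conn.execute("""
SELECT st.primary_title, e.season_number, e.episode_number,
       et.primary_title, r.rating, r.votes
FROM titles st
INNER JOIN episodes e ON ( e.show_title_id    = st.title_id )
INNER JOIN titles  et ON ( e.episode_title_id = et.title_id )
INNER JOIN ratings  r ON ( et.title_id        = r.title_id )
WHERE st.type = 'tvSeries'
ORDER BY r.rating DESC
""").fetchall()
print(rows)  # [('Example Show', 1, 1, 'Pilot', 8.5, 1234)]
```

The single episode row links twice into `titles`, once for the show's name and once for the episode's own name, which is exactly the double `title_id` relationship described in the data model section.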
153
172
 
154
- **Find which productions both Robert Deniro and Al Pacino acted together on**
173
+ **Find which productions both Robert De Niro and Al Pacino acted together on**
155
174
  ```sql
156
175
  SELECT t.title_id, t.type, t.primary_title, t.premiered, t.genres,
157
176
  c1.characters AS 'Pacino played', c2.characters AS 'Deniro played'
@@ -187,7 +206,7 @@ massive query speedup.
187
206
  For example `sqlite3 imdb.db "CREATE INDEX myindex ON <table-name> (<slow-column>)"`
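The same can be done from Python, and `EXPLAIN QUERY PLAN` lets you verify that SQLite actually picks the new index for a slow query. A minimal sketch against an in-memory database; the table mirrors the `ratings` columns from the data model, and the index name is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for imdb.db
conn.execute("CREATE TABLE ratings (title_id TEXT, rating REAL, votes INTEGER)")
conn.execute("CREATE INDEX ix_ratings_rating ON ratings (rating)")

# The 'detail' column of the plan names the index SQLite chose for the scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM ratings WHERE rating > 8"
).fetchall()
print(plan[0][-1])  # mentions ix_ratings_rating
```

If the plan output says `SCAN` over the table instead of `SEARCH ... USING INDEX`, the index is not helping that query and is only costing disk space.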
188
207
 
189
208
  ### Disk space tips
190
- The imported data as of 2023 produces a database file that is about 12 GiB.
209
+ The imported data as of 2026 produces a database file that is about 19 GiB.
191
210
  About half of that space is for indices used to speed up query lookups and
192
211
  joins. The default indices take up about as much as the data.
193
212
 
@@ -196,7 +215,7 @@ ETL-step, for refreshing the dataset every now and then, and then simply export
196
215
  the full tables (e.g. for data science using pandas/ML), a `--no-index` flag is
197
216
  available. When specifying this flag, no indices will be created, which not
198
217
  only saves about 50% disk space, but also speeds up the overall import process.
199
- When this flag is provided, the DB file will be just shy of 6 GiB as of date of
218
+ When this flag is provided, the DB file will be about 9.5 GiB as of the date of
200
219
  writing.
201
220
 
202
221
  If you know precisely which indices you need, omitting the default indices may
@@ -204,6 +223,43 @@ also be a good idea, since you'd then not waste disk space on indices you don't
204
223
  need. Simply create the indices you _do_ need manually, as illustrated in the
205
224
  performance tip above.
206
225
 
226
+ As an indicator, the following shows how the current space consumption is spread across the tables.
227
+
228
+ Full import
229
+
230
+ * default (includes indices): 19 GiB
231
+ * without indices: 9.5 GiB
232
+
233
+ Sizes of the respective tables when doing selective import of only a single
234
+ table without indices.
235
+
236
+ ```
237
+ * crew: 46% (4.4 GiB)
238
+ * akas: 28% (2.7 GiB)
239
+ * titles: 14% (1.3 GiB)
240
+ * people: 8% (0.8 GiB)
241
+ * episodes: 3% (0.3 GiB)
242
+ * ratings: 1% (0.1 GiB)
243
+ ```
244
+
245
+ Percentages are the relative space consumption of the full index-free import
246
+ (~9.5 GiB).
247
+
248
+ Fair to say, "who played what character", or "who fetched a doughnut for which
249
+ VIP", accounts for about half the storage. If you can live
250
+ without those details, then there's a massive storage saving to be made. Also, if
251
+ you don't need all the aliases for all the titles, like the Portuguese title of
252
+ some Bollywood flick, then the akas table can also be skipped. Getting rid of
253
+ those two tables shaves off three quarters of the required space. That's significant.
254
+
255
+ If you don't care about characters, and just want to query movies or shows, their
256
+ ratings and perhaps per-episode ratings as well, then 2 GiB of storage suffices
257
+ as you only need tables titles, episodes and ratings. However if you actually
258
+ want to query those tables as well, then you'd want to create indices, either
259
+ manually or use the default. This ups the space requirement about 50% (3GB).
260
+ I.e. just provide the command line argument `--only titles,ratings,episodes`.
261
+
262
+
207
263
  ## PyPI
208
264
  Current status of the project is:
209
265
  [![Build Status](https://github.com/jojje/imdb-sqlite/actions/workflows/python-publish.yml/badge.svg)](https://github.com/jojje/imdb-sqlite/actions/workflows/python-publish.yml)
@@ -1,5 +1,6 @@
1
1
  LICENSE
2
2
  README.md
3
+ pyproject.toml
3
4
  setup.py
4
5
  imdb_sqlite/__init__.py
5
6
  imdb_sqlite/__main__.py
@@ -0,0 +1,3 @@
1
+ [build-system]
2
+ requires = ["setuptools>=40.8.0", "wheel"]
3
+ build-backend = "setuptools.build_meta"
@@ -5,7 +5,7 @@ with open('README.md', 'r') as fh:
5
5
 
6
6
  setuptools.setup(
7
7
  name='imdb-sqlite',
8
- version='1.1.0',
8
+ version='2.0.0',
9
9
  author='Jonas Tingeborn',
10
10
  author_email='tinjon+pip@gmail.com',
11
11
  description='Imports IMDB TSV files into a SQLite database',
File without changes
File without changes