
Exploring an SQLite database from Jupyter Notebook

Preamble ramble

Until very recently, I had a mysterious aversion to databases. What little exposure I had had to phpMyAdmin and Microsoft Access had conveyed an impression of obscure and dangerous fiddliness, coalescing into a weighty sense of tedium.

When I initially thought about trying to build a photo album using tags and other metadata that I’d curated in digiKam, the digikam4.db file had such a force field around it that I didn’t really think about extracting data from it. I actually thought about getting digiKam to overwrite all my JPEG files to incorporate their metadata, and then using ExifTool to read them. I’m sure that would have worked too, but the bit about rewriting tens of thousands of files on my drive was not attractive.

Then I had an epiphany: Peter Teuben, in one super-simple Python script, opened his digiKam database and extracted information from it, just like that. So wait. If there’s a library for Python to manipulate SQLite databases (which of course there is), then I can play with this in Jupyter Notebook. For this kind of code (reading data and outputting to text files), we don’t strictly need the flexibility and interactivity of Jupyter over a more conventional IDE, but I’d argue it’s more convenient.

Thanks to great tools written by other people, and succinct and readable documentation for the sqlite3 module, within hours I was looking at images from my filesystem, chosen by their digiKam tags, within a fresh local TiddlyWiki. I do enjoy instant gratification…so it was, without exaggeration, a ridiculous amount of fun.

This post aims to collect, at high density, the basic information that someone using Python to look at an SQL-style database for the first time might find useful early on. It’s a summary of the things I had to discover in order to make my first simple project work. There’s nothing about writing to a database here; it’s all about reading and formatting data.

As an aside: the cross-platform GUI application “DB Browser for SQLite” (whose executable and package name for Linux is sqlitebrowser) is great for fast database exploration.


Basic setup

Import the sqlite3 module

To explore the database I only need to import one module:

import sqlite3

Connect to the database

For a read-write connection, this can be as simple as:

# bog-standard read-write connection
conn = sqlite3.connect('digikam4.db')

For illustration purposes, I have placed the .db file in the same directory as my notebook.

Specifying uri=True to treat the database parameter as a URI allows access to more options, including opening the connection read-only:

# fancy read-only connection
conn = sqlite3.connect('file:digikam4.db?mode=ro', uri=True)
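To see the read-only mode actually doing its job, here’s a small self-contained sketch: it creates a throwaway database file (the temp path and toy Tags table are illustrative, not digiKam’s real schema), reopens it with mode=ro, and shows that a write attempt raises sqlite3.OperationalError:

```python
import os
import sqlite3
import tempfile

# Create a small throwaway database file to open read-only
path = os.path.join(tempfile.mkdtemp(), "demo.db")
with sqlite3.connect(path) as rw:
    rw.execute("CREATE TABLE Tags (id INTEGER, name TEXT)")

# Reopen the same file read-only via a URI
ro = sqlite3.connect("file:" + path + "?mode=ro", uri=True)
try:
    ro.execute("INSERT INTO Tags VALUES (1, 'oops')")
except sqlite3.OperationalError as err:
    print(err)  # e.g. "attempt to write a readonly database"
```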

Create a cursor object

curs = conn.cursor()

Now I can browse the database using the cursor's methods.

Listing the tables in the database

According to the SQLite FAQ: “Every SQLite database has an SQLITE_MASTER table that defines the schema for the database.”

Aside: SQL keywords don’t have to be in all caps to work, but it does make them stand out, and because they are within quotes in the cursor.execute() method, syntax highlighting doesn’t catch them.

We can use the cursor’s fetchall() method to get the tables listed in sqlite_master:

curs.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
[('Tags',),
 ('TagsTree',),
 ('ImageTags',),
 ...
 ('VideoMetadata',)]

fetchall() returns a list of tuples, and in this case each tuple has a single element, which is why they each have a trailing comma.

Rather than fetching a list all at once into memory, we can access the matching entries one by one from the cursor:

for row in curs.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(row)
('Tags',)
('TagsTree',)
('ImageTags',)
...
('VideoMetadata',)

Inspecting the columns in a table

To look at the columns in a particular table, and inspect their properties, we can use a PRAGMA statement:

for row in curs.execute("PRAGMA table_info(Albums)"):
    print(row)
(0, 'id', 'INTEGER', 0, None, 1)
(1, 'albumRoot', 'INTEGER', 1, None, 0)
(2, 'relativePath', 'TEXT', 1, None, 0)
(3, 'date', 'DATE', 0, None, 0)
(4, 'caption', 'TEXT', 0, None, 0)
(5, 'collection', 'TEXT', 0, None, 0)
(6, 'icon', 'INTEGER', 0, None, 0)

The description attribute of the cursor can be used to get just the names of the columns:

curs.execute('SELECT * FROM Albums').description

This returns

(('id', None, None, None, None, None, None),
('albumRoot', None, None, None, None, None, None),
('relativePath', None, None, None, None, None, None),
('date', None, None, None, None, None, None),
('caption', None, None, None, None, None, None),
('collection', None, None, None, None, None, None),
('icon', None, None, None, None, None, None))

All the Nones are because the description attribute carries only one piece of information per column (its name), but each entry has to be a 7-tuple to comply with the Python DB API (Python Docs).
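A list comprehension over description strips out the Nones and gives a plain list of names. Here’s a sketch against a throwaway in-memory table (a stand-in for the real Albums table, with only a few of its columns):

```python
import sqlite3

# Throwaway in-memory table standing in for Albums
conn = sqlite3.connect(":memory:")
curs = conn.cursor()
curs.execute("CREATE TABLE Albums (id INTEGER, relativePath TEXT, date DATE)")

# Element 0 of each 7-tuple in description is the column name
col_names = [col[0] for col in curs.execute("SELECT * FROM Albums").description]
print(col_names)  # ['id', 'relativePath', 'date']
```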

This attribute can also be used to get the column names of the output of PRAGMA table_info(Albums):

curs.execute("PRAGMA table_info(Albums)").description
(('cid', None, None, None, None, None, None),
('name', None, None, None, None, None, None),
('type', None, None, None, None, None, None),
('notnull', None, None, None, None, None, None),
('dflt_value', None, None, None, None, None, None),
('pk', None, None, None, None, None, None))

Composing SQL statements with variables, using parameter substitution

The cursor.execute() method takes an SQL command in the form of a string. Often we’ll want to compose one of these strings using variables.

It is noted quite near the top of the Python docs page for sqlite3 that simply constructing these strings in Python leaves the code vulnerable to injection attacks (and to errors).

Instead we should use a placeholder (in the simplest case, ?). The execute() method takes a second argument, a tuple, whose elements are substituted for the placeholders. The substituted values are bound as data rather than parsed as SQL, so input containing SQL keywords can’t change the command in potentially damaging ways.

While this intuitive way works, it’s not considered secure:

# Don't do it this way
the_tag_string = "some_tag_name"
curs.execute("SELECT id from Tags WHERE name == '"+ the_tag_string +"'").fetchone()

This returns a tuple containing the id of the tag named “some_tag_name” in my current digikam4.db file:

(296,)
# Yes, this gets sanitized
the_tag_tuple = "some_tag_name",
curs.execute("SELECT id from Tags WHERE name == ?", the_tag_tuple).fetchone()
(296,)

This is safer. And it turns out you can make a singleton tuple out of a string just by putting a comma after it (no need for brackets).
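The trailing comma really is all it takes; the parentheses in a tuple literal are optional:

```python
# A trailing comma is what makes a tuple, not the parentheses
the_tag_tuple = "some_tag_name",
print(type(the_tag_tuple))                   # <class 'tuple'>
print(the_tag_tuple)                         # ('some_tag_name',)
print(the_tag_tuple == ("some_tag_name",))   # True
```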

Using placeholders for unknown numbers of variables

If I want to find the ids of multiple tags from their names, I use an SQL SELECT statement with a WHERE clause, using the IN operator. It works similarly to the previous example, with the tag names supplied by a tuple argument to execute(). The difference is that we may not know how many elements (in this case, how many tags) are in that tuple, but we have to find out, because each one needs its own explicit placeholder.

If there are two tags to find, and they live in a tuple called opt_tags, we need two placeholders. The SQL would look like this:

opt_tags = ("trees", "flowers")
curs.execute("SELECT id from Tags WHERE name IN (?,?)", opt_tags).fetchall()

If the tuple opt_tags is the result of a previous fetchall(), the number of elements in it probably isn’t known in advance. Moreover, there’s a good chance it’s quite large, and hand-typing all those question marks would be very tedious.

I liked the solution by Stack Overflow user @Duncan for generating the right number of placeholders using the length of the opt_tags tuple:

curs.execute("SELECT id from Tags WHERE name IN (" + ",".join("?"*len(opt_tags)) + ")", opt_tags).fetchall()
[(54,), (98,)]

Note that we have a problem if opt_tags has more than 999 elements, because SQLite has a compile-time limit (SQLITE_MAX_VARIABLE_NUMBER, 999 by default in older builds) on the number of parameters in a single query, to prevent excessive memory usage. Then the query has to be broken up to deal with the tags in batches.
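I haven’t needed this yet, but a batching version might look something like the following sketch. The helper name and batch size are my own, and the Tags table here is an in-memory stand-in for the real one:

```python
import sqlite3

def ids_for_names(curs, names, batch_size=900):
    """Look up Tags ids in batches, keeping each query under
    SQLite's per-query parameter limit (999 in older builds)."""
    ids = []
    for i in range(0, len(names), batch_size):
        batch = names[i:i + batch_size]
        placeholders = ",".join("?" * len(batch))
        query = "SELECT id FROM Tags WHERE name IN (" + placeholders + ")"
        ids.extend(row[0] for row in curs.execute(query, batch))
    return ids

# Demonstrate on a throwaway in-memory table with 2000 tags
conn = sqlite3.connect(":memory:")
curs = conn.cursor()
curs.execute("CREATE TABLE Tags (id INTEGER, name TEXT)")
curs.executemany("INSERT INTO Tags VALUES (?, ?)",
                 [(n, "tag%04d" % n) for n in range(2000)])

names = ["tag%04d" % n for n in range(1500)]
found = ids_for_names(curs, names, batch_size=900)
print(len(found))  # 1500
```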

Aside: Now I have learned the origin of the occasional error message I’ve seen from web pages or Android apps, containing long chains of question marks. If I’d known what that was about, I would have been even more reluctant to touch databases.

Formatting output

The cursor’s fetchall() method returns a list of tuples. This may or may not be the most convenient format for whatever I plan to do with it next.

Adding one column at a time to a list

To get a simple list of query results, I can go row by row adding a specific (or the only) element of each tuple to a list, by index:

opt_tags = ("trees", "flowers")
tag_ids = []
hits = curs.execute("SELECT id from Tags WHERE name IN (" + ",".join("?"*len(opt_tags)) + ")", opt_tags )
for row in hits:
    tag_ids.append(row[0])
tag_ids
[54, 98]

Using row_factory to format output

Another way to tailor the output is to use the sqlite3 connection's row_factory attribute.

Setting the row_factory to a function that takes cursor and row as input and returns row[0] (the first column of row):

conn.row_factory = lambda cursor, row: row[0]
curs = conn.cursor()

I also seem to be able to set a row_factory attribute right on the cursor, with similar results.

curs.row_factory = lambda cursor, row: row[0]
tag_ids = curs.execute("SELECT id from Tags WHERE name IN (" + ",".join("?"*len(opt_tags)) + ")", opt_tags ).fetchall()
tag_ids

fetchall() then yields a plain list, not a list of tuples:

[54, 98]

Once this output behaviour is no longer wanted (for instance, if a different column or columns may be desired from the next query), it’s important to set row_factory back to the default None.

conn.row_factory = None
curs = conn.cursor()

or

curs.row_factory = None

In this simple case, unless there are several queries in a row needing the same form of output, it seems cleaner just to specify the 0th element of each row rather than changing row_factory twice. I haven’t looked into speed or memory impact; it really doesn’t matter for my use-case.

Aside: Combining row_factory with the sqlite3.Row class looks very powerful. I don’t need this specific type of power at this moment.
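For the record, a minimal sketch of what that looks like, on a toy in-memory table rather than the real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row   # rows now support access by column name
curs = conn.cursor()
curs.execute("CREATE TABLE Tags (id INTEGER, name TEXT)")
curs.execute("INSERT INTO Tags VALUES (296, 'some_tag_name')")

row = curs.execute("SELECT id, name FROM Tags").fetchone()
print(row["name"])   # access by column name: some_tag_name
print(row.keys())    # ['id', 'name']
```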

Conclusion

That’s my collection of basic building blocks for a Python script or Jupyter notebook that reads some information from a local SQLite database file.

The next step for me is to do some more queries on my digikam.db file, massage the output a bit, and spit it out into a file format that can be imported into a TiddlyWiki.
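As a tiny illustration of that last step, here’s one way such a dump might look. The tab-separated layout and the tags.txt file name are placeholders of my own, not TiddlyWiki’s actual import format:

```python
import sqlite3

# Toy in-memory table standing in for the real Tags table
conn = sqlite3.connect(":memory:")
curs = conn.cursor()
curs.execute("CREATE TABLE Tags (id INTEGER, name TEXT)")
curs.executemany("INSERT INTO Tags VALUES (?, ?)",
                 [(54, "trees"), (98, "flowers")])

# Write one tab-separated line per row
with open("tags.txt", "w") as f:
    for tag_id, name in curs.execute("SELECT id, name FROM Tags ORDER BY id"):
        f.write("%d\t%s\n" % (tag_id, name))
```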