Exploring an SQLite database from Jupyter Notebook
Until very recently, I had a mysterious aversion to databases. What little exposure I had had to phpMyAdmin and Microsoft Access had conveyed an impression of obscure and dangerous fiddliness, coalescing to a weighty sense of tedium.
When I initially thought about trying to build a photo album using tags and other metadata that I’d curated in digiKam, the
digikam4.db file had such a force field around it that I didn’t really think about extracting data from it. I actually thought about getting digiKam to overwrite all my JPEG files to incorporate their metadata, and then using ExifTool to read them. I’m sure that would have worked too, but the bit about rewriting tens of thousands of files on my drive was not attractive.
Then I had an epiphany, because Peter Teuben, in one super-simple Python script, opened his digiKam database and extracted information from it, just like that. So wait. If there’s a library for Python to manipulate SQLite databases (which of course there is), then I can play with this in Jupyter Notebook. For this kind of code, reading data and outputting to text files, we don’t really need the flexibility and interactivity of Jupyter or a more conventional IDE, but I’d argue it’s more convenient.
Thanks to great tools written by other people, and succinct and readable documentation for the
sqlite3 module, within hours I was looking at images from my filesystem, chosen by their digiKam tags, within a fresh local TiddlyWiki. I do enjoy instant gratification…so it was, without exaggeration, a ridiculous amount of fun.
This post collects basic information that someone using Python to look at an SQL-style database for the first time might find useful early on. It is a summary of the things I had to discover in order to make my first simple project work. There’s nothing about writing to a database here; it’s all about reading and formatting data.
As an aside: the cross-platform GUI application “DB Browser for SQLite” (whose executable and package name for Linux is
sqlitebrowser) is great for fast database exploration.
- Basic setup
- Listing the tables in the database
- Inspecting the columns in a table
- Composing SQL statements with variables, using parameter substitution
- Using placeholders for unknown numbers of variables
- Formatting output
To explore the database I only need to import one module:
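That one module is Python’s standard-library sqlite3:

```python
import sqlite3
```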
Connect to the database
For a read-write connection, this can be as simple as:
```python
# bog-standard read-write connection
conn = sqlite3.connect('digikam4.db')
```
For illustration purposes, I have placed the
.db file in the same directory as my notebook.
Setting uri=True, which treats the database parameter as a URI, allows access to more options, including opening the connection read-only:
```python
# fancy read-only connection
conn = sqlite3.connect('file:digikam4.db?mode=ro', uri=True)
```
```python
curs = conn.cursor()
```
Now I can browse the database using the cursor object curs.
Listing the tables in the database
According to the SQLite FAQ: “Every SQLite database has an SQLITE_MASTER table that defines the schema for the database.”
Aside: SQL keywords don’t have to be in all caps to work, but it does make them stand out, and because they are within quotes in the
cursor.execute() method, syntax highlighting doesn’t catch them.
We can use the cursor’s fetchall() method to get the tables listed in sqlite_master:
```python
curs.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
```

```
[('Tags',), ('TagsTree',), ('ImageTags',), ... ('VideoMetadata',)]
```
fetchall() returns a list of tuples, and in this case each tuple has a single element, which is why they each have a trailing comma.
Rather than fetching a list all at once into memory, we can access the matching entries one by one from the cursor:
```python
for row in curs.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(row)
```

```
('Tags',)
('TagsTree',)
('ImageTags',)
...
('VideoMetadata',)
```
Inspecting the columns in a table
To look at the columns in a particular table, and inspect their properties, we can use a PRAGMA table_info statement:
```python
for row in curs.execute("PRAGMA table_info(Albums)"):
    print(row)
```

```
(0, 'id', 'INTEGER', 0, None, 1)
(1, 'albumRoot', 'INTEGER', 1, None, 0)
(2, 'relativePath', 'TEXT', 1, None, 0)
(3, 'date', 'DATE', 0, None, 0)
(4, 'caption', 'TEXT', 0, None, 0)
(5, 'collection', 'TEXT', 0, None, 0)
(6, 'icon', 'INTEGER', 0, None, 0)
```
The description attribute of the cursor can be used to get just the names of the columns:
```python
curs.execute('SELECT * FROM Albums').description
```

```
(('id', None, None, None, None, None, None),
 ('albumRoot', None, None, None, None, None, None),
 ('relativePath', None, None, None, None, None, None),
 ('date', None, None, None, None, None, None),
 ('caption', None, None, None, None, None, None),
 ('collection', None, None, None, None, None, None),
 ('icon', None, None, None, None, None, None))
```
The Nones are there because the description attribute only has one piece of information per column, but each entry has to be a 7-tuple to comply with the Python DB API (Python Docs).
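If only the names are wanted, a common idiom is to take the first element of each of those 7-tuples. A minimal sketch, using a throwaway in-memory database whose Albums columns are stand-ins for the real ones:

```python
import sqlite3

# Throwaway in-memory database with a stand-in Albums table
conn = sqlite3.connect(':memory:')
curs = conn.cursor()
curs.execute("CREATE TABLE Albums (id INTEGER, relativePath TEXT)")

curs.execute("SELECT * FROM Albums")
# Each entry of description is a 7-tuple; element 0 is the column name
names = [d[0] for d in curs.description]
names  # ['id', 'relativePath']
```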
This attribute can also be used to get the column names of the output of PRAGMA table_info:
```
(('cid', None, None, None, None, None, None),
 ('name', None, None, None, None, None, None),
 ('type', None, None, None, None, None, None),
 ('notnull', None, None, None, None, None, None),
 ('dflt_value', None, None, None, None, None, None),
 ('pk', None, None, None, None, None, None))
```
Composing SQL statements with variables, using parameter substitution
The cursor.execute() method takes an SQL command in the form of a string. Often we’ll want to compose one of these strings using variables.
It is noted quite near the top of the Python docs page for sqlite3 that simply constructing these strings in Python is vulnerable to injection attacks (and errors).
Instead we should use a placeholder (in the simplest case, a single ?) in the SQL string; the execute() method takes a second argument, a tuple, to substitute for the placeholder. The contents of the tuple get sanitized; that is, sqlite3 prevents input containing SQL keywords from being processed as SQL, which would change the command in potentially damaging ways.
While this intuitive way works, it’s not considered secure:
```python
# Don't do it this way
the_tag_string = "some_tag_name"
curs.execute("SELECT id from Tags WHERE name == '" + the_tag_string + "'").fetchone()
```
This returns a tuple containing the id of the tag named “some_tag_name” in my current database.
```python
# Yes, this gets sanitized
the_tag_tuple = "some_tag_name",
curs.execute("SELECT id from Tags WHERE name == ?", the_tag_tuple).fetchone()
```
This is safer. And it turns out you can make a singleton tuple out of a string just by putting a comma after it (no need for brackets).
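A quick check of that trailing-comma trick:

```python
the_tag_tuple = "some_tag_name",   # the comma alone makes this a 1-tuple
type(the_tag_tuple)                # <class 'tuple'>
len(the_tag_tuple)                 # 1
```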
Using placeholders for unknown numbers of variables
If I want to find the ids of multiple tags from their names, I use an SQL
SELECT statement with a
WHERE clause, using the
IN operator. It works similarly to the previous example, with the tag names supplied by a tuple argument to
execute(). The difference is that we may not know how many elements (in this case, how many tags) are in that tuple, but we have to find out, because each one needs its own explicit placeholder.
If there are two tags to find, and they live in a tuple called
opt_tags, we need two placeholders. The SQL would look like this:
```python
opt_tags = ("trees", "flowers")
curs.execute("SELECT id from Tags WHERE name IN (?,?)", opt_tags).fetchall()
```
If the tuple
opt_tags is the result of a previous
fetchall(), it’s probable that the number of elements in it is unknown. Moreover, there’s a good chance it’s also quite a large number, and hand-typing all those question marks would be very tedious.
I liked the solution by Stack Overflow user @Duncan for generating the right number of placeholders using the length of the tuple:
```python
curs.execute("SELECT id from Tags WHERE name IN (" + ",".join("?"*len(opt_tags)) + ")", opt_tags).fetchall()
```
Note that we have a problem if
opt_tags has more than 999 elements, because SQLite has a compiled-in limit of that many parameters in a query, to prevent excessive memory usage. Then I guess the query has to be broken up to deal with the tags in batches.
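One way to break the query up is to loop over slices of the tuple, building a fresh set of placeholders for each slice. This is only a sketch, using a throwaway in-memory database; the Tags table and tag names here are stand-ins for the real digiKam data, and the actual parameter limit depends on how your SQLite was compiled:

```python
import sqlite3

# Throwaway in-memory database with a stand-in Tags table
conn = sqlite3.connect(':memory:')
curs = conn.cursor()
curs.execute("CREATE TABLE Tags (id INTEGER PRIMARY KEY, name TEXT)")
curs.executemany("INSERT INTO Tags (name) VALUES (?)",
                 [(f"tag{i}",) for i in range(2500)])

opt_tags = tuple(f"tag{i}" for i in range(1500))  # more names than the old 999 limit

BATCH = 999  # stay under the compiled-in parameter limit
tag_ids = []
for start in range(0, len(opt_tags), BATCH):
    batch = opt_tags[start:start + BATCH]
    placeholders = ",".join("?" * len(batch))
    rows = curs.execute(
        "SELECT id FROM Tags WHERE name IN (" + placeholders + ")",
        batch).fetchall()
    tag_ids.extend(row[0] for row in rows)

len(tag_ids)  # one id per matching tag name
```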
Aside: Now I have learned the origin of the occasional error message I’ve seen from web pages or Android apps, containing long chains of question marks. If I’d known what that was about, I would have been even more reluctant to touch databases.
Formatting output

The fetchall() method returns a list of tuples. This may or may not be the most convenient format for whatever I plan to do with it next.
Adding one column at a time to a list
To get a simple list of query results, I can go row by row adding a specific (or the only) element of each tuple to a list, by index:
```python
opt_tags = ("trees", "flowers")
tag_ids = []
hits = curs.execute("SELECT id from Tags WHERE name IN (" + ",".join("?"*len(opt_tags)) + ")", opt_tags)
for row in hits:
    tag_ids.append(row[0])
tag_ids
```
Using row_factory to format output
Another way to tailor the output is to use the row_factory attribute of the connection. Setting the row_factory to a function that takes cursor and row as input and returns row[0] (the first column of row) makes the cursor return a bare value for each row:
```python
conn.row_factory = lambda cursor, row: row[0]
curs = conn.cursor()
```
I also seem to be able to set a
row_factory attribute right on the cursor, with similar results.
```python
curs.row_factory = lambda cursor, row: row[0]
tag_ids = curs.execute("SELECT id from Tags WHERE name IN (" + ",".join("?"*len(opt_tags)) + ")", opt_tags).fetchall()
tag_ids
```
fetchall() then yields a plain list, not a list of tuples.
Once this output behaviour is no longer wanted (for instance, if a different column or columns may be desired from the next query), it’s important to set row_factory back to the default, None:
```python
conn.row_factory = None
curs = conn.cursor()
```
```python
curs.row_factory = None
```
In this simple case, unless there are several queries in a row needing the same form of output, it seems cleaner just to specify the
0th element of each row rather than changing
row_factory twice. I haven’t looked into speed or memory impact; it really doesn’t matter for my use-case.
Setting row_factory to the sqlite3.Row class looks very powerful. I don’t need this specific type of power at this moment.
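For completeness, a small sketch of what sqlite3.Row gives you, again with a throwaway in-memory database standing in for the real one: rows become mapping-like objects whose columns can be read by name.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row   # built-in row factory
curs = conn.cursor()
curs.execute("CREATE TABLE Tags (id INTEGER PRIMARY KEY, name TEXT)")
curs.execute("INSERT INTO Tags (name) VALUES ('trees')")

row = curs.execute("SELECT id, name FROM Tags").fetchone()
row['name']   # columns accessible by name -> 'trees'
row.keys()    # column names -> ['id', 'name']
```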
That’s my collection of basic building blocks for a Python script or Jupyter notebook that reads some information from a local SQLite database file.
The next step for me is to do some more queries on my digikam4.db file, massage the output a bit, and spit it out into a file format that can be imported into a TiddlyWiki.