November 14, 2014

Leave a comment

Python: creating a sqlite3 database from CSV files

Recently I needed to do some data processing on a pile of CSV files exported from an Access database. Since the relational structure was fairly complex, I decided not to try to stitch it together with UNIX commands but rather to import the whole lot into a sqlite3 database on which to run queries.

Here’s the Python code, which is fairly straightforward. The only slightly exciting thing is that it detects columns whose names end in “_id” and adds an index for them, so that when you start doing join queries they’re not painfully slow.

The script traverses a directory full of CSV files. It takes two arguments; run it like so:

python csv2sqlite.py nameofdb directory

from __future__ import print_function
import sqlite3
import csv
import os
import glob
import sys

db = sys.argv[1]

conn = sqlite3.connect(db)
conn.text_factory = str  # allows utf-8 data to be stored

c = conn.cursor()

# traverse the directory and process each .csv file
for csvfile in glob.glob(os.path.join(sys.argv[2], "*.csv")):
    # remove the path and extension and use what's left as a table name
    tablename = os.path.splitext(os.path.basename(csvfile))[0]

    with open(csvfile, "rb") as f:
        reader = csv.reader(f)

        header = True
        for row in reader:
            if header:
                # gather column names from the first row of the csv
                header = False

                sql = "DROP TABLE IF EXISTS %s" % tablename
                c.execute(sql)
                sql = "CREATE TABLE %s (%s)" % (tablename,
                          ", ".join([ "%s text" % column for column in row ]))
                c.execute(sql)

                for column in row:
                    if column.lower().endswith("_id"):
                        index = "%s__%s" % ( tablename, column )
                        sql = "CREATE INDEX %s on %s (%s)" % ( index, tablename, column )
                        c.execute(sql)

                insertsql = "INSERT INTO %s VALUES (%s)" % (tablename,
                            ", ".join([ "?" for column in row ]))

                rowlen = len(row)
            else:
                # skip lines that don't have the right number of columns
                if len(row) == rowlen:
                    c.execute(insertsql, row)

        conn.commit()

c.close()
conn.close()

Leave a comment Cancel reply