2013-08-10

Parsing package dependencies

So I'm dealing with package dependencies such as Debian's, where , delimits AND dependencies (i.e. all of these) and | OR dependencies (i.e. any of these). I have some naive code that works, but inspired with wisdom from Norvig's CS212, I spark to refactoring it.
from collections import namedtuple
Dependency = namedtuple('Dependency', 'name relation version')
 
def parse_dependency_line_v1(line):
  """Returns list of list of Dependency namedtuples parsed from line."""
  and_dependencies = line.split(',')
  for i, and_dependency in enumerate(and_dependencies):
    or_dependencies = and_dependency.split('|')
    for j, package in enumerate(or_dependencies):
      package = package.strip()
      try: name, version = package.split(None, 1)
      except ValueError:
        name, relation, version = package, '>=', '~~'
      else: relation, version = version.strip('()').split()
      or_dependencies[j] = Dependency(name, relation, version)
  return and_dependencies
After half an hour or so, I have the refactored version:
import re
from operator import itemgetter

_name_rel_version = re.compile('(\S+)(\s+\(\s*([<=>]+)\s+(\S+)\s*\))?')

def _parse_single_dependency(dependency):
  dependency = _name_rel_version.search(dependency).groups()
  dependency = itemgetter(0, 2, 3)([i or default for i, default in zip(dependency, (None, None, '>=', '~~'))])
  return Dependency(*dependency)
 
def parse_dependency_line_v2(line):
  """Returns list of list of Dependency namedtuples parsed from line."""
  return list(map(_parse_single_dependency, i.split('|')) for i in line.split(','))
While the latter code may look uglier to the uninitiated, it should eventually prove easier to maintain. And it's functional! So how valuable is it performance-wise?
>>> line = 'libsfst1-1.4 (= 1.4.6g-3) | eclipse-cdt-profiling-framework, eclipse-rse | libc6 (>= 2.6-1), libgcc1 (>= 1:4.2-20070516), libsdl-image1.2 (>= 1.2.5), libsdl1.2debian (>= 1.2.10-1), libstdc++6 (>= 4.2-20070516), pipenightdreams-data (= 0.10.0-13) | libreoffice-l10n-mn (>= 1:3.4.0~) | libsalck3 (= 1.1.4-4.1) | adduser, dvb-apps, libc6 (>= 2.7) | libsvm3-java | debconf (>= 0.2.80) | debconf-2.0, adduser (>= 3.11)'
>>>
>>> %timeit parse_dependency_line_v1(line)  # naive, readable code
10000 loops, best of 3: 79.6 us per loop
>>> %timeit parse_dependency_line_v2(line)  # functional code
10000 loops, best of 3: 128 us per loop
Oh. It appears the heavily-inlined code runs slower than the naive implementation. Bummer.

After a while, realizing that Python 3 is the future (of Python), I decide to test it. What I'm building is supposed to be forward-compatible anyway.

It turns out that in Python 3, the functional code (v2) runs 6× faster than the naive implementation (v1), that's 13.8 µs per loop! :D

FWIK, they must have significantly improved the static optimization in CPython. The cProfiler counts 1,410,003 function calls with python2 whereas only 230,004 function calls with python3.

(The naive implementation takes roughly the same in python3, around 83 µs.)

It will pay to be functional. Good job, everyone!

No comments:

Post a Comment