Add third-party Python libraries


How do I install or pre-configure Seahorse so that third-party Python libraries are available in the Python notebooks?


In Spark, to install a multi-file Python module, you have to provide it in .zip or .egg format.

Some Python modules need a bit of extra work to package this way.

For example for nltk, you can download the code from here:

Then unpack it and zip the nltk directory that sits inside the unpacked nltk-3.2.1 directory.

For example, on Linux:

tar xzf nltk-3.2.1.tar.gz
cd nltk-3.2.1
zip -r …/ nltk
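If you would rather build the archive from Python, the packaging step could be sketched roughly as below. This is only a sketch: the helper name `zip_package` and its parameters are made up for illustration, and it assumes the package directory contains an `__init__.py`.

```python
import shutil
from pathlib import Path


def zip_package(source_root, package_name, out_dir):
    """Zip source_root/package_name so the package folder sits at the archive root.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    source_root = Path(source_root).resolve()
    if not (source_root / package_name / "__init__.py").is_file():
        raise ValueError(f"{package_name!r} is not a package inside {source_root}")
    # make_archive appends ".zip" to base_name and returns the final path.
    # root_dir/base_dir make every entry start with "<package_name>/".
    archive = shutil.make_archive(
        base_name=str(Path(out_dir).resolve() / package_name),
        format="zip",
        root_dir=str(source_root),
        base_dir=package_name,
    )
    return Path(archive)
```

For the nltk case this would be called with the unpacked nltk-3.2.1 directory as `source_root` and `"nltk"` as `package_name`.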

Now, to import such a module in Seahorse, you have to put the zipped module in a directory that is visible to Seahorse.

This is the data directory located next to your docker-compose.yml (or Vagrantfile).
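Before copying the archive there, it can be worth checking that the package folder sits at the root of the zip, because Python looks for `<package>/__init__.py` at the top level of an archive on its import path. A minimal check might look like this (the function name is hypothetical):

```python
import zipfile


def has_top_level_package(zip_path, package_name):
    """Return True if package_name/__init__.py is at the root of the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return f"{package_name}/__init__.py" in zf.namelist()
```

If this returns False for your archive, the usual cause is zipping the outer nltk-3.2.1 directory instead of the nltk directory inside it.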

Now, to use this module in Seahorse Notebook, you can execute:

and then:

import nltk

The same can be done in a Python transformation.
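One way to register a zipped module from a notebook is PySpark's `SparkContext.addPyFile`. The sketch below assumes the notebook exposes the Spark context as `sc`; the wrapper name `add_zip_to_spark` and the archive name `nltk.zip` are hypothetical placeholders, so substitute whatever file you actually placed in the data directory.

```python
def add_zip_to_spark(sc, zip_path):
    """Ship a zipped package to the executors so it can be imported in jobs.

    `sc` is assumed to be a pyspark SparkContext (or anything with addPyFile).
    """
    # SparkContext.addPyFile accepts a path to a .py, .zip, or .egg file.
    sc.addPyFile(zip_path)
    return zip_path


# In a notebook cell (hypothetical file name):
# add_zip_to_spark(sc, "/resources/data/nltk.zip")
# import nltk
```

Wrapping the call in a small function is only a convenience; calling `sc.addPyFile(...)` directly in a cell has the same effect.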

Alternatively, if you don’t want to do this every time, you can add the following line to your cluster preset, in the custom settings section:

--conf spark.submit.pyFiles=/resources/data/

I hope that these instructions are clear and that this will work for you!


Fantastic details! Very helpful. The last part with spark.submit.pyFiles via the data directory was what I was looking for.