Toolbox

During your Coli studies you will run into many difficulties, both theoretical and technical. Thankfully, the World Wide Web offers a multitude of resources that can help you deal with them.
This list is an attempt to give an overview of those resources, many of which we have already benefited from.
There are different kinds of resources:

  • Extensive introductions to a field, for which you should schedule some time.
  • Web tools for quick calculations or drawings.
  • Compendium-like overviews.
  • Software libraries.

Coli Infrastructure

Version Control

Since git is the most widely used version control system today, the Gruppe Technik maintains a GitLab instance. GitLab is similar to GitHub: you can create your own projects and collaborate with others.

Corpora, Parser etc.

The Wiki page for available resources explains everything necessary. (At least in German.)

Servers

Servers for students and servers open to students:

  • ella.cl.uni-heidelberg.de: Computing server; 32 cores, 125 GiB main memory.
  • last.cl.uni-heidelberg.de: Computing server; 40 cores, 504 GiB main memory.

GT Tutorial

The Gruppe Technik has collected many interesting facts about the infrastructure on its Wiki page.

Coli News

  • Hackernews provides news for hackers, including news on machine learning and even Python.
  • /r/Python/ is a subreddit with news about Python, e.g. new versions, new modules or projects realised with Python.
  • /r/MachineLearning/ is all about machine learning, e.g. new algorithms, new implementations or discussions.

Web Tools

Tutorials and Overviews

Programming and Automation

Bash

Even though the shell (and especially Bash) is important for daily data processing and programming tasks, there is no introduction to the field (except the resource course). Self-study is recommended.

Computational Linguistics

  • The Natural Language Processing with Python page is a tutorial for working with NLTK.
  • The YouTube channel Sentdex highlights what is possible with Python (stock market prediction, machine learning, but also NLP).
  • The YouTube channel Sirajalogy reports on new research, like Reinforcement Learning or GANs, in an easy-to-digest format for newcomers.

Writing

  • The LaTeX for Linguists page gives an overview of many LaTeX packages relevant to linguistics.
  • A student from Heidelberg provides a good and simple Introduction to LaTeX, for which every lecture and assignment is online. Recommended both for working through completely and for quick reference.

Python Libraries

Here are a few interesting libraries, so you don’t have to implement their functionality yourselves.

String Matching

  • fuzzywuzzy: fuzzywuzzy is a library for loose string matching (Levenshtein) as well as filtering.
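
To make the idea behind loose string matching concrete, here is a minimal pure-Python sketch of the Levenshtein distance. It is not fuzzywuzzy's implementation and the library computes its scores differently; the function names `levenshtein` and `similarity` are made up for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a score between 0 and 1 (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("Heidelberg", "Heidelburg"))
```

A score like this can be used to rank candidate strings against a query and keep only those above a chosen threshold.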

Language Detection

  • langdetect: langdetect is a quick and easy way to find out which language a text is written in.

Statistical Modules

  • statsmodels: statsmodels includes many features for statistical analysis of data, like correlation.
  • sklearn: sklearn implements many machine learning algorithms.
  • scipy: scipy implements features for scientific computing, like correlation tests or common similarity measures.
  • sympy: sympy provides a more natural syntax for mathematical functions in Python, which you can then simplify, differentiate, integrate or evaluate for given numbers.
  • pandas: pandas simplifies working with tabular data.
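
As a small illustration of the sympy workflow mentioned in the list, the following sketch simplifies, differentiates, integrates and evaluates one expression. The expression itself is an arbitrary example chosen for this page.

```python
import sympy as sp

x = sp.symbols("x")

# An arbitrary example expression; sin^2 + cos^2 collapses to 1.
expr = sp.sin(x) ** 2 + sp.cos(x) ** 2 + x ** 2

simplified = sp.simplify(expr)          # symbolic simplification
derivative = sp.diff(simplified, x)     # derivative with respect to x
integral = sp.integrate(simplified, x)  # antiderivative (no constant added)
value = simplified.subs(x, 3)           # evaluate at x = 3

print(simplified, derivative, integral, value)
```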

Text Processing

  • nltk: nltk includes many text processing features.
  • spacy: spacy implements state-of-the-art algorithms for POS tagging, dependency parsing and NER.
  • textblob: Your go-to for English text processing, like sentiment analysis.
  • textblob-de: Like textblob, but for German.
  • polyglot: POS tagging and sentiment analysis in over 100 languages.

Scraping

  • requests: Simplifies working with HTTP requests.
  • bs4: Easy parsing for XML and HTML.
  • scrapy: Framework for programming scrapers.
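
A minimal sketch of HTML parsing with bs4 follows. The HTML snippet and the example.org URLs are invented for illustration; in practice the markup would typically come from something like `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Invented sample page; a real scraper would download this with requests.
html = """
<html><body>
  <h1>Toolbox</h1>
  <ul>
    <li><a href="https://example.org/nltk">nltk</a></li>
    <li><a href="https://example.org/spacy">spacy</a></li>
  </ul>
</body></html>
"""

# "html.parser" is the parser from the Python standard library.
soup = BeautifulSoup(html, "html.parser")

title = soup.h1.get_text()
# Collect (link text, target URL) pairs for every <a> element.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```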

Homepages

  • flask: Web framework that leaves you a lot of freedom. Recommended above all for small projects or unusual use cases.
  • django: Web framework with a more rigid structure than flask. Useful for bigger projects, especially when using a database.

Deep Learning

  • keras: Framework for easy creation of neural nets. Abstraction layer on top of Tensorflow or Theano.
  • tensorflow: Tensorflow is a platform-independent open-source library for artificial intelligence, used among other things for language processing and image recognition.
  • pytorch: Pytorch is similar to tensorflow but integrates better with Python; for example, ordinary Python for loops can be used directly.

Text Extraction

  • textract: Gets texts out of PDFs.

Awesome Stuff

  • jupyter-notebook: Module to combine code and text, well suited for teaching. Jupyter-notebook is a web-based interactive mode of Python in which, for example, images can be embedded.

Visualisation

  • matplotlib: The classic visualisation library for Python.
  • seaborn: Add-on for matplotlib with nice defaults and automatic integration of pandas DataFrames.