Artificial Intelligence (AI)
Here are some useful resources on Artificial Intelligence, categorized by topic.
Natural Language Processing Fundamentals
- Word embeddings
- fastText by Facebook Research (GitHub, paper).
- GloVe
- Word2Vec
- Polyglot by Rami Al-Rfou.
Machine Learning Fundamentals
- Visualization of optimizer algorithms & which optimizer to use by Sebastian Ruder. TL;DR: “adaptive learning-rate methods, i.e. Adagrad, Adadelta, RMSprop, and Adam are most suitable and provide the best convergence for these scenarios. (don’t use vanilla SGD)”
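To make the difference between vanilla SGD and an adaptive learning-rate method concrete, here is a minimal pure-Python sketch of the two update rules. The function names (`sgd_step`, `adam_step`, `init_adam_state`) and the toy objective are my own assumptions for illustration, not code from Ruder's article; the Adam step follows the standard published update rule with its usual default hyperparameters.

```python
import math


def sgd_step(params, grads, lr=0.01):
    """Vanilla SGD: every parameter moves by the same learning rate times its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]


def init_adam_state(n):
    """Per-parameter running averages used by Adam (hypothetical helper)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: the step is scaled per parameter by a running
    estimate of the gradient's second moment, with bias correction."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - beta2 ** t)  # bias-corrected second moment
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


# Toy objective f(x) = x^2, whose gradient is 2x (chosen only for illustration).
x = [5.0]
state = init_adam_state(len(x))
for _ in range(200):
    x = adam_step(x, [2 * x[0]], state, lr=0.1)
```

Note that on the very first Adam step the update is roughly `lr * sign(gradient)`, regardless of the gradient's magnitude. That per-parameter rescaling is what "adaptive learning rate" refers to.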
Machine Learning for Natural Language Processing
- Oxford Course on Deep Learning for Natural Language Processing. [GitHub] [YouTube playlist (unofficial)]
A very good free course with Python exercises. It covers the basics of both machine learning and natural language processing, then shows how to apply them.
Natural Language Processing (NLP) Tools & Software (mostly in Python)
- NLTK (Natural Language Toolkit): tokenization, POS tagging, named entity identification, … Book: Natural Language Processing with Python.
- AFNER (Named Entity Recognition).
- Pretrained word vectors for 294 languages (from Wikipedia) by Facebook Research.
- Snowball (Word Stemmer).
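The pretrained fastText word vectors listed above are distributed, among other formats, as plain-text `.vec` files: the first line holds the vocabulary size and the vector dimension, and each following line holds a word followed by its vector components. Below is a minimal sketch for reading that format with only the standard library. The helper names (`load_vec`, `cosine`) and the tiny in-memory sample are my own assumptions for illustration.

```python
import io
import math


def load_vec(handle, limit=None):
    """Read word vectors from a text-format .vec file handle.

    Returns (dim, {word: [float, ...]}). `limit` caps the number of
    words read, which helps when the full file does not fit in memory.
    """
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so the last `dim` fields are the numbers.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up 2-dimensional sample in the same layout as a real .vec file.
sample = io.StringIO("3 2\nkucing 1.0 0.0\nanjing 0.8 0.6\nmobil 0.0 1.0\n")
dim, vectors = load_vec(sample)
similarity = cosine(vectors["kucing"], vectors["anjing"])
```

For a real file you would pass an open file handle instead of the sample, e.g. `open("wiki.id.vec", encoding="utf-8")`, where the file name is an assumption about what you downloaded. Use `limit` to load only the most frequent words first.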
Text Corpus in Indonesian Language
- Wikipedia Dumps. For Indonesian, use idwiki. My preferred format is “Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream”, so you can process the dump in smaller chunks.
- Kumpulan Korpus Bahasa Indonesia by Herry Sujaini (2013)
- Are there any Indonesian corpora available online? – Quora
- freq-dist-id by Jim Geovedi/Ardwort. Data repository used in the paper “Perbandingan distribusi frekuensi kata bahasa Indonesia di Kompas, Wikipedia, Twitter, dan Kaskus”.
- Lanin, I., Geovedi, J., & Soegijoko, W. (2013). Perbandingan distribusi frekuensi kata bahasa Indonesia di Kompas, Wikipedia, Twitter, dan Kaskus. In Proceedings of Konferensi Linguistik Tahunan Atma Jaya Kesebelas (KOLITA11) (pp. 249-252).
- Indonesian Wordlist by Jim Geovedi. Only useful for password cracking, not linguistic research.
- Tagged text corpus with Named Entity Recognition by Yohanes Gultom.
- LAPOR.go.id (Bachelor thesis paper by Chyntia Megawati)
- Korpus Plagiarisma Indonesia (2016) by Felik Junvianto.
- Indonesian Manually POS Tagged Corpus (2016) by Fam Rashel.
- SEAlang Library Indonesian (not downloadable).
- Article: Korpus Daring Bahasa Indonesia by Wahyu Adi Putra Ginting (2010).
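Regarding the multistream Wikipedia dump format mentioned in the list above: such a file is a concatenation of independent bz2 streams, so it can be decompressed one stream at a time instead of all at once. Here is a minimal sketch using only the standard library. The function name `iter_bz2_streams` and the in-memory sample are my own assumptions; with a real dump, each yielded chunk would typically be an XML fragment rather than a complete document, so it needs further parsing.

```python
import bz2
import io


def iter_bz2_streams(fileobj, chunk_size=64 * 1024):
    """Yield the decompressed bytes of each bz2 stream in a
    concatenated (multistream) file, one stream at a time."""
    decompressor = bz2.BZ2Decompressor()
    pieces = []
    pending = b""
    while True:
        # Use leftover bytes from the previous stream before reading more.
        data = pending or fileobj.read(chunk_size)
        pending = b""
        if not data:
            break  # an unfinished (truncated) trailing stream is dropped
        pieces.append(decompressor.decompress(data))
        if decompressor.eof:
            yield b"".join(pieces)
            pieces = []
            # Bytes after the end of this stream belong to the next one.
            pending = decompressor.unused_data
            decompressor = bz2.BZ2Decompressor()


# Made-up stand-in for a dump: two independently compressed streams.
payload = bz2.compress(b"<page>1</page>") + bz2.compress(b"<page>2</page>")
chunks = list(iter_bz2_streams(io.BytesIO(payload), chunk_size=7))
```

With a downloaded dump you would pass `open(path, "rb")` instead of the `BytesIO` object, where `path` points to whatever multistream file you fetched. Memory use then stays proportional to one stream rather than the whole dump.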
Natural Language Processing for Indonesian Language
- Pelabelan Corpus Bahasa Indonesia / Indonesian POS Tagging – INACL
- Kumpulan thesis, paper, dan artikel tentang NLP (Natural Language Processing) Bahasa Indonesia (2014) by Andy Librian (maker of Sastrawi)
- Word stemmer
- Sastrawi (2017). PHP stemmer for the Indonesian language. Python version by Hanif Amal Robbani (available via pip; possibly slower).
- Porter Stemmer for Indonesian language (2015) by Adinda Praditya. In Ruby. (paper)
- Pretrained word vectors in Indonesian (from Wikipedia) by Facebook Research.