Nidaba is a data analytics project devoted to analysing the Python questions and answers on Stack Overflow.
Using the Stack Exchange Data Dump and the Stack Exchange API, we have extensive, near real-time access to the Stack Overflow Python data. The Python tag currently has over 300,000 questions, with several hundred more asked every day. The aim of Nidaba is to take this information and make the most of it.
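As a rough illustration of the API side of that access, the sketch below builds a request URL for the most recent Python-tagged questions. It only constructs the URL and makes no network call. The function name `python_questions_url` and the default page size are assumptions made for this example, not part of Nidaba's code; the query parameters follow the public Stack Exchange API `/questions` route.

```python
from urllib.parse import urlencode

# Root of the public Stack Exchange API (version 2.3).
API_ROOT = "https://api.stackexchange.com/2.3"


def python_questions_url(page=1, page_size=50):
    """Build a URL listing the newest Python-tagged questions.

    Hypothetical helper for illustration only; it does not send a request.
    """
    params = {
        "site": "stackoverflow",  # which Stack Exchange site to query
        "tagged": "python",       # restrict results to the Python tag
        "sort": "creation",       # order by creation date...
        "order": "desc",          # ...newest first
        "page": page,
        "pagesize": page_size,
    }
    return f"{API_ROOT}/questions?{urlencode(params)}"


url = python_questions_url()
print(url)
```

Fetching the URL with any HTTP client returns a JSON document whose `items` field holds the questions; rate limits and API keys are left out of this sketch.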
Where does Nidaba come from?
Nidaba (or Nisaba) was the Sumerian goddess of writing, teaching, and the harvest. Her main role was as the scribe of the Gods and keeper of records.
The aim of Stack Overflow as a whole is to build a library of answers to every possible question about programming. When choosing a deity to name such a project after, what better one than the scribe of the Gods and keeper of records?
What we want to do
A big part of Nidaba will be data analysis. Broadly, there will be two kinds: analysis that is directly actionable (e.g. finding duplicates) and analysis that is informative (e.g. trends in the sopython data).
Some ideas we have going forward are:
- Trends in the Python questions/answers with respect to time;
- Highlighting famous questions and answers;
- Finding interesting, hidden gems and shining a spotlight on them;
- Suggesting possible duplicate questions automatically based on similar content;
- Predicting the likelihood of closure of questions based on their quality;
- Identifying spam questions so they can be quickly closed and deleted;
- "Fastest Gun in the West" (FGITW) analysis, by keeping track of edits made inside the grace period;
- Looking at "relationships" of people that interact via questions/answers/comments.
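To make the duplicate-suggestion idea from the list above more concrete, here is a minimal sketch that flags pairs of question titles with heavily overlapping words. It uses plain Jaccard similarity on word sets from the standard library, which is a deliberately simple stand-in and not necessarily the method Nidaba will use. The function names, the sample titles, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re
from itertools import combinations


def tokens(text):
    """Lower-case a title and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_duplicates(questions, threshold=0.5):
    """Return (id_a, id_b, score) for title pairs at or above the threshold.

    `questions` maps a question id to its title. Every pair is compared,
    so this is only practical for small samples.
    """
    token_sets = {qid: tokens(title) for qid, title in questions.items()}
    pairs = []
    for (id_a, set_a), (id_b, set_b) in combinations(token_sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Made-up sample titles, not real Stack Overflow data.
sample = {
    1: "How do I reverse a list in Python?",
    2: "How to reverse a list in Python",
    3: "What does the yield keyword do?",
}
candidates = suggest_duplicates(sample)
print(candidates)
```

With these sample titles, only questions 1 and 2 are suggested as possible duplicates. A fuller approach would likely weight rare words more heavily (for example with TF-IDF vectors from scikit-learn) and compare question bodies as well as titles.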
As devoted Pythonistas, we have chosen to do the majority of our work in Python. Some of the technologies we are using (or plan to use) include:
- Flask — Flask is a lightweight web framework written in Python. The sopython website is built using Flask.
- MongoDB — MongoDB is a NoSQL database used to host the Stack Overflow data for Project Nidaba.
- PostgreSQL — PostgreSQL is an open source object-relational database system that is used to hold data for the sopython website.
- NumPy — NumPy is the fundamental package for use in numerical Python work.
- SciPy — SciPy is a Python-based ecosystem of open source software for mathematics, science, and engineering.
- pandas — pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools. It will be used extensively in analysing the Stack Overflow data for Project Nidaba.
- scikit-learn — scikit-learn is an open source machine-learning library for the Python programming language. scikit-learn will be used extensively in trying to analyse Stack Overflow questions and answers and find common trends.
- matplotlib — matplotlib is an extensive 2D plotting library which can be used to help visualise the analysis.
How to help
You can find the source code for Nidaba at our GitHub page.
You can find out what we're working on at the moment and what we've got planned for the future at our Trello page.