Toolkit#

What is Data Science?#

The term data scientist was coined as recently as 2008 when companies realized the need for data professionals who are skilled in organizing and analyzing massive amounts of data. In a 2009 McKinsey & Company article, Hal Varian, Google’s chief economist and UC Berkeley professor of information sciences, business, and economics, predicted the importance of adapting to technology’s influence and reconfiguration of different industries.

UC Berkeley School of Information Online

Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results.

Amazon Web Services

Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data. These insights can be used to guide decision making and strategic planning.

IBM

Data Scientist: The Sexiest Job of the 21st Century

Harvard Business Review - 2012

Operating System#

An operating system (OS) is system software that manages computer hardware and software resources, and provides common services for computer programs. - Wikipedia

Python#

Python is powerful… and fast;
plays well with others;
runs everywhere;
is friendly & easy to learn;
is Open.

https://www.python.org/about/

Python is a high-level, interpreted programming language that was first released in 1991. It was designed to be easy to read and write, with a simple and consistent syntax that emphasizes readability and reduces the cost of program maintenance. It is a general-purpose language that can be used for a wide variety of tasks, including web development, data analysis, scientific computing, artificial intelligence, and more.

Python’s popularity is due to several factors. Firstly, it has a large and active community of developers who contribute to its development, provide support, and create libraries and tools for various applications. This community has helped to make Python one of the most widely used programming languages in the world.

Secondly, Python’s simplicity and readability make it a great language for beginners and experienced programmers alike. It is easy to learn and use, and it allows developers to focus on solving problems rather than worrying about syntax and technical details.
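As a small illustration of that readability, here is a minimal sketch that finds the most frequent words in a sentence. The function name `most_common_words` and the sample sentence are made up for this example; only the standard library is used.

```python
from collections import Counter


def most_common_words(text, n=3):
    """Return the n most frequent words in text, ignoring case."""
    words = text.lower().split()
    return Counter(words).most_common(n)


sentence = "the quick brown fox jumps over the lazy dog the end"
print(most_common_words(sentence, n=1))  # [('the', 3)]
```

Even without knowing Python, it is fairly easy to guess what each line does: lowercase the text, split it into words, count them, and return the top entries.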

Thirdly, Python has a vast ecosystem of libraries and frameworks. They give developers ready-made tools and features for building complex applications quickly and efficiently.

Finally, Python’s versatility and flexibility make it suitable for many kinds of work, from scientific computing to web development to data analysis and beyond. Together with its ease of use and strong community, this versatility explains much of its popularity.


Virtual Environments#

A virtual environment is a tool that allows users to create an isolated and self-contained environment on their computer, where they can install and run software without affecting their system’s global settings or other installed software.

In non-technical terms, a virtual environment is like a “sandbox” on your computer, where you can experiment with different software and tools without worrying about accidentally breaking anything on your computer.

Common tools for creating isolated environments include:

  • Conda / Anaconda / Mamba / Micromamba

  • pip + virtualenv

  • Docker

Don’t worry, we won’t use virtual environments in these lessons, but it is good to know that they exist.
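For the curious, Python’s standard library also ships a `venv` module that can create a virtual environment. The following is a minimal sketch, assuming a standard Python 3 installation; the directory name `demo-env` is arbitrary and the environment is created in a temporary folder so that nothing else on the computer is touched.

```python
import tempfile
import venv
from pathlib import Path

# Create the environment inside a throwaway directory.
# "demo-env" is just an illustrative name for this sketch.
base_dir = Path(tempfile.mkdtemp())
env_dir = base_dir / "demo-env"

# with_pip=False skips installing pip, which keeps the example fast.
venv.create(env_dir, with_pip=False)

# Every environment created by venv contains a pyvenv.cfg file
# that records which Python installation it is based on.
config_file = env_dir / "pyvenv.cfg"
print(config_file.exists())
print(config_file.read_text())
```

Packages installed into such an environment live inside its own folder, which is what keeps them separate from the rest of the system.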

Project Jupyter#

Free software, open standards, and web services for interactive computing across all programming languages.

Project Jupyter is a non-profit, open-source project, born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing across all programming languages. Jupyter will always be 100% open-source software, free for all to use and released under the liberal terms of the modified BSD license.

https://jupyter.org/about

Jupyter Notebooks#

The notebook extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results. The Jupyter notebook combines two components:

A web application: a browser-based tool for interactive authoring of documents which combine explanatory text, mathematics, computations and their rich media output.

Notebook documents: a representation of all content visible in the web application, including inputs and outputs of the computations, explanatory text, mathematics, images, and rich media representations of objects.

In non-technical terms, a Jupyter Notebook is like a digital notebook that lets you write code, view results, and explain your thought process all in one place. You can use it to write programs in popular programming languages like Python, R, and Julia, and see the output of your code immediately in the notebook.

https://jupyter-notebook.readthedocs.io/en/latest/notebook.html#introduction
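A notebook document is stored on disk as a JSON file with the `.ipynb` extension. The sketch below builds a very small notebook by hand using only the standard `json` module, assuming version 4 of the notebook format; the file name `hello.ipynb` and the cell contents are invented for illustration.

```python
import json
import tempfile
from pathlib import Path

# A minimal notebook: one markdown cell and one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Hello\n", "Some explanatory text."],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # the cell has not been run yet
            "metadata": {},
            "outputs": [],
            "source": ["print(1 + 1)"],
        },
    ],
}

# Write the notebook to a temporary folder, then read it back.
path = Path(tempfile.mkdtemp()) / "hello.ipynb"
path.write_text(json.dumps(notebook, indent=1))

loaded = json.loads(path.read_text())
for cell in loaded["cells"]:
    print(cell["cell_type"], "->", "".join(cell["source"]))
```

In practice you would rarely write this JSON yourself; the Jupyter web application creates and updates it for you as you edit and run cells.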


JupyterLab#

JupyterLab is the latest web-based interactive development environment for notebooks, code, and data. Its flexible interface allows users to configure and arrange workflows in data science, scientific computing, computational journalism, and machine learning. A modular design invites extensions to expand and enrich functionality.


Jupyter Book#

Build beautiful, publication-quality books and documents from computational content.

https://jupyterbook.org/

Jupyter Book is an open-source tool for building publication-quality books and documents from computational material.

Jupyter Book allows users to

  • write their content in markdown files or Jupyter notebooks,

  • include computational elements (e.g., code cells) in either type,

  • include rich syntax such as citations, cross-references, and numbered equations, and

  • using a simple command, run the embedded code cells, cache the outputs and convert this content into:

    • a web-based interactive book and

    • a publication-quality PDF.

executablebooks/jupyter-book

Fun fact

This website is made with Jupyter Book!

Google Colab#

Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary Python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing access free of charge to computing resources including GPUs.

https://research.google.com/colaboratory/faq.html

What is the difference between Jupyter and Colab?

Jupyter is the open source project on which Colab is based. Colab allows you to use and share Jupyter notebooks with others without having to download, install, or run anything.

Git & GitHub#

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside 30 million developers.

Git allows you to keep track of every change made to your code and easily revert to a previous version if needed. Think of it like a “time machine” for your code.

GitHub, on the other hand, is a website and platform that hosts Git repositories in the cloud. It provides an online space for developers to store and collaborate on their code projects, as well as to share their code with others.

In summary, Git is a tool for version control, while GitHub is a platform that makes it easier to use Git for collaboration and sharing code.

Alternatives to GitHub include GitLab and Bitbucket.

Fun fact

Some of the paragraphs in this lesson were written by ChatGPT!

However, we checked its answers and edited some of them slightly.

Can you guess which ones were written by this artificial intelligence?