Training Research Seminar - Jason Anastasopoulos - Friday, February 12th, 2021

Training Seminar Light Blue - Databases and Advanced Webscraping: Extracting and Organizing Data from the Internet.

  • Date: 12 February 2021 from 14:00 to 17:00

  • Event location: Online via Microsoft Teams

  • Access Details: Free admission

Introduction 

 

Python is rapidly becoming the preferred language of data scientists in both industry and academia. It’s used by Google, Facebook and other tech giants to perform data analysis and run machine learning algorithms that can handle hundreds of thousands of terabytes of data per day. 

 

Python can be used for:

  • Storing and analyzing large and small datasets. 
  • Web scraping and data collection using APIs.
  • Beautiful data visualization.
  • Natural language processing and text analysis. 
  • General machine learning.
  • Deep learning.
  • Image analysis and much, much more...

 

How you will benefit from this seminar

This seminar is an intermediate course on statistical computing with Python. The goal is to introduce participants to advanced data analysis and visualization applications of the Python language.

 

By the end of this seminar you will be able to work in the following areas:

  • Big data analysis and inference: Learn how to deal with massive data in Python.
  • Extracting and Organizing Data from the Internet: Scrape and parse data from the internet, using APIs and working with formats including HTML, XML, and JSON.
  • Databases: Create and extract information from SQL and MongoDB databases with Python.
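
As a rough illustration of the parsing side of these skills, the sketch below extracts fields from a JSON payload and links from an HTML fragment using only the Python standard library. The sample data, the "results" key, and the names titles_from_json, LinkCollector and links_from_html are all hypothetical and are not taken from the seminar materials; in practice the text would come from an HTTP request to a real API or web page.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response body (assumption: records live under "results").
SAMPLE_JSON = (
    '{"results": [{"id": 1, "title": "First post"},'
    ' {"id": 2, "title": "Second post"}]}'
)

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = '<ul><li><a href="/a">Alpha</a></li><li><a href="/b">Beta</a></li></ul>'


def titles_from_json(body):
    """Return the title of every record in a JSON response body."""
    payload = json.loads(body)
    return [record["title"] for record in payload["results"]]


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def links_from_html(markup):
    """Return every link target found in an HTML string."""
    collector = LinkCollector()
    collector.feed(markup)
    return collector.links


if __name__ == "__main__":
    print(titles_from_json(SAMPLE_JSON))
    print(links_from_html(SAMPLE_HTML))
```

The same two functions could be pointed at text downloaded from a live service; only the source of the strings would change.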

 

WHO SHOULD ATTEND

This seminar is designed for students who already have basic programming skills in Python and want to learn more advanced applications typically used by data scientists and academic researchers.

 

This course assumes that you have already completed Python for Data Analysis or a similar introductory Python course.

 

COMPUTING

This is a hands-on class that will involve at least two hours of structured and supervised assignments. To ensure that you are prepared, you must do the following BEFORE the first class:

 

You should also know how to access the command prompt (Windows users) or the terminal (Mac users). We will briefly review how to access these in class, but it will save you time and effort if you come already knowing these basics. Resources for getting started with the Windows Command Prompt or the Mac Terminal are available online.

 

MATERIALS

Participants receive access to a private repository containing all of the lecture notes, code and data needed for the class.

 

Participants interested in getting a jump start on some of the material should consider reading the “Python Data Science Handbook” by Jake VanderPlas. This book is not required, but it is recommended as optional reading and a useful reference.


SEMINAR OUTLINE

     I. Web Scraping Data from the Internet with APIs:

         °  Application programming interfaces (APIs) and streaming data.

         °  Extracting and building datasets from the Internet using APIs and JSON.

         °  Building datasets from the Internet with HTML.

    II. Storing and Retrieving Data with Databases:

         °  Introduction to SQL.

         °  Introduction to MongoDB.

         °  Using MongoDB and SQL to store and retrieve data.
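
To give a flavour of the second part of the outline, here is a minimal sketch of storing and retrieving records with SQL from Python. It uses the standard-library sqlite3 module and an in-memory database as an assumption for illustration; the seminar may use a different SQL engine. The table name, the sample records, and the functions build_database and titles_matching are hypothetical. A MongoDB version is not shown because it needs a running server and a third-party driver.

```python
import sqlite3

# Hypothetical records, e.g. rows built from scraped data.
RECORDS = [(1, "First post"), (2, "Second post"), (3, "Third post")]


def build_database(records):
    """Create an in-memory SQLite database and load the records into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
    )
    # Placeholders (?) keep the values separate from the SQL text.
    conn.executemany("INSERT INTO posts (id, title) VALUES (?, ?)", records)
    conn.commit()
    return conn


def titles_matching(conn, pattern):
    """Return titles matching a SQL LIKE pattern, ordered by id."""
    rows = conn.execute(
        "SELECT title FROM posts WHERE title LIKE ? ORDER BY id", (pattern,)
    )
    return [title for (title,) in rows]


if __name__ == "__main__":
    connection = build_database(RECORDS)
    print(titles_matching(connection, "%ir%"))
    connection.close()
```

Replacing ":memory:" with a file path would persist the data between runs.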