Web Scraping

How to extract a table’s content with Octoparse and apply clustering analysis

Illustration by author

Disclaimer: This article is only for educational purposes. We do not encourage anyone to scrape websites, especially those web properties that may have terms and conditions against such actions.

Climate change and global warming are terms you come across every day. Both are linked to one figure: 51 billion tons of greenhouse gases, produced by all the countries of the world. The data is available on a Wikipedia page, in a table called List of countries by greenhouse gas emissions.

You are probably asking yourself why scraping data about Greenhouse…


A first step to take before scraping a website using Python

Computer on desk
Photo by Carl Heyerdahl on Unsplash.

I discovered web scraping while working towards my master’s degree in Data Science. It wasn’t one of my courses, but I helped a friend with a project about this topic in her study program. It was hard to understand what basics I needed to solve this enigma. At the same time, the more difficult I found the task, the more compelled I felt to solve the mystery.

What is web scraping? Look at the words. Web refers to a website, while scraping is about the extraction of data. By merging the two words, you can understand the real meaning: extracting…


Data Visualization, Opinion

Image by author

It’s the right moment to move on to other tools to visualize your data. Do you know Matplotlib? Forget it. It may be easy to use and light on memory, but it’s hard to observe how features change over time from static graphs.

By chance, when I started my Data Science internship, I began to use a fabulous Python library. I never dreamt that something like this would be possible. It’s called Plotly Express. You can finally interact with the graphs. …


Business Intelligence, Data Science

Understanding the differences between Data Warehouse and Data Lake

Illustration by author

I am currently encountering Business Intelligence during my Data Science internship. But it’s not the first time: it also came up in a previous internship, in a different role at an ICT company. At university, I never took a course on this topic, and it’s not easy to understand when there are many new concepts and few resources on the Internet. So, what is Business Intelligence? Why do we usually meet it at work and not during our studies?

Business Intelligence is a discipline that analyzes a company’s data using technologies, such as statistics and…


The reasons behind this one-time bonus: the stories I wrote in April vs the read stories in April

Photo by Brooke Cagle on Unsplash

I was totally stupefied when I received the email about the $500 from Medium. At first, I saw only two different payments in my Partner Program earnings for April. One payment was right and the other seemed to come from a parallel universe. The first thought that came to my mind was that there was surely a mistake.

But my doubts were cleared when Medium sent a letter that said:


Import data and do both simple and multiple aggregations

Flowers
Photo by John-Mark Smith on Unsplash.

When you work with data in Python, there is surely a library that will never leave your side: pandas. It’s a pretty powerful and intuitive open source library that provides data structures that are useful for dealing with high-dimensional datasets.

There are two principal data structures:

  • Series for one-dimensional arrays.
  • DataFrame for two-dimensional tables that contain rows and columns.
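To make the two structures concrete, here is a minimal sketch. The flower values and the names `petal_lengths` and `flowers` are invented for illustration and are not taken from the article.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
petal_lengths = pd.Series([1.4, 4.7, 6.0], name="petal_length")

# A DataFrame is a two-dimensional table; each column is itself a Series.
flowers = pd.DataFrame(
    {
        "species": ["setosa", "versicolor", "virginica"],
        "petal_length": [1.4, 4.7, 6.0],
    }
)

print(petal_lengths.ndim)  # number of dimensions of the Series
print(flowers.shape)       # (rows, columns) of the DataFrame
```

Selecting a single column, for example `flowers["petal_length"]`, gives back a Series, which is a handy way to remember how the two structures relate.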

In this article, I will focus on the most useful functions that split the dataset into groups. Then you can compute statistics, such as average, standard deviation, maximum, minimum, and much more.
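As a rough idea of what splitting into groups and computing statistics looks like, here is a small sketch. The `sales` table, its column names, and its values are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical toy data, invented for this example.
sales = pd.DataFrame(
    {
        "shop": ["A", "A", "B", "B", "B"],
        "amount": [10.0, 20.0, 5.0, 15.0, 25.0],
    }
)

# Simple aggregation: one statistic per group.
mean_per_shop = sales.groupby("shop")["amount"].mean()

# Multiple aggregations: several statistics per group in one call.
stats_per_shop = sales.groupby("shop")["amount"].agg(["mean", "std", "min", "max"])

print(mean_per_shop)
print(stats_per_shop)
```

Passing a list of function names to `agg` returns one column per statistic, so the simple and the multiple case differ only in the last call.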

You’ll learn to utilize the apply, cut, groupby, and…


Let’s apply Isolation Forest with scikit-learn using the Iris Dataset

Image of a red flower among yellow flowers
Photo by Rupert Britton on Unsplash

Anomaly detection is the identification of rare observations with extreme values that differ drastically from the rest of the data points. These items are called outliers and need to be identified in order to be separated from the normal items. There can be many causes for these anomalous observations: variability in the data, errors made during data collection, or something new and rare that has happened. The last cause is not an error, even though an error is what you would usually expect.
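Since the title mentions Isolation Forest, scikit-learn, and the Iris dataset, here is a minimal sketch of how the three can fit together. The parameter values, in particular `contamination=0.05`, are arbitrary assumptions for illustration and not necessarily the settings used in the article.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

X = load_iris().data  # 150 flowers described by 4 numeric features

# contamination is the assumed share of outliers; 0.05 is an arbitrary choice.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = model.fit_predict(X)  # 1 = normal item, -1 = outlier

outliers = X[labels == -1]
print(f"{len(outliers)} of {len(X)} observations flagged as outliers")
```

The model only flags which points are unusual. Deciding whether a flagged point is an error or a genuinely rare event still requires looking at the data.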

Managing outliers is challenging because it’s usually not possible to understand if the issue is linked to the wrong gathering of data or…


Data Science, Machine Learning

An application on PHM08 Challenge Data Set provided by NASA

Illustration by author

The increasing amount of data, together with technological improvements, has led to significant changes in machine maintenance strategies. The possibility of monitoring a machine’s condition has given rise to Predictive Maintenance (PM). PM has evolved over the last decade and is characterized by the use of the machine’s historical time series data, collected through sensors. Using the available data, it’s possible to provide effective solutions with Machine Learning and Deep Learning approaches. Predictive Maintenance makes it possible to minimize downtime and maximize equipment lifetime.

One critical part of PM is the prediction of the Remaining Useful Life (RUL). It helps to understand the…


Optimization

A guide to understanding how to minimize a cost function in your Machine Learning algorithm

Photo by Tom Swinnen on Pexels

During my master's degree in Data Science, I encountered optimizers in most of my courses. At first, I didn’t understand the concepts behind the algorithms very well, because they were presented with many mathematical formulas, which left me even more confused. Then I looked at some tutorials on the Internet, and finally I was able to understand the meaning behind these optimizers.

Optimization algorithms play a key role in Machine Learning and Deep Learning. Without them, we can’t build any model to make predictions. Moreover, depending on the optimization algorithm chosen, the model will perform in a different way…
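To give a feel for what minimizing a cost function means, here is a small sketch of plain gradient descent on a toy one-dimensional cost. Gradient descent is chosen here only as the simplest example. The cost function, learning rate, and number of steps are all assumptions made for illustration.

```python
def cost(w):
    """Toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate=0.1, n_steps=100):
    """Repeatedly move w a small step against the gradient."""
    w = w_start
    for _ in range(n_steps):
        w -= learning_rate * gradient(w)
    return w


w_best = gradient_descent(w_start=0.0)
print(w_best, cost(w_best))
```

With a learning rate that is too large, the same loop can overshoot the minimum instead of approaching it, which is one reason the choice of optimizer and its settings changes how a model performs.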


Data Science

Image by author

I met the confusion matrix during the first year of my Data Science master's degree. The first time the professor explained it, I couldn’t feel anything but CONFUSION! For this reason, I want to explain in simple words the concepts behind this matrix, which will be your partner every time you need to evaluate the performance of a model.

So, what is a Confusion Matrix? Why do we need it? Generally, it’s a tool that helps you understand whether the model is working well. Moreover, from it, you can derive many evaluation measures, such as accuracy, precision…
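As a rough illustration of how accuracy and precision come out of the matrix, here is a sketch for a binary problem. The labels in `y_true` and `y_pred` are made up, and the helper `confusion_counts` is a hypothetical name written for this example, not a library function.

```python
def confusion_counts(y_true, y_pred):
    """Count the four cells of a binary confusion matrix (1 = positive)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # true positive
        elif actual == 0 and predicted == 1:
            fp += 1  # false positive
        elif actual == 0 and predicted == 0:
            tn += 1  # true negative
        else:
            fn += 1  # false negative
    return tp, fp, tn, fn


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct

print(tp, fp, tn, fn)
print(accuracy, precision)
```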

Eugenia Anello

I am a Data Science student and a travel enthusiast | I learn something new every day | https://www.linkedin.com/in/eugenia-anello-545711146
