Getting Started 2: How to Load and Visualize Data for Cyber Threat Intelligence Analysis

Status
Open notebook on: View filled on Github Open filled In Collab
Author: Christoph R. Landolt
In this tutorial, we will teach you how to conduct an Exploratory Data Analysis (EDA) in Python to gain a detailed understanding of a dataset.
EDA is a fundamental step in any data science workflow. It involves systematically examining the structure, quality, and relationships within the data through data loading, cleaning, statistical analysis, and visualization.

To make this step more engaging, we’ll use a dataset on State-Sponsored Cyber Operations. This dataset provides an opportunity to apply EDA techniques to real-world, security-related data, helping to uncover patterns and relationships that may inform further modeling or research.

Tutorial Objectives

By the end of this tutorial, you will be able to:

  • Load, clean, and manipulate data using Pandas DataFrames

  • Perform numerical operations and transformations using NumPy

  • Create informative visualizations using Matplotlib and Seaborn

Notebook Setup and Visualization Configuration

This section contains standard setup code that will likely be common across all notebooks in this course.
It imports the required Python packages for data analysis and visualization, and configures the notebook for high-quality plotting.

Key Configuration Commands

  • %matplotlib inline
    Ensures that all matplotlib plots are displayed directly within the notebook cells.
  • %config InlineBackend.figure_format
    Controls the output format of plots. Different formats are suited for different use cases:

Format

Description

When to Use

'retina'

High-resolution (2× DPI) PNG output

Ideal for high-quality visuals.

'png'

Standard-resolution PNG output

Suitable for general use; faster rendering and smaller output size.

'svg'

Scalable Vector Graphics (vector-based)

Best for publication-quality plots that scale without losing quality; ideal for LaTeX or web embedding.

'pdf'

Portable Document Format (vector-based)

Useful for exporting plots to reports; less common in notebooks.

'jpg' / 'jpeg'

Compressed raster format

Rarely used for plots due to lossy compression; generally not recommended for analysis.

[356]:
# Render our plots inline
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
[357]:
# Import requiered packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Modifying the Style of Graphs

Matplotlib provides several built-in styles that can be applied to improve the appearance of plots.
In this tutorial, we will use a combination of Seaborn-inspired styles to produce clear, professional, and visually appealing figures.
[358]:
# Modifying the style of the graphs
plt.style.use(['seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-whitegrid'])

Load the Data

We can load a CSV file directly from a URL using pandas, or upload it from a local directory.
The pd.read_csv function has many options; you can explore them in the official documentation.
In this tutorial, we will use the State-Sponsored Cyber Operations (2005-Present) dataset.
This dataset tracks publicly known state-sponsored cyber operations since 2005 and has been compiled from official sources such as the Council on Foreign Relations.

Features in the Dataset

  • Title: Title/Summary of the cyber operation.

  • Date: Date when the cyber operation was conducted.

  • Affiliations: Affiliations behind the cyber operation, if known.

  • Descriptions: Description of the cyber operation conducted.

  • Response: Response of the suspects behind the cyber operation, if any.

  • Victims: Victims that were targeted by the cyber operation, if known.

  • Sponsor: The suspected state sponsor of the cyber operation.

  • Type: Type of cyber operation conducted.

  • Category: Demographic category of the cyber operation’s victims.

  • Sources_1: Source that reported the cyber operation.

  • Sources_2: Source #2 that reported the cyber operation, if any.

  • Sources_3: Source #3 that reported the cyber operation, if any.

[359]:
df = pd.read_csv(
    'cyber-operations-incidents.csv',
    delimiter=',',      # Define a custom separator (default is ',')
    header=0,           # Row to use as column names
    index_col=None,     # Column to use as row labels
    na_values=['NA'],   # Values to treat as missing
    encoding='utf-8'    # File encoding
)

Notes:

  • delimiter lets you specify the character separating fields (default is comma ,).

  • header indicates which row to treat as column names.

  • index_col allows using a column as the DataFrame index.

  • na_values defines strings to interpret as missing data (NaN).

  • Always ensure the encoding matches your file to avoid read errors.

Data Inspection and Basic Statistics

Before performing detailed analysis, it is important to inspect the dataset and understand its structure.

Preview the first 5 rows of the dataset:

[360]:
df.head()

[360]:
Title Date Affiliations Description Response Victims Sponsor Type Category Sources_1 Sources_2 Sources_3
0 Targeting of dissidents and opposition politic... 1/20/2022 Believed to be the work of Ghana's government The Israeli spyware firm NSO Group sold Pegasu... Denial https://www.haaretz.com/israel-news/t... Political dissidents and opposition leaders in... Ghana Espionage Civil society https://www.haaretz.com/israel-news/tech-news/... NaN NaN
1 Targeting of Ukrainian state entities 3/7/2022 UNC1151 The Belarusian threat actor UNC1151 targeted m... NaN Ukrainian state entities Belarus Espionage Government https://cert.gov.ua/article/37626 NaN NaN
2 Targeting of Russian and Belarusian websites 5/4/2022 Ukrainian IT Army Ukraine?? IT Army compromised Docker Engine ho... NaN Russian and Belarusian websites were targeted,... Ukraine Denial of service Government, Private sector, Civil society https://www.crowdstrike.com/blog/compromised-d... NaN NaN
3 Targeting of the Belgian Federal Public Servic... 7/18/2022 Zirconium Chinese threat actors targeted the Belgian Fed... Denouncement https://diplomatie.belgium.be/e... Belgian Federal Public Service Interior China Espionage Government https://diplomatie.belgium.be/en/news/declarat... https://www.bleepingcomputer.com/news/security... NaN
4 Targeting of Ukrainian government agencies 8/11/2022 Gamaredon The Computer Emergency Response Team of Ukrain... Confirmation Ukrainian government agencies Russian Federation Espionage Government https://cert.gov.ua/article/2681855 NaN NaN

Preview the last 2 rows of the dataset:

[361]:
df.tail(2)
[361]:
Title Date Affiliations Description Response Victims Sponsor Type Category Sources_1 Sources_2 Sources_3
766 Trisis NaN Also known as Triton and Xenotime This threat actor targets the Triconex safety ... Confirmation https://www.justice.gov/opa/pr/... Saudi Arabia Russian Federation Sabotage Private sector https://www.fireeye.com/blog/threat-research/2... https://www.fireeye.com/blog/threat-research/2... https://dragos.com/blog/trisis/TRISIS-01.pdf
767 Targeting of the Belgian Federal Public Servic... NaN APT 30 Chinese threat actors targeted the Belgian Fed... Denouncement https://diplomatie.belgium.be/e... Belgian Federal Public Service Interior China Espionage Government https://diplomatie.belgium.be/en/news/declarat... https://www.bleepingcomputer.com/news/security... NaN
Selecting specific rows: You can extract a subset of rows using slicing.
For example, to select rows 3 to 6 (inclusive of 3, exclusive of 7):
[362]:
df[3:7]
[362]:
Title Date Affiliations Description Response Victims Sponsor Type Category Sources_1 Sources_2 Sources_3
3 Targeting of the Belgian Federal Public Servic... 7/18/2022 Zirconium Chinese threat actors targeted the Belgian Fed... Denouncement https://diplomatie.belgium.be/e... Belgian Federal Public Service Interior China Espionage Government https://diplomatie.belgium.be/en/news/declarat... https://www.bleepingcomputer.com/news/security... NaN
4 Targeting of Ukrainian government agencies 8/11/2022 Gamaredon The Computer Emergency Response Team of Ukrain... Confirmation Ukrainian government agencies Russian Federation Espionage Government https://cert.gov.ua/article/2681855 NaN NaN
5 Targeting of Palestinian entities and activists 2/2/2022 APT-C-23 The Hamas-linked threat actor APT-C-23 used ma... NaN NaN NaN Espionage Civil society https://www.cyberscoop.com/palestinian-targete... https://blog.talosintelligence.com/2022/02/ari... NaN
6 Targeting of government agencies in the Europe... 4/5/2022 Gamaredon Russian threat actor Gamaredon targeted Ukrain... Confirmation https://cert.gov.ua/article/39086 Government agencies of EU member states Russian Federation Espionage Government https://www.bleepingcomputer.com/news/security... NaN NaN
By label using ``df.loc``: Selects rows and columns by their names or index labels.
Example: Select rows 3 to 7 and the columns "Title", "Date", and "Sponsor":
[363]:
df.loc[3:7, ["Title", "Date", "Sponsor"]]
[363]:
Title Date Sponsor
3 Targeting of the Belgian Federal Public Servic... 7/18/2022 China
4 Targeting of Ukrainian government agencies 8/11/2022 Russian Federation
5 Targeting of Palestinian entities and activists 2/2/2022 NaN
6 Targeting of government agencies in the Europe... 4/5/2022 Russian Federation
7 Targeting of Rostec 5/19/2022 China
By position using ``df.iloc``: Selects rows and columns by integer index positions.
Example: Select rows 0, 1, 3, 4, 6 and columns 0 to 2:
[364]:
df.iloc[[0,1,3,4,6], 0:3]
[364]:
Title Date Affiliations
0 Targeting of dissidents and opposition politic... 1/20/2022 Believed to be the work of Ghana's government
1 Targeting of Ukrainian state entities 3/7/2022 UNC1151
3 Targeting of the Belgian Federal Public Servic... 7/18/2022 Zirconium
4 Targeting of Ukrainian government agencies 8/11/2022 Gamaredon
6 Targeting of government agencies in the Europe... 4/5/2022 Gamaredon

Obtain a concise summary of the dataset, including column names, non-null counts, and data types:

[365]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Title         768 non-null    object
 1   Date          753 non-null    object
 2   Affiliations  621 non-null    object
 3   Description   768 non-null    object
 4   Response      186 non-null    object
 5   Victims       741 non-null    object
 6   Sponsor       728 non-null    object
 7   Type          731 non-null    object
 8   Category      743 non-null    object
 9   Sources_1     763 non-null    object
 10  Sources_2     488 non-null    object
 11  Sources_3     190 non-null    object
dtypes: object(12)
memory usage: 72.1+ KB

Generate descriptive statistics for all numeric columns, such as mean, standard deviation, minimum, and maximum values:

[366]:
df.describe()
[366]:
Title Date Affiliations Description Response Victims Sponsor Type Category Sources_1 Sources_2 Sources_3
count 768 753 621 768 186 741 728 731 743 763 488 190
unique 764 603 377 767 128 657 61 12 35 715 470 186
top Targeting of Southeast Asian telecommunication... 10/4/2018 Lazarus Group Chinese threat actors targeted the Belgian Fed... Unknown United States China Espionage Private sector https://blogs.microsoft.com/on-the-issues/2020... https://www.bleepingcomputer.com/news/security... https://www.fireeye.com/content/dam/fireeye-ww...
freq 2 6 26 2 47 21 263 608 199 5 3 2

The dataset contains 768 rows and 12 columns:

[367]:
df.shape
[367]:
(768, 12)
[368]:
# Number of rows/observations - the data numerosity
df.shape[0]
[368]:
768
[369]:
# Number of features/column/attributes - the data dimensionality
df.shape[1]
[369]:
12

These commands provide a first overview of the data, helping identify missing values, data types, and key summary statistics.

Data Cleaning and Preparation

We can list all columns and check their data types using:

[370]:
df.columns
[370]:
Index(['Title', 'Date', 'Affiliations', 'Description', 'Response', 'Victims',
       'Sponsor', 'Type', 'Category', 'Sources_1', 'Sources_2', 'Sources_3'],
      dtype='object')
[371]:
df.dtypes
[371]:
Title           object
Date            object
Affiliations    object
Description     object
Response        object
Victims         object
Sponsor         object
Type            object
Category        object
Sources_1       object
Sources_2       object
Sources_3       object
dtype: object
  • Columns with the data type object are strings.
  • Some columns, such as dates, are better represented in a proper datetime format.
  • To convert strings to datetime, we can use pd.to_datetime. Before doing so, it is important to understand Python date parsing conventions. See the official documentation for reference.

  • Relevant formatting codes include:

    • %m – Month as a zero-padded decimal number

    • %d – Day of the month as a zero-padded decimal number

    • %Y – Year with century as a decimal number

[372]:
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df.dtypes

[372]:
Title                   object
Date            datetime64[ns]
Affiliations            object
Description             object
Response                object
Victims                 object
Sponsor                 object
Type                    object
Category                object
Sources_1               object
Sources_2               object
Sources_3               object
dtype: object
[373]:
df.head(2)
[373]:
Title Date Affiliations Description Response Victims Sponsor Type Category Sources_1 Sources_2 Sources_3
0 Targeting of dissidents and opposition politic... 2022-01-20 Believed to be the work of Ghana's government The Israeli spyware firm NSO Group sold Pegasu... Denial https://www.haaretz.com/israel-news/t... Political dissidents and opposition leaders in... Ghana Espionage Civil society https://www.haaretz.com/israel-news/tech-news/... NaN NaN
1 Targeting of Ukrainian state entities 2022-03-07 UNC1151 The Belarusian threat actor UNC1151 targeted m... NaN Ukrainian state entities Belarus Espionage Government https://cert.gov.ua/article/37626 NaN NaN

Display all rows in the DataFrame that are identified as duplicates based on all columns:

[374]:
df[df.duplicated()==True]

[374]:
Title Date Affiliations Description Response Victims Sponsor Type Category Sources_1 Sources_2 Sources_3

Display all rows where duplicate values occur in the “Description” column:

[375]:
df.loc[df.duplicated(subset=['Description'], keep=False)]

[375]:
Title Date Affiliations Description Response Victims Sponsor Type Category Sources_1 Sources_2 Sources_3
3 Targeting of the Belgian Federal Public Servic... 2022-07-18 Zirconium Chinese threat actors targeted the Belgian Fed... Denouncement https://diplomatie.belgium.be/e... Belgian Federal Public Service Interior China Espionage Government https://diplomatie.belgium.be/en/news/declarat... https://www.bleepingcomputer.com/news/security... NaN
767 Targeting of the Belgian Federal Public Servic... NaT APT 30 Chinese threat actors targeted the Belgian Fed... Denouncement https://diplomatie.belgium.be/e... Belgian Federal Public Service Interior China Espionage Government https://diplomatie.belgium.be/en/news/declarat... https://www.bleepingcomputer.com/news/security... NaN

Boolean Indexing

Boolean indexing allows us to filter rows based on one or more conditions.
For example, we can select operations conducted by a specific sponsor within a given date range.
[376]:
# Define date range for 2022-2023
start_date = pd.Timestamp('2022-01-01')
end_date = pd.Timestamp('2023-12-31')

# Boolean condition
boolean_condition = (df['Sponsor'] == 'Belarus') & (df['Date'] >= start_date) & (df['Date'] <= end_date)

# Filtered DataFrame
df_filtered = df[boolean_condition]

df_filtered.shape[0] # Number of rows in the filtered DataFrame
[376]:
2

Aggregation

Aggregation allows us to summarize data by groups, which is useful to identify patterns and trends in cyber operations.

  • For example, we can count the number of cyber operations conducted by each sponsor

[377]:
# Count the number of cyber operations per sponsor
df.groupby("Sponsor", as_index=False).size().sort_values(by='size', ascending=False)
[377]:
Sponsor size
6 China 263
37 Russian Federation 155
20 Iran (Islamic Republic of) 98
26 Korea (Democratic People's Republic of) 79
53 United States 12
... ... ...
25 Kazakhstan 1
28 Lebanon 1
31 Myanmar 1
32 Netherlands 1
30 Morocco 1

61 rows × 2 columns

Explanation:

  • groupby("Sponsor") groups the data by the Sponsor column.

  • .size() counts the number of rows in each group, representing the number of operations.

  • .sort_values(by='size', ascending=False) sorts the sponsors from most to least active.

Sum the number of victims per sponsor to identify which sponsors have caused the greatest impact:

[378]:
# Sum the number of victims per sponsor
df.groupby("Sponsor", as_index=False)["Victims"].sum().sort_values(by="Victims", ascending=False)

[378]:
Sponsor Victims
55 United States, Australia, Canada, New Zealand,... Yandex
27 Korea (Republic of) World Health OrganizationNorth Koreans and Nor...
59 Uzbekistan Uzbek activists and dissidents
28 Lebanon United States, Canada, Germany, Lebanon, France
3 Belarus Ukrainian state entitiesOver seventy Ukrainian...
... ... ...
21 Israel An activist in IsraelMayors and other local of...
52 United Kingdom, United States Algeria, Afghanistan, Belgium, Brazil, Fiji, G...
20 Iran (Islamic Republic of) Albanian government networksVMware Horizon ser...
39 Saudi Arabia, United Arab Emirates Al Jazeera employees
0 United Kingdom" Israel

61 rows × 2 columns

Individual Variables Exploration

Display the number of records per year by extracting the year component from the Date column and counting its occurrences, sorted in chronological order:

[379]:
df['Date'].dt.year.value_counts().sort_index()
[379]:
Date
2005.0      1
2006.0      3
2007.0     10
2008.0      7
2009.0      4
2010.0      9
2011.0     15
2012.0     12
2013.0     18
2014.0     32
2015.0     34
2016.0     33
2017.0     46
2018.0     80
2019.0     74
2020.0    120
2021.0    108
2022.0    147
Name: count, dtype: int64

Visualize the trend of cyber operations over time by plotting the number of recorded cyber events per year:

[380]:

# Extract number of cyber operations per year cyber_per_year = df['Date'].dt.year.value_counts().sort_index() # Set plot style sns.set(style="whitegrid") # Create the plot plt.figure(figsize=(10,6)) sns.lineplot(x=cyber_per_year.index.astype(int), y=cyber_per_year.values, marker="o") plt.title("Number of Cyber Events per Year", fontsize=16) plt.xlabel("Year", fontsize=12) plt.ylabel("Cyber Operations", fontsize=12) plt.xticks(cyber_per_year.index.astype(int)) # Ensure x-axis shows integer ticks plt.grid(True) plt.show()
../../_images/tutorial_notebooks_discover_visualize_gain_insights_discover_visualize_gain_insights_49_0.svg

Display the 15 most frequent affiliations involved in cyber operations, excluding entries labeled as Unknown, to identify the primary actors appearing in the dataset:

[381]:
aff_counts = df[df.Affiliations!='Unknown'].Affiliations.value_counts()
aff_counts.nlargest(15)
[381]:
Affiliations
Lazarus Group                                     26
APT 28                                            15
Believed to be the work of APT 28                 15
Sandworm                                          15
Winnti Umbrella                                   10
Mustang Panda                                     10
Gamaredon                                         10
Ocean Lotus                                        9
MuddyWater                                         9
Kimsuky                                            9
Charming Kitten                                    8
Believed to be the work of?APT 28                  7
Turla                                              7
Believed to be the work of Russia's government     6
APT 10                                             6
Name: count, dtype: int64

Visualize the top 15 affiliations involved in cyber operations using a bar chart:

[382]:
top_10_aff = aff_counts.nlargest(15)
plt.figure()
ax = sns.barplot(x=top_10_aff.values, y=top_10_aff.index)
ax.bar_label(ax.containers[0])
ax.set_title('Top 10 Affiliations')
plt.ylabel('Affiliation')
plt.show()
../../_images/tutorial_notebooks_discover_visualize_gain_insights_discover_visualize_gain_insights_53_0.svg

Display the top 15 state sponsors associated with cyber operations in the dataset:

[383]:
top_15_sponsors = df[df.Sponsor!='Unknown'].Sponsor.value_counts().nlargest(15)
plt.figure()
ax = sns.barplot(x=top_15_sponsors.values, y=top_15_sponsors.index)
ax.bar_label(ax.containers[0])
ax.set_title('Top 15 Sponsors')
plt.ylabel('Sponsor')
plt.show()
../../_images/tutorial_notebooks_discover_visualize_gain_insights_discover_visualize_gain_insights_55_0.svg

Exercises

Answer the following Questions about the Dataset

Exercise 1: Distribution of Operation Types

Investigate the distribution of Type of state-sponsored cyber operations in the dataset.

[384]:
# TODO: Step 1 - Remove invalid or unknown Type entries
# Hint: Filter out invalid rows


[385]:
# TODO: Step 2 - Create the countplot
# Hint: Use sns.countplot() with the `order` parameter set to value_counts().index
[386]:
# TODO: Step 3 - Annotate counts above each bar
# Hint: Use ax.bar_label(ax.containers[0])


[387]:
# TODO: Step 4 - Set title and rotate x-axis labels
# show the plot

Exercise 2: Top 10 Affiliations

Some affiliations refer to the same group in different ways (e.g., “APT 28”). Normalize all variations containing “APT 28” into a single value, then display the top 10 affiliations using a barplot.

[388]:
# TODO: Step 1 - Normalize "APT 28" variations
# Hint: Some entries refer to the same group differently, e.g.,
# 'APT 28', 'Believed to be the work of APT 28', 'Believed to be the work of?APT 28'
[389]:
# TODO: Step 2 - Filter out 'Unknown' affiliations
[390]:
# TODO: Step 3 - Count affiliations and get top 10

[391]:
# TODO: Step 4 - Create a barplot of the top 10 affiliations

Exercise 3: APT 28 Cyber Attacks Over Time

Analyze the activity of APT 28 by plotting the number of cyber operations attributed to this group over the years 2010–2020.

[392]:
# TODO: Step 1 - Filter the dataset for all rows where 'Affiliations' contains "APT 28"
# Hint: Use str.contains() with case=False and na=False
[393]:
# TODO: Step 2 - Extract the year from the 'Date' column
# Hint: Use .dt.year
[394]:
# TODO: Step 3 - Sort the counts by year
[395]:
# TODO: Step 4 - Create a line plot showing the number of attacks per year

Solution - Exercise 1: Distribution of Operation Types

First, we analyze the current missing values. There are 37 NaN entries in the Type column:

[396]:
# Step 1 - Remove invalid or unknown Type entries

# count NaN values in each column
df.isna().sum()
[396]:
Title             0
Date             15
Affiliations    147
Description       0
Response        582
Victims          27
Sponsor          40
Type             37
Category         25
Sources_1         5
Sources_2       280
Sources_3       578
dtype: int64
[397]:
# Delete rows with NaN in the 'Type' column
df.dropna(subset=['Type'], inplace=True)

# Verify that all NaNs in 'Type' are removed
df.isna().sum()
[397]:
Title             0
Date             11
Affiliations    135
Description       0
Response        549
Victims          18
Sponsor          35
Type              0
Category         12
Sources_1         4
Sources_2       270
Sources_3       552
dtype: int64

We analyze all current values in the Type column and notice that some entries contain invalid data, such as URLs.

[398]:
# show unique values in 'Type' column
df.Type.unique()

[398]:
array(['Espionage', 'Denial of service', 'Data destruction',
       'Financial Theft', 'Sabotage',
       'https://blog.talosintelligence.com/2022/03/transparent-tribe-new-campaign.html',
       'Defacement',
       'https://www.direkt36.hu/en/putyin-hekkerei-is-latjak-a-magyar-kulugy-titkait-az-orban-kormany-evek-ota-nem-birja-elharitani-oket/',
       'Private sector', 'Government', 'Government, Military', 'Doxing'],
      dtype=object)
[399]:
# Filter out rows where 'Type' is a URL using regular expression
df = df[~df['Type'].str.contains(r'^https?://', na=False)].reset_index(drop=True)

# show unique values in 'Type' column
df.Type.unique()

[399]:
array(['Espionage', 'Denial of service', 'Data destruction',
       'Financial Theft', 'Sabotage', 'Defacement', 'Private sector',
       'Government', 'Government, Military', 'Doxing'], dtype=object)

Now we observe that there are two invalid categories in the Type column. Therefore, we need to delete the rows containing 'Government' and 'Government, Military' types.

[400]:
 # List of types to remove
types_to_remove = [
    'Government',
    'Government, Military'
]

# Filter the DataFrame
df = df[~df['Type'].isin(types_to_remove)].reset_index(drop=True)
df.Type.unique()

[400]:
array(['Espionage', 'Denial of service', 'Data destruction',
       'Financial Theft', 'Sabotage', 'Defacement', 'Private sector',
       'Doxing'], dtype=object)

Create the plot.

[401]:
# Step 2 - Create the countplot
# Hint: Use sns.countplot() with the `order` parameter set to value_counts().index
plt.figure(figsize=(12,6))
ax = sns.countplot(data=df, x='Type', order=df['Type'].value_counts().index)
ax.set_title('Distribution of Operation Types', fontsize=16)

# Step 3 - Annotate counts above each bar
# Hint: Use ax.bar_label(ax.containers[0])
ax.bar_label(ax.containers[0])


# Step 4 - Set title and rotate x-axis labels
# show the plot
plt.xticks(rotation=45)
plt.show()
../../_images/tutorial_notebooks_discover_visualize_gain_insights_discover_visualize_gain_insights_80_0.svg

Solution - Exercise 2: Top 10 Affiliations

We start by printing all unique values in the Affiliations column.

[402]:
# Step 1 - Normalize "APT 28" variations
# Hint: Some entries refer to the same group differently, e.g.,
# 'APT 28', 'Believed to be the work of APT 28', 'Believed to be the work of?APT 28'

df.Affiliations.unique()


[402]:
array(["Believed to be the work of Ghana's government", 'UNC1151',
       'Ukrainian IT Army', 'Zirconium', 'Gamaredon', 'APT-C-23',
       'Believed to be the work of the Chinese government',
       'Believed to be the work of the Iranian government',
       'Lazarus Group', 'APT 35',
       'Believed to be the work of the Canadian government',
       "Believed to be the work of Israel's government.",
       'Believed to be the work of the Russian government', 'Gallium',
       'Mirage', 'Believed to be the work of the Jordanian government',
       'Lucky Cat', 'APT 40', 'Sandworm', 'APT 37',
       "Believed to be the work of Israel's government", 'APT 28', nan,
       'Believed to be the work of the Iranian government.', 'APT 18',
       'Ocean Lotus', 'Mustang Panda',
       'Believed to be the work of the government of the Russian Federation.',
       'APT 10', 'Kimsuky', 'Charming Kitten',
       "Believed to be the work of China's government.",
       'National Security Agency, Tailored Access Operations', 'APT 5',
       'Believed to be the work of the Spanish government',
       'Believed to be the work of the government of Thailand.',
       'Believed to be the work of Hamas.',
       "Believed to be the work of China's government", 'APT 36',
       "Believed to be the work of Palestine's government",
       'Iran, APT 35',
       'Believed to be the work of the Chinese government\r\n?',
       'Confucius', 'Winnti Umbrella',
       "Believed to be the work of Russia's government", 'Konni Group',
       'Emissary Panda', 'Patchwork', 'APT 41', 'Turla', 'Poison Carp',
       'Hafnium', 'Siamesekitten, OilRig',
       "Believed to be the work of El Salvador's government.",
       'The Dukes', 'Prince of Persia', 'Red Menshen', 'MuddyWater',
       'Space Pirates', 'Thrip', 'UNC1151?',
       "Believed to be the work of Bahrain's government",
       'ModifiedElephant, Pune Police', 'Also known as Space Pirates',
       'Believed to be the work of the Mexican government', 'APT 34',
       'Believed to be the work of the United Arab Emirates government',
       'Also known as PLA Unit 69010.', 'TAG-28', 'Russian Federation',
       'APT37', 'China', 'UNC215', 'BlackTech', 'RedFoxTrot', 'Cozy Bear',
       'TAG-22, Winnti Umbrella', 'Nobelium', 'Aggah, Gorgon Group',
       'TA453, Charming Kitten', 'Domestic Kitten',
       'APT 41, Barium, TA415', 'Lazarus Group?',
       'APT 31, Hurricane Panda', 'Calypso',
       'Law enforcement agencies in the United States, Canada, France, Germany, Lithuania, the Netherlands, Ukraine, and United Kingdom',
       'HoneyMyte, Mustang Panda, Bronze President',
       'North Korean government', 'Nobelium, APT 29', 'Pakistan',
       'Federal Bureau of Investigation, Bulgarian National Investigative Service',
       'APT 34, OilRig', 'Tortoiseshell, ?Charming Kitten', 'Sharp Panda',
       'APT 37, ScarCruft, Stardust Chollima', 'Emennet Pasargad?',
       'Belarus, Ghostwriter.', 'Comment Crew', 'Fancy Bear, APT 28',
       'Siamesekitten, Lyceum', 'Chinese government', 'Philippine Army',
       'MysterySnail', 'Primitive Bear, Gamaredon',
       'Judgement Panda, APT 31',
       'Also known as?Lyceum. The group has also been connected to OilRig.',
       'IndigoZebra', 'South Sudanese government', 'Tick',
       'Naikon, APT 30, Deadringer', 'Charming Kitten?',
       'Russian government', 'TA428, Wekby, Dynamite Panda',
       'Winnti Umbrella, SparklingGoblin',
       'NSA, U.S. Cyber Command, FBI, and partner agencies in several unknown countries.?',
       'Emirati government', 'Soft Cell, GALLIUM, Deadringer',
       'Believed to be the work of the Saudi Arabian government.',
       'Cozy Bear, APT29', 'Winnti Umbrella, Grayfly',
       "Believed to be the work of the Islamic Republic of Iran's government",
       'Static Kitten', 'Tonto Team or TA428', 'APT 31, Judgement Panda',
       'PLA Unit 69010, RedFoxTrot', 'Mustang Panda, Bronze President',
       'Cozy Bear, APT 29', 'APT 36, Mythic Leopard',
       'Charming Kitten, Phosphorous',
       'Believed to be the work of the Polish government.', 'Israel',
       'Saudi and Emirati governments', 'APT 30',
       'Believed to be the work of the Chinese government.?',
       'Government of China', 'Unknown, suspected Bronze Butler.',
       'Iranian government',
       'Believed to be the work of the Russian government.?',
       'Energetic Bear', 'Also known as APT 31 and Judgement Panda.',
       'Zinc', 'Believed to be the work of Togolese security forces.?',
       'DarkHotel', 'Greenbug', 'Cerium', 'Lotus Blossom',
       'Cobalt Dickens', 'Pirate Panda', 'Also known as Parasite',
       'Fox Kitten', 'Also known as Parasite, UNC757, and Fox Kitten.',
       'Government of Saudi Arabia', 'Hidden Cobra', 'Pioneer Kitten',
       'KINGDOM', 'Syrian Electronic Army', 'KilllSomeOne',
       'APT 34, possibly APT 33/APT 39.', 'Cozy Bear,?APT 29',
       'U.S. Cyber Command', 'Tonto Team',
       'Australian Signals Directorate', 'APT 33',
       'Believed to be Lazarus Group', 'Bronze Butler',
       'Guangdong State Security Department (GSSD) of China?? Ministry of State Security (MSS)',
       'Vicious Panda', 'Crouching Yeti', 'Government of Turkey',
       'Ghostwriter', 'Believed to be the work of Fancy Bear.',
       'Russian military intelligence agency, known as GRU', 'PKplug',
       'Unnamed state-sponsored APT',
       'Believed to be the work of APT 10.', 'The Lazarus Group',
       'Five Eyes',
       'Believed to be the work of Chinese state-sponsored hackers.',
       'Phosphorus', 'Government of Egypt', 'Iran', 'Rancor', 'Unknown',
       'Also known as Earth Longzhi.',
       'Believed to be the work of Leviathan.',
       'Believed to be the work of Phosphoros, also known as APT 35',
       'Government of Uzbkeistan', 'Government of the United States',
       'Government of Morocco', 'Hong Kong police',
       'Believed to be the work of Russian and Chinese hackers.',
       'Believed to be the work of APT 10 or the Jiangsu Province Ministry of State Security.',
       'the Dukes', 'China?? Ministry of State Security',
       'Believed to be the work of at least two unnamed APT groups, including Evil Eye',
       'APT 3', 'Believed to be the work of Turbine Panda', 'APT 20',
       'APTs associated with Beijing',
       'Suspected to be the work of Iranian actors.',
       'Government of the UAE', 'Iranian Revolutionary Guard',
       'Suspected to be the work of the Dukes.?',
       'Believed to be the work of Cobalt Dickens.', 'Wicked Panda',
       'Believed to be the work of Poison Carp.',
       'APT 17 and Bronze Butler', 'Believed to be the work of Winnti.',
       'Bronze President',
       'Believed to be the work of the UAE government.',
       'Also known as Evil Eye', 'Golden Falcon',
       'Also known as Billbug and Lotus Blossom.',
       'Believed to be associated with the Lazarus Group',
       'Believed to be the work of Crouching Yeti',
       'Believed to be the work of APT 28',
       'Believed to be associated with Sandworm. Also known as VPNFilter.',
       'Believed to use similar infrastructure as the NetTraveler, Icefog, and APT 17 threat actors',
       'Believed to be the work of Whitefly.?',
       'Believed to be the work of TempTick?and Turla',
       'Believed to use similar tools as DragonOK',
       "China's Jiangsu Ministry of State Security",
       'Believed to be the work of Turla',
       'Believed to be the work of APT 10',
       'Also known as Phosphoros and Newscaster Team',
       'Believed to be the work of the Lazarus Group',
       'Suspected to be the work of the Lazarus Group.',
       "China's Tianjin State Security Bureau",
       'Believed to be the work of Lazarus Group',
       'Suspected to be the work of Chinese military hackers.',
       'Believed to be the work of Leviathan',
       'Believed to be associated with the Axiom, APT 17, and Mirage threat actors. Believed to share the same tools and infrastructure as the threat actors that carried out Operation Aurora, the 2015 targeting of video game companies, the 2015 targeting of the Thai government, and the 2017 targeting of Chinese-language news websites',
       'Believed to use similar tools as the Lazarus Group',
       'Possibly linked to an Iran-based group that uses the Shamoon virus to target oil and gas entities with a connection to Saudi Arabia.',
       'Also known as Transparent Tribe and C-Major',
       'Believed to be the work of?APT 28',
       'Believed to be linked to Crouching Yeti',
       'Believed to be the work of Gamaredon.',
       'Believed to be linked to the Crouching Yeti threat actor',
       'Suspected to be the work of North Korean authorities.',
       'Also known as Bronze President and?HoneyMyte.?Also known as RedDelta.',
       "Linked to Turbine Panda and China's Ministry of State Security",
       'Believed to be the work of the same threat actors behind the alleged Russian compromise of networking equipment',
       'Believed to be the work of the Mexican government.',
       'Believed to be the work of Russian hacker group Snake.',
       'Believed to be the work of Russian state-sponsored group Dragonfly, also known as Energetic Bear',
       'Believed to be the work of the same actors behind the compromise of the 2018 Winter Olympics',
       'Suspected to be the work of Chinese operatives working for the Ministry of State Security.',
       'Believed to be the work of Sandworm.',
       'Also known as the Silent Librarian. Believed to be associated with the Mabna Institute.',
       'Believed to be associated with the U.S. targeting of the Islamic State Group.',
       'Believed to be the work of a Russian hacking group.',
       'Believed to be the work of a subset of the Lazarus Group known as Bluenoroff',
       'Also known as TEMP.Zagros, Seedworm and Static Kitten.?',
       'Believed to be the work of APT 28.?',
       "Also known as TEMP.Periscope, Leviathan, and Mudcarp.?Believed to be behind the compromise of Cambodia?? election organizations?and the Target of universities' maritime military secrets.",
       'Possibly linked to Sandworm.', 'Possibly linked to APT 30',
       'Also known as Tick and RedBaldKnight',
       'Also known as Cloud Hopper, Red Apollo, CNVX, Stone Panda, MenuPass, and POTASSIUM.?Believed to be a part of the Tianjin bureau of the Chinese Ministry of State Security.',
       'Believed to be the work of the same threat actors who used the?Duqu 2.0 tool.?',
       'Possibly linked to OilRig', 'Possibly linked to NetTraveler',
       'Believed to be the work of Winnti Umbrella',
       'Believed to be the work of the Lazarus Group.',
       'Also known as Magnallium and Elfin. Suspected to be linked to the Shamoon malware attacks in 2018.',
       'Also known as the Lamberts',
       'Also known as Cobalt Gypsy and possibly associated with Rocket Kitten',
       'Associated with Turla', 'Believed to be the work of the Dukes',
       'Believed to have been partially the work of Chrysene',
       "This incident is believed to be the work of APT 28. The Dukes were also found to be in the Democratic National Committee's networks.",
       'Also known as Twisted Kitten and Crambus; possibly related to Magic Hound,?APT 34, and Chrysene.?Believed to use similar tools and infrastructure as Chafer.? The group has also been connected to SiameseKitten.',
       'Believed to be associated with the United Kingdom?? cyber operations against the Islamic State, revealed in 2018.',
       'Used an evolution of the tools used by Sandworm.', 'Indonesia',
       'Believed to be the work of?Winnti Umbrella',
       'Same state sponsor as the 2016 breach',
       'Believed to be responsible for Dark Seoul, Ten Days of Rain, the Sony Pictures Entertainment attack,?the SWIFT-related bank heists,?WannaCry, and?Operation GhostSecret. Believed to use the same tools as Covellite.?Known to the U.S. government as Hidden Cobra.?Also known as BlueNoroff.',
       'Possibly linked to Icefog?and the Dark Seoul compromise',
       'Believed to be the work of the Turla group',
       'Same state sponsor as the 2014 breach', 'Also known as Strider',
       'Possible link to Sandworm.',
       'Also known as Naikon, PLA Unit 78020, Lotus Panda. Possibly responsible for the JadeRAT tool.?',
       'Believed to be the work of Deep Panda',
       'Also known as the Gamaredon Group and?Primitive Bear.',
       'Also known as APT 32',
       'Associated with Duqu. Believed to have been the tool used to compromise Kaspersky Labs.',
       'Believed to use similar tools and infrastructure as OilRig',
       'Also known as UPS, Gothic Panda, Buckeye, and Boyusec',
       'The government of Uganda',
       'Also known as TG-3390, APT 27, TEMP.Hippo, and?LuckyMouse.?Believed to have been responsible for the compromise of a Mongolian data center.',
       'Suspected to be the work of APT 28',
       'Also known as Deputy Dog and Axiom.?Believed to be associated with Winnti Umbrella',
       'Possibly?linked to?Saffron Rose',
       'Also known as Tilded Team, it is associated with Flame, Stuxnet, Regin, Gauss, and Duqu.',
       'Suspected to be the work of the Dukes',
       'Suspected to be the work of Deep Panda',
       'Possibly uses the same infrastructure as?Mirage',
       'Believed to be the work of the same actors behind Dark Seoul?and possibly linked to Onion Dog',
       'This threat actor is believed to have been responsible for the Japan Pension Service incident. It is also known as Emdivi?and Cloudy Omega.',
       'Possibly attributed to APT 28',
       'Believed to be the work of Saffron Rose',
       'Linked to Blue Termite',
       'Also known as Charming Kitten, APT 35, Newscaster, Ajax Security Team, Phosphorus, Group 83, and TA453. Possibly linked to Rocket Kitten',
       'Uses similar tools as Moafee?and Rancor, possibly indicating a link between the threat actors',
       'Uses similar tools as DragonOK,?possibly indicating a link between the two',
       'Also known as Spring Dragon and Eslie',
       'Also known as Energetic Bear, Dragonfly, Group 24, and Koala Team. Possibly associated with Allanite and Dymalloy.?Believed to be behind the targeting of critical infrastructure entities in the United States and Germany, prompting both countries to issue technical alerts',
       'Turla is also known as Snake, Venomous Bear, Uroburos, Group 88, Waterbug, and Turla Team. Estonian intelligence services?associate this group with the Russian federal security service (FSB). Turla is associated with Agent.btz?and believed to be behind the RUAG espionage incident, an attempted compromise of the Swiss defense ministry,?the targeting of diplomatic missions in Eastern Europe, the targeting of South Korean actors prior to the meeting of U.S. President Donald J. Trump and North Korean leader Kim Jong-un, and the compromise of the German?and Finnish foreign ministries.?',
       'Believed to be closely linked to, though distinct from, Winnti.',
       'Also known as Voodoo Bear, and Electrum. Believed to be responsible for the 2008 DDoS attacks in Georgia and the 2015?Ukraine power grid outage.?Code used by this threat actor was also found in the alleged Russian compromise of networking equipment. There is possible overlap between Black Energy and the threat actor behind the compromise of the 2018 Winter Olympics.',
       'Also known as Dynamite Panda, TG-0416, and TA428',
       'Believed to be the work of APT 18',
       'Also known as APT 2, PLA Unit 61486, and TG-6952',
       'Related to the compromise at the Office of Personnel Management',
       'Also known as IXESHE, Numbered Panda, and Group 22',
       'Also known as Mask and Ugly Face', 'Animal Farm',
       'Believed to be the work of the Equation Group',
       'Also known as Flying Kitten, Sayad, Ajax Security Team, and Group 26',
       'Also known as Cutting Kitten and Group 41',
       'Believed to be responsible for the attempted compromise of U.S. think tanks',
       'Mirage is possibly linked to the Hellsing?threat actor. It is also known as Vixen Panda, GREF, APT 15, and Playful Dragon.?Believed to be associated with Winnti Umbrella',
       'Also known as APT 29, Cozy Bear, Dark Halo, Nobelium, and Cloaked Ursa.?Estonian intelligence services associate this group with the Russian Federal Security Service (FSB) and Foreign Intelligence Service (SVR).',
       'Also known as APT 21',
       'Also known as JS Sykipot, Backdoor Sykipot, and Maverick Panda',
       'Also known as Dagger Panda and possibly linked to Onion Dog',
       'Also known as Shell Crew, Web Masters, APT 19, Kung Fu Kittens, Black Vine, Temp.Avengers, and Group 13',
       'Also known as Operation Troy and Dark Seoul',
       'Also known as APT 1, Comment Crew, Comment Panda, TG-8223, Group 3, GIF89a, and Byzantine Candor',
       'Has links to the Shadow Network?and?SabPub. Also known as TA413.',
       'A variant of Lucky Cat for Mac computers',
       'This tool is believed to be the work of the Equation Group. The tool Gauss?is believed to be based on the Flame platform.',
       'Also known as Operation Ababil',
       'Believed to be based on the Flame platform and possibly the work of the Equation Group',
       'Believed to be the work of PLA Unit 61398',
       'Possibly assisted in Operation Aurora, the RSA incident, and the Joint Strike Fighter Program compromise',
       'This is believed be a tool set used by the Equation Group. It is related to Stuxnet?and Flame.',
       'Believed to be a tool used by PLA Unit 61398',
       'Possibly linked to the Nitro incident',
       'Possibly linked to the Mitsubishi Heavy Industries incident.',
       'This incident is believed to be the work of Unit 61398 and at least two other Chinese groups, according to the New York Times. The threat actor Sneaky Panda may also have assisted.',
       'Believed to be the work of PLA Unit 61398 with possible assistance from Sneaky Panda.?Believed to be associated with Winnti Umbrella',
       'Also known as Manganese',
       'Codenamed ??lympic Games??and believed to be the work of the Equation Group',
       'Has links to Lucky Cat and GhostNet',
       'Sneaky Panda is among some of the actors suspected to be behind this incident.',
       'Believed to be linked to Sandworm.',
       'Believed?to be partially the work of PLA Unit 61398.', 'Antlion',
       'Believed to be the work of MuddyWater.',
       'Believed to be responsible for the targeting of employees of companies that operate U.S. power plants. Also believed to be linked to Crouching Yeti',
       'Believed to be the work of the Ugandan government.', 'RedAlpha',
       'Also known as Triton and Xenotime'], dtype=object)
First, we remove any rows where the Affiliations column is NaN to ensure we only work with valid entries.
Then, we use a regular expression (regex) to identify all variations of “APT 28” in the column and normalize them to a single consistent value.
[403]:
# import re for regular expressions
import re

# Drop rows where 'Affiliations' is NaN
df = df.dropna(subset=['Affiliations']).reset_index(drop=True)

# Now you can normalize APT 28
def normalize_apt28(x):
    if re.search(r'\bAPT\s*28\b', x, flags=re.IGNORECASE):
        return 'APT 28'
    return x

df['Affiliations'] = df['Affiliations'].apply(normalize_apt28)

df.Affiliations.unique()

[403]:
array(["Believed to be the work of Ghana's government", 'UNC1151',
       'Ukrainian IT Army', 'Zirconium', 'Gamaredon', 'APT-C-23',
       'Believed to be the work of the Chinese government',
       'Believed to be the work of the Iranian government',
       'Lazarus Group', 'APT 35',
       'Believed to be the work of the Canadian government',
       "Believed to be the work of Israel's government.",
       'Believed to be the work of the Russian government', 'Gallium',
       'Mirage', 'Believed to be the work of the Jordanian government',
       'Lucky Cat', 'APT 40', 'Sandworm', 'APT 37',
       "Believed to be the work of Israel's government", 'APT 28',
       'Believed to be the work of the Iranian government.', 'APT 18',
       'Ocean Lotus', 'Mustang Panda',
       'Believed to be the work of the government of the Russian Federation.',
       'APT 10', 'Kimsuky', 'Charming Kitten',
       "Believed to be the work of China's government.",
       'National Security Agency, Tailored Access Operations', 'APT 5',
       'Believed to be the work of the Spanish government',
       'Believed to be the work of the government of Thailand.',
       'Believed to be the work of Hamas.',
       "Believed to be the work of China's government", 'APT 36',
       "Believed to be the work of Palestine's government",
       'Iran, APT 35',
       'Believed to be the work of the Chinese government\r\n?',
       'Confucius', 'Winnti Umbrella',
       "Believed to be the work of Russia's government", 'Konni Group',
       'Emissary Panda', 'Patchwork', 'APT 41', 'Turla', 'Poison Carp',
       'Hafnium', 'Siamesekitten, OilRig',
       "Believed to be the work of El Salvador's government.",
       'The Dukes', 'Prince of Persia', 'Red Menshen', 'MuddyWater',
       'Space Pirates', 'Thrip', 'UNC1151?',
       "Believed to be the work of Bahrain's government",
       'ModifiedElephant, Pune Police', 'Also known as Space Pirates',
       'Believed to be the work of the Mexican government', 'APT 34',
       'Believed to be the work of the United Arab Emirates government',
       'Also known as PLA Unit 69010.', 'TAG-28', 'Russian Federation',
       'APT37', 'China', 'UNC215', 'BlackTech', 'RedFoxTrot', 'Cozy Bear',
       'TAG-22, Winnti Umbrella', 'Nobelium', 'Aggah, Gorgon Group',
       'TA453, Charming Kitten', 'Domestic Kitten',
       'APT 41, Barium, TA415', 'Lazarus Group?',
       'APT 31, Hurricane Panda', 'Calypso',
       'Law enforcement agencies in the United States, Canada, France, Germany, Lithuania, the Netherlands, Ukraine, and United Kingdom',
       'HoneyMyte, Mustang Panda, Bronze President',
       'North Korean government', 'Nobelium, APT 29', 'Pakistan',
       'Federal Bureau of Investigation, Bulgarian National Investigative Service',
       'APT 34, OilRig', 'Tortoiseshell, ?Charming Kitten', 'Sharp Panda',
       'APT 37, ScarCruft, Stardust Chollima', 'Emennet Pasargad?',
       'Belarus, Ghostwriter.', 'Comment Crew', 'Siamesekitten, Lyceum',
       'Chinese government', 'Philippine Army', 'MysterySnail',
       'Primitive Bear, Gamaredon', 'Judgement Panda, APT 31',
       'Also known as?Lyceum. The group has also been connected to OilRig.',
       'IndigoZebra', 'South Sudanese government', 'Tick',
       'Naikon, APT 30, Deadringer', 'Charming Kitten?',
       'Russian government', 'TA428, Wekby, Dynamite Panda',
       'Winnti Umbrella, SparklingGoblin',
       'NSA, U.S. Cyber Command, FBI, and partner agencies in several unknown countries.?',
       'Emirati government', 'Soft Cell, GALLIUM, Deadringer',
       'Believed to be the work of the Saudi Arabian government.',
       'Cozy Bear, APT29', 'Winnti Umbrella, Grayfly',
       "Believed to be the work of the Islamic Republic of Iran's government",
       'Static Kitten', 'Tonto Team or TA428', 'APT 31, Judgement Panda',
       'PLA Unit 69010, RedFoxTrot', 'Mustang Panda, Bronze President',
       'Cozy Bear, APT 29', 'APT 36, Mythic Leopard',
       'Charming Kitten, Phosphorous',
       'Believed to be the work of the Polish government.', 'Israel',
       'Saudi and Emirati governments', 'APT 30',
       'Believed to be the work of the Chinese government.?',
       'Government of China', 'Unknown, suspected Bronze Butler.',
       'Iranian government',
       'Believed to be the work of the Russian government.?',
       'Energetic Bear', 'Also known as APT 31 and Judgement Panda.',
       'Zinc', 'Believed to be the work of Togolese security forces.?',
       'DarkHotel', 'Greenbug', 'Cerium', 'Lotus Blossom',
       'Cobalt Dickens', 'Pirate Panda', 'Also known as Parasite',
       'Fox Kitten', 'Also known as Parasite, UNC757, and Fox Kitten.',
       'Government of Saudi Arabia', 'Hidden Cobra', 'Pioneer Kitten',
       'KINGDOM', 'Syrian Electronic Army', 'KilllSomeOne',
       'APT 34, possibly APT 33/APT 39.', 'Cozy Bear,?APT 29',
       'U.S. Cyber Command', 'Tonto Team',
       'Australian Signals Directorate', 'APT 33',
       'Believed to be Lazarus Group', 'Bronze Butler',
       'Guangdong State Security Department (GSSD) of China?? Ministry of State Security (MSS)',
       'Vicious Panda', 'Crouching Yeti', 'Government of Turkey',
       'Ghostwriter', 'Believed to be the work of Fancy Bear.',
       'Russian military intelligence agency, known as GRU', 'PKplug',
       'Unnamed state-sponsored APT',
       'Believed to be the work of APT 10.', 'The Lazarus Group',
       'Five Eyes',
       'Believed to be the work of Chinese state-sponsored hackers.',
       'Phosphorus', 'Government of Egypt', 'Iran', 'Rancor', 'Unknown',
       'Also known as Earth Longzhi.',
       'Believed to be the work of Leviathan.',
       'Believed to be the work of Phosphoros, also known as APT 35',
       'Government of Uzbkeistan', 'Government of the United States',
       'Government of Morocco', 'Hong Kong police',
       'Believed to be the work of Russian and Chinese hackers.',
       'Believed to be the work of APT 10 or the Jiangsu Province Ministry of State Security.',
       'the Dukes', 'China?? Ministry of State Security',
       'Believed to be the work of at least two unnamed APT groups, including Evil Eye',
       'APT 3', 'Believed to be the work of Turbine Panda', 'APT 20',
       'APTs associated with Beijing',
       'Suspected to be the work of Iranian actors.',
       'Government of the UAE', 'Iranian Revolutionary Guard',
       'Suspected to be the work of the Dukes.?',
       'Believed to be the work of Cobalt Dickens.', 'Wicked Panda',
       'Believed to be the work of Poison Carp.',
       'APT 17 and Bronze Butler', 'Believed to be the work of Winnti.',
       'Bronze President',
       'Believed to be the work of the UAE government.',
       'Also known as Evil Eye', 'Golden Falcon',
       'Also known as Billbug and Lotus Blossom.',
       'Believed to be associated with the Lazarus Group',
       'Believed to be the work of Crouching Yeti',
       'Believed to be associated with Sandworm. Also known as VPNFilter.',
       'Believed to use similar infrastructure as the NetTraveler, Icefog, and APT 17 threat actors',
       'Believed to be the work of Whitefly.?',
       'Believed to be the work of TempTick?and Turla',
       'Believed to use similar tools as DragonOK',
       "China's Jiangsu Ministry of State Security",
       'Believed to be the work of Turla',
       'Believed to be the work of APT 10',
       'Also known as Phosphoros and Newscaster Team',
       'Believed to be the work of the Lazarus Group',
       'Suspected to be the work of the Lazarus Group.',
       "China's Tianjin State Security Bureau",
       'Believed to be the work of Lazarus Group',
       'Suspected to be the work of Chinese military hackers.',
       'Believed to be the work of Leviathan',
       'Believed to be associated with the Axiom, APT 17, and Mirage threat actors. Believed to share the same tools and infrastructure as the threat actors that carried out Operation Aurora, the 2015 targeting of video game companies, the 2015 targeting of the Thai government, and the 2017 targeting of Chinese-language news websites',
       'Believed to use similar tools as the Lazarus Group',
       'Possibly linked to an Iran-based group that uses the Shamoon virus to target oil and gas entities with a connection to Saudi Arabia.',
       'Also known as Transparent Tribe and C-Major',
       'Believed to be linked to Crouching Yeti',
       'Believed to be the work of Gamaredon.',
       'Believed to be linked to the Crouching Yeti threat actor',
       'Suspected to be the work of North Korean authorities.',
       'Also known as Bronze President and?HoneyMyte.?Also known as RedDelta.',
       "Linked to Turbine Panda and China's Ministry of State Security",
       'Believed to be the work of the same threat actors behind the alleged Russian compromise of networking equipment',
       'Believed to be the work of the Mexican government.',
       'Believed to be the work of Russian hacker group Snake.',
       'Believed to be the work of Russian state-sponsored group Dragonfly, also known as Energetic Bear',
       'Believed to be the work of the same actors behind the compromise of the 2018 Winter Olympics',
       'Suspected to be the work of Chinese operatives working for the Ministry of State Security.',
       'Believed to be the work of Sandworm.',
       'Also known as the Silent Librarian. Believed to be associated with the Mabna Institute.',
       'Believed to be associated with the U.S. targeting of the Islamic State Group.',
       'Believed to be the work of a Russian hacking group.',
       'Believed to be the work of a subset of the Lazarus Group known as Bluenoroff',
       'Also known as TEMP.Zagros, Seedworm and Static Kitten.?',
       "Also known as TEMP.Periscope, Leviathan, and Mudcarp.?Believed to be behind the compromise of Cambodia?? election organizations?and the Target of universities' maritime military secrets.",
       'Possibly linked to Sandworm.', 'Possibly linked to APT 30',
       'Also known as Tick and RedBaldKnight',
       'Also known as Cloud Hopper, Red Apollo, CNVX, Stone Panda, MenuPass, and POTASSIUM.?Believed to be a part of the Tianjin bureau of the Chinese Ministry of State Security.',
       'Believed to be the work of the same threat actors who used the?Duqu 2.0 tool.?',
       'Possibly linked to OilRig', 'Possibly linked to NetTraveler',
       'Believed to be the work of Winnti Umbrella',
       'Believed to be the work of the Lazarus Group.',
       'Also known as Magnallium and Elfin. Suspected to be linked to the Shamoon malware attacks in 2018.',
       'Also known as the Lamberts',
       'Also known as Cobalt Gypsy and possibly associated with Rocket Kitten',
       'Associated with Turla', 'Believed to be the work of the Dukes',
       'Believed to have been partially the work of Chrysene',
       'Also known as Twisted Kitten and Crambus; possibly related to Magic Hound,?APT 34, and Chrysene.?Believed to use similar tools and infrastructure as Chafer.? The group has also been connected to SiameseKitten.',
       'Believed to be associated with the United Kingdom?? cyber operations against the Islamic State, revealed in 2018.',
       'Used an evolution of the tools used by Sandworm.', 'Indonesia',
       'Believed to be the work of?Winnti Umbrella',
       'Same state sponsor as the 2016 breach',
       'Believed to be responsible for Dark Seoul, Ten Days of Rain, the Sony Pictures Entertainment attack,?the SWIFT-related bank heists,?WannaCry, and?Operation GhostSecret. Believed to use the same tools as Covellite.?Known to the U.S. government as Hidden Cobra.?Also known as BlueNoroff.',
       'Possibly linked to Icefog?and the Dark Seoul compromise',
       'Believed to be the work of the Turla group',
       'Same state sponsor as the 2014 breach', 'Also known as Strider',
       'Possible link to Sandworm.',
       'Also known as Naikon, PLA Unit 78020, Lotus Panda. Possibly responsible for the JadeRAT tool.?',
       'Believed to be the work of Deep Panda',
       'Also known as the Gamaredon Group and?Primitive Bear.',
       'Also known as APT 32',
       'Associated with Duqu. Believed to have been the tool used to compromise Kaspersky Labs.',
       'Believed to use similar tools and infrastructure as OilRig',
       'Also known as UPS, Gothic Panda, Buckeye, and Boyusec',
       'The government of Uganda',
       'Also known as TG-3390, APT 27, TEMP.Hippo, and?LuckyMouse.?Believed to have been responsible for the compromise of a Mongolian data center.',
       'Also known as Deputy Dog and Axiom.?Believed to be associated with Winnti Umbrella',
       'Possibly?linked to?Saffron Rose',
       'Also known as Tilded Team, it is associated with Flame, Stuxnet, Regin, Gauss, and Duqu.',
       'Suspected to be the work of the Dukes',
       'Suspected to be the work of Deep Panda',
       'Possibly uses the same infrastructure as?Mirage',
       'Believed to be the work of the same actors behind Dark Seoul?and possibly linked to Onion Dog',
       'This threat actor is believed to have been responsible for the Japan Pension Service incident. It is also known as Emdivi?and Cloudy Omega.',
       'Believed to be the work of Saffron Rose',
       'Linked to Blue Termite',
       'Also known as Charming Kitten, APT 35, Newscaster, Ajax Security Team, Phosphorus, Group 83, and TA453. Possibly linked to Rocket Kitten',
       'Uses similar tools as Moafee?and Rancor, possibly indicating a link between the threat actors',
       'Uses similar tools as DragonOK,?possibly indicating a link between the two',
       'Also known as Spring Dragon and Eslie',
       'Also known as Energetic Bear, Dragonfly, Group 24, and Koala Team. Possibly associated with Allanite and Dymalloy.?Believed to be behind the targeting of critical infrastructure entities in the United States and Germany, prompting both countries to issue technical alerts',
       'Turla is also known as Snake, Venomous Bear, Uroburos, Group 88, Waterbug, and Turla Team. Estonian intelligence services?associate this group with the Russian federal security service (FSB). Turla is associated with Agent.btz?and believed to be behind the RUAG espionage incident, an attempted compromise of the Swiss defense ministry,?the targeting of diplomatic missions in Eastern Europe, the targeting of South Korean actors prior to the meeting of U.S. President Donald J. Trump and North Korean leader Kim Jong-un, and the compromise of the German?and Finnish foreign ministries.?',
       'Believed to be closely linked to, though distinct from, Winnti.',
       'Also known as Voodoo Bear, and Electrum. Believed to be responsible for the 2008 DDoS attacks in Georgia and the 2015?Ukraine power grid outage.?Code used by this threat actor was also found in the alleged Russian compromise of networking equipment. There is possible overlap between Black Energy and the threat actor behind the compromise of the 2018 Winter Olympics.',
       'Also known as Dynamite Panda, TG-0416, and TA428',
       'Believed to be the work of APT 18',
       'Also known as APT 2, PLA Unit 61486, and TG-6952',
       'Related to the compromise at the Office of Personnel Management',
       'Also known as IXESHE, Numbered Panda, and Group 22',
       'Also known as Mask and Ugly Face', 'Animal Farm',
       'Believed to be the work of the Equation Group',
       'Also known as Flying Kitten, Sayad, Ajax Security Team, and Group 26',
       'Also known as Cutting Kitten and Group 41',
       'Believed to be responsible for the attempted compromise of U.S. think tanks',
       'Mirage is possibly linked to the Hellsing?threat actor. It is also known as Vixen Panda, GREF, APT 15, and Playful Dragon.?Believed to be associated with Winnti Umbrella',
       'Also known as APT 29, Cozy Bear, Dark Halo, Nobelium, and Cloaked Ursa.?Estonian intelligence services associate this group with the Russian Federal Security Service (FSB) and Foreign Intelligence Service (SVR).',
       'Also known as APT 21',
       'Also known as JS Sykipot, Backdoor Sykipot, and Maverick Panda',
       'Also known as Dagger Panda and possibly linked to Onion Dog',
       'Also known as Shell Crew, Web Masters, APT 19, Kung Fu Kittens, Black Vine, Temp.Avengers, and Group 13',
       'Also known as Operation Troy and Dark Seoul',
       'Also known as APT 1, Comment Crew, Comment Panda, TG-8223, Group 3, GIF89a, and Byzantine Candor',
       'Has links to the Shadow Network?and?SabPub. Also known as TA413.',
       'A variant of Lucky Cat for Mac computers',
       'This tool is believed to be the work of the Equation Group. The tool Gauss?is believed to be based on the Flame platform.',
       'Also known as Operation Ababil',
       'Believed to be based on the Flame platform and possibly the work of the Equation Group',
       'Believed to be the work of PLA Unit 61398',
       'Possibly assisted in Operation Aurora, the RSA incident, and the Joint Strike Fighter Program compromise',
       'This is believed be a tool set used by the Equation Group. It is related to Stuxnet?and Flame.',
       'Believed to be a tool used by PLA Unit 61398',
       'Possibly linked to the Nitro incident',
       'Possibly linked to the Mitsubishi Heavy Industries incident.',
       'This incident is believed to be the work of Unit 61398 and at least two other Chinese groups, according to the New York Times. The threat actor Sneaky Panda may also have assisted.',
       'Believed to be the work of PLA Unit 61398 with possible assistance from Sneaky Panda.?Believed to be associated with Winnti Umbrella',
       'Also known as Manganese',
       'Codenamed ??lympic Games??and believed to be the work of the Equation Group',
       'Has links to Lucky Cat and GhostNet',
       'Sneaky Panda is among some of the actors suspected to be behind this incident.',
       'Believed to be linked to Sandworm.',
       'Believed?to be partially the work of PLA Unit 61398.', 'Antlion',
       'Believed to be the work of MuddyWater.',
       'Believed to be responsible for the targeting of employees of companies that operate U.S. power plants. Also believed to be linked to Crouching Yeti',
       'Believed to be the work of the Ugandan government.', 'RedAlpha',
       'Also known as Triton and Xenotime'], dtype=object)
Next, we count how many rows in the Affiliations column contain each known alias of APT 28, including “Fancy Bear”, “Sofacy”, “Pawn Storm”, and “Sednit”.
This helps us understand how the group is referred to across the dataset. Finally, we normalize all rows containing “Fancy Bear” to the canonical name “APT 28” for consistency.
[404]:
# Count rows where 'Affiliations' contains 'Fancy Bear' (case-insensitive)
count_fancy_bear = df['Affiliations'].str.contains('Fancy Bear', case=False, na=False).sum()
# Sofacy
count_sofacy = df['Affiliations'].str.contains('Sofacy', case=False, na=False).sum()
# Pawn Storm
count_pawn_storm = df['Affiliations'].str.contains('Pawn Storm', case=False, na=False).sum()
# r'Sednit'
count_sednit = df['Affiliations'].str.contains('Sednit', case=False, na=False).sum()

print(f"Number of rows containing 'Fancy Bear': {count_fancy_bear}")
print(f"Number of rows containing 'Sofacy': {count_sofacy}")
print(f"Number of rows containing 'Pawn Storm': {count_pawn_storm}")
print(f"Number of rows containing 'Sednit': {count_sednit}")
Number of rows containing 'Fancy Bear': 2
Number of rows containing 'Sofacy': 0
Number of rows containing 'Pawn Storm': 0
Number of rows containing 'Sednit': 0
[405]:
df.loc[df['Affiliations'].str.contains('Fancy Bear', case=False, na=False), 'Affiliations'] = 'APT 28'


[406]:
# TODO: Step 2 - Filter out 'Unknown' affiliations
df = df[df['Affiliations'] != 'Unknown'].reset_index(drop=True)
[407]:
# TODO: Step 3 - Count affiliations and get top 10
# Hint: Use value_counts().nlargest(10)
top_10_aff = df['Affiliations'].value_counts().nlargest(10)

[408]:
# TODO: Step 4 - Create a barplot of the top 10 affiliations

plt.figure(figsize=(10,6))
ax = sns.barplot(x=top_10_aff.values, y=top_10_aff.index)
ax.bar_label(ax.containers[0])
ax.set_title('Top 10 Affiliations')
plt.ylabel('Affiliation')
plt.show()

../../_images/tutorial_notebooks_discover_visualize_gain_insights_discover_visualize_gain_insights_90_0.svg

Solution - Exercise 3: APT 28 Cyber Attacks Over Time

We analyze the behavior of APT 28 over time, including the malware associated with the group.
For background information and a comprehensive list of malware linked to APT 28, see Malpedia.
[ ]:
# Step 1 - Filter the dataset for all rows where 'Affiliations' contains "APT 28"
# Hint: Use str.contains() with case=False and na=False
mask_apt28 = df['Affiliations'].str.contains('APT 28', case=False, na=False)
df_apt28 = df[mask_apt28].reset_index(drop=True)
[ ]:
# Step 2 - Extract the year from the 'Date' column
# Hint: Use .dt.year
df_apt28['Year'] = df_apt28['Date'].dt.year
[ ]:
# Step 3 - Sort the counts by year
cyber_per_year_apt28 = df_apt28['Year'].value_counts().sort_index()
[ ]:
# Step 4 - Create a line plot showing the number of attacks per year

plt.figure(figsize=(10,6))
sns.lineplot(x=cyber_per_year_apt28.index.astype(int), y=cyber_per_year_apt28.values, marker="o")
plt.title("Number of APT 28 Cyber Events per Year", fontsize=16)
plt.xlabel("Year", fontsize=12)
plt.ylabel("Cyber Operations", fontsize=12)
plt.xticks(cyber_per_year_apt28.index.astype(int))  # Ensure x-axis shows integer ticks
plt.grid(True)
plt.show()

../../_images/tutorial_notebooks_discover_visualize_gain_insights_discover_visualize_gain_insights_95_0.svg

The observed peak in APT 28 activity around 2016–2019 could be related to the deployment of the malware X-Agent, which has been linked to the group during that period.

Conclusion

In this tutorial, we learned how to load, clean, and visualize data using Python. Through Exploratory Data Analysis (EDA), we explored trends in cyber operations over time, identified the most active affiliations, sponsors, and targeted sectors, and handled inconsistent or missing data. Using plots and charts, we visualized insights from the dataset, providing a solid foundation for more advanced analyses in machine learning and cybersecurity.


Star our repository If you found this tutorial helpful, please ⭐ star our repository to show your support.
Ask questions For any questions, typos, or bugs, kindly open an issue on GitHub — we appreciate your feedback!