Getting Started 2: How to Load and Visualize Data for Cyber Threat Intelligence Analysis
To make this step more engaging, we’ll use a dataset on State-Sponsored Cyber Operations. This dataset provides an opportunity to apply EDA techniques to real-world, security-related data, helping to uncover patterns and relationships that may inform further modeling or research.
Tutorial Objectives
By the end of this tutorial, you will be able to:
Load, clean, and manipulate data using Pandas DataFrames
Perform numerical operations and transformations using NumPy
Create informative visualizations using Matplotlib and Seaborn
Notebook Setup and Visualization Configuration
Key Configuration Commands
%matplotlib inlineEnsures that all matplotlib plots are displayed directly within the notebook cells.%config InlineBackend.figure_formatControls the output format of plots. Different formats are suited for different use cases:
Format |
Description |
When to Use |
|---|---|---|
|
High-resolution (2× DPI) PNG output |
Ideal for high-quality visuals. |
|
Standard-resolution PNG output |
Suitable for general use; faster rendering and smaller output size. |
|
Scalable Vector Graphics (vector-based) |
Best for publication-quality plots that scale without losing quality; ideal for LaTeX or web embedding. |
|
Portable Document Format (vector-based) |
Useful for exporting plots to reports; less common in notebooks. |
|
Compressed raster format |
Rarely used for plots due to lossy compression; generally not recommended for analysis. |
[83]:
# Render our plots inline
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
[84]:
# Import requiered packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
Modifying the Style of Graphs
[85]:
# Modifying the style of the graphs
plt.style.use(['seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-whitegrid'])
Load the Data
pd.read_csv function has many options; you can explore them in the official documentation.Features in the Dataset
Title: Title/Summary of the cyber operation.
Date: Date when the cyber operation was conducted.
Affiliations: Affiliations behind the cyber operation, if known.
Descriptions: Description of the cyber operation conducted.
Response: Response of the suspects behind the cyber operation, if any.
Victims: Victims that were targeted by the cyber operation, if known.
Sponsor: The suspected state sponsor of the cyber operation.
Type: Type of cyber operation conducted.
Category: Demographic category of the cyber operation’s victims.
Sources_1: Source that reported the cyber operation.
Sources_2: Source #2 that reported the cyber operation, if any.
Sources_3: Source #3 that reported the cyber operation, if any.
[86]:
df = pd.read_csv(
'cyber-operations-incidents.csv',
delimiter=',', # Define a custom separator (default is ',')
header=0, # Row to use as column names
index_col=None, # Column to use as row labels
na_values=['NA'], # Values to treat as missing
encoding='utf-8' # File encoding
)
Notes:
delimiterlets you specify the character separating fields (default is comma,).headerindicates which row to treat as column names.index_colallows using a column as the DataFrame index.na_valuesdefines strings to interpret as missing data (NaN).Always ensure the encoding matches your file to avoid read errors.
Data Inspection and Basic Statistics
Before performing detailed analysis, it is important to inspect the dataset and understand its structure.
Preview the first 5 rows of the dataset:
[87]:
df.head()
[87]:
| Title | Date | Affiliations | Description | Response | Victims | Sponsor | Type | Category | Sources_1 | Sources_2 | Sources_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Targeting of dissidents and opposition politic... | 1/20/2022 | Believed to be the work of Ghana's government | The Israeli spyware firm NSO Group sold Pegasu... | Denial https://www.haaretz.com/israel-news/t... | Political dissidents and opposition leaders in... | Ghana | Espionage | Civil society | https://www.haaretz.com/israel-news/tech-news/... | NaN | NaN |
| 1 | Targeting of Ukrainian state entities | 3/7/2022 | UNC1151 | The Belarusian threat actor UNC1151 targeted m... | NaN | Ukrainian state entities | Belarus | Espionage | Government | https://cert.gov.ua/article/37626 | NaN | NaN |
| 2 | Targeting of Russian and Belarusian websites | 5/4/2022 | Ukrainian IT Army | Ukraine?? IT Army compromised Docker Engine ho... | NaN | Russian and Belarusian websites were targeted,... | Ukraine | Denial of service | Government, Private sector, Civil society | https://www.crowdstrike.com/blog/compromised-d... | NaN | NaN |
| 3 | Targeting of the Belgian Federal Public Servic... | 7/18/2022 | Zirconium | Chinese threat actors targeted the Belgian Fed... | Denouncement https://diplomatie.belgium.be/e... | Belgian Federal Public Service Interior | China | Espionage | Government | https://diplomatie.belgium.be/en/news/declarat... | https://www.bleepingcomputer.com/news/security... | NaN |
| 4 | Targeting of Ukrainian government agencies | 8/11/2022 | Gamaredon | The Computer Emergency Response Team of Ukrain... | Confirmation | Ukrainian government agencies | Russian Federation | Espionage | Government | https://cert.gov.ua/article/2681855 | NaN | NaN |
Preview the last 2 rows of the dataset:
[88]:
df.tail(2)
[88]:
| Title | Date | Affiliations | Description | Response | Victims | Sponsor | Type | Category | Sources_1 | Sources_2 | Sources_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 766 | Trisis | NaN | Also known as Triton and Xenotime | This threat actor targets the Triconex safety ... | Confirmation https://www.justice.gov/opa/pr/... | Saudi Arabia | Russian Federation | Sabotage | Private sector | https://www.fireeye.com/blog/threat-research/2... | https://www.fireeye.com/blog/threat-research/2... | https://dragos.com/blog/trisis/TRISIS-01.pdf |
| 767 | Targeting of the Belgian Federal Public Servic... | NaN | APT 30 | Chinese threat actors targeted the Belgian Fed... | Denouncement https://diplomatie.belgium.be/e... | Belgian Federal Public Service Interior | China | Espionage | Government | https://diplomatie.belgium.be/en/news/declarat... | https://www.bleepingcomputer.com/news/security... | NaN |
[89]:
df[3:7]
[89]:
| Title | Date | Affiliations | Description | Response | Victims | Sponsor | Type | Category | Sources_1 | Sources_2 | Sources_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Targeting of the Belgian Federal Public Servic... | 7/18/2022 | Zirconium | Chinese threat actors targeted the Belgian Fed... | Denouncement https://diplomatie.belgium.be/e... | Belgian Federal Public Service Interior | China | Espionage | Government | https://diplomatie.belgium.be/en/news/declarat... | https://www.bleepingcomputer.com/news/security... | NaN |
| 4 | Targeting of Ukrainian government agencies | 8/11/2022 | Gamaredon | The Computer Emergency Response Team of Ukrain... | Confirmation | Ukrainian government agencies | Russian Federation | Espionage | Government | https://cert.gov.ua/article/2681855 | NaN | NaN |
| 5 | Targeting of Palestinian entities and activists | 2/2/2022 | APT-C-23 | The Hamas-linked threat actor APT-C-23 used ma... | NaN | NaN | NaN | Espionage | Civil society | https://www.cyberscoop.com/palestinian-targete... | https://blog.talosintelligence.com/2022/02/ari... | NaN |
| 6 | Targeting of government agencies in the Europe... | 4/5/2022 | Gamaredon | Russian threat actor Gamaredon targeted Ukrain... | Confirmation https://cert.gov.ua/article/39086 | Government agencies of EU member states | Russian Federation | Espionage | Government | https://www.bleepingcomputer.com/news/security... | NaN | NaN |
"Title", "Date", and "Sponsor":[90]:
df.loc[3:7, ["Title", "Date", "Sponsor"]]
[90]:
| Title | Date | Sponsor | |
|---|---|---|---|
| 3 | Targeting of the Belgian Federal Public Servic... | 7/18/2022 | China |
| 4 | Targeting of Ukrainian government agencies | 8/11/2022 | Russian Federation |
| 5 | Targeting of Palestinian entities and activists | 2/2/2022 | NaN |
| 6 | Targeting of government agencies in the Europe... | 4/5/2022 | Russian Federation |
| 7 | Targeting of Rostec | 5/19/2022 | China |
[91]:
df.iloc[[0,1,3,4,6], 0:3]
[91]:
| Title | Date | Affiliations | |
|---|---|---|---|
| 0 | Targeting of dissidents and opposition politic... | 1/20/2022 | Believed to be the work of Ghana's government |
| 1 | Targeting of Ukrainian state entities | 3/7/2022 | UNC1151 |
| 3 | Targeting of the Belgian Federal Public Servic... | 7/18/2022 | Zirconium |
| 4 | Targeting of Ukrainian government agencies | 8/11/2022 | Gamaredon |
| 6 | Targeting of government agencies in the Europe... | 4/5/2022 | Gamaredon |
Obtain a concise summary of the dataset, including column names, non-null counts, and data types:
[92]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Title 768 non-null object
1 Date 753 non-null object
2 Affiliations 621 non-null object
3 Description 768 non-null object
4 Response 186 non-null object
5 Victims 741 non-null object
6 Sponsor 728 non-null object
7 Type 731 non-null object
8 Category 743 non-null object
9 Sources_1 763 non-null object
10 Sources_2 488 non-null object
11 Sources_3 190 non-null object
dtypes: object(12)
memory usage: 72.1+ KB
Generate descriptive statistics for all numeric columns, such as mean, standard deviation, minimum, and maximum values:
[93]:
df.describe()
[93]:
| Title | Date | Affiliations | Description | Response | Victims | Sponsor | Type | Category | Sources_1 | Sources_2 | Sources_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 768 | 753 | 621 | 768 | 186 | 741 | 728 | 731 | 743 | 763 | 488 | 190 |
| unique | 764 | 603 | 377 | 767 | 128 | 657 | 61 | 12 | 35 | 715 | 470 | 186 |
| top | Targeting of Southeast Asian telecommunication... | 10/4/2018 | Lazarus Group | Chinese threat actors targeted the Belgian Fed... | Unknown | United States | China | Espionage | Private sector | https://blogs.microsoft.com/on-the-issues/2020... | https://www.bleepingcomputer.com/news/security... | https://www.fireeye.com/content/dam/fireeye-ww... |
| freq | 2 | 6 | 26 | 2 | 47 | 21 | 263 | 608 | 199 | 5 | 3 | 2 |
The dataset contains 768 rows and 12 columns:
[94]:
df.shape
[94]:
(768, 12)
[95]:
# Number of rows/observations - the data numerosity
df.shape[0]
[95]:
768
[96]:
# Number of features/column/attributes - the data dimensionality
df.shape[1]
[96]:
12
These commands provide a first overview of the data, helping identify missing values, data types, and key summary statistics.
Data Cleaning and Preparation
We can list all columns and check their data types using:
[97]:
df.columns
[97]:
Index(['Title', 'Date', 'Affiliations', 'Description', 'Response', 'Victims',
'Sponsor', 'Type', 'Category', 'Sources_1', 'Sources_2', 'Sources_3'],
dtype='object')
[98]:
df.dtypes
[98]:
Title object
Date object
Affiliations object
Description object
Response object
Victims object
Sponsor object
Type object
Category object
Sources_1 object
Sources_2 object
Sources_3 object
dtype: object
- Columns with the data type
objectare strings. - Some columns, such as dates, are better represented in a proper datetime format.
To convert strings to datetime, we can use
pd.to_datetime. Before doing so, it is important to understand Python date parsing conventions. See the official documentation for reference.Relevant formatting codes include:
%m– Month as a zero-padded decimal number%d– Day of the month as a zero-padded decimal number%Y– Year with century as a decimal number
[99]:
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df.dtypes
[99]:
Title object
Date datetime64[ns]
Affiliations object
Description object
Response object
Victims object
Sponsor object
Type object
Category object
Sources_1 object
Sources_2 object
Sources_3 object
dtype: object
[100]:
df.head(2)
[100]:
| Title | Date | Affiliations | Description | Response | Victims | Sponsor | Type | Category | Sources_1 | Sources_2 | Sources_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Targeting of dissidents and opposition politic... | 2022-01-20 | Believed to be the work of Ghana's government | The Israeli spyware firm NSO Group sold Pegasu... | Denial https://www.haaretz.com/israel-news/t... | Political dissidents and opposition leaders in... | Ghana | Espionage | Civil society | https://www.haaretz.com/israel-news/tech-news/... | NaN | NaN |
| 1 | Targeting of Ukrainian state entities | 2022-03-07 | UNC1151 | The Belarusian threat actor UNC1151 targeted m... | NaN | Ukrainian state entities | Belarus | Espionage | Government | https://cert.gov.ua/article/37626 | NaN | NaN |
Display all rows in the DataFrame that are identified as duplicates based on all columns:
[101]:
df[df.duplicated()==True]
[101]:
| Title | Date | Affiliations | Description | Response | Victims | Sponsor | Type | Category | Sources_1 | Sources_2 | Sources_3 |
|---|
Display all rows where duplicate values occur in the “Description” column:
[102]:
df.loc[df.duplicated(subset=['Description'], keep=False)]
[102]:
| Title | Date | Affiliations | Description | Response | Victims | Sponsor | Type | Category | Sources_1 | Sources_2 | Sources_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Targeting of the Belgian Federal Public Servic... | 2022-07-18 | Zirconium | Chinese threat actors targeted the Belgian Fed... | Denouncement https://diplomatie.belgium.be/e... | Belgian Federal Public Service Interior | China | Espionage | Government | https://diplomatie.belgium.be/en/news/declarat... | https://www.bleepingcomputer.com/news/security... | NaN |
| 767 | Targeting of the Belgian Federal Public Servic... | NaT | APT 30 | Chinese threat actors targeted the Belgian Fed... | Denouncement https://diplomatie.belgium.be/e... | Belgian Federal Public Service Interior | China | Espionage | Government | https://diplomatie.belgium.be/en/news/declarat... | https://www.bleepingcomputer.com/news/security... | NaN |
Boolean Indexing
[103]:
# Define date range for 2022-2023
start_date = pd.Timestamp('2022-01-01')
end_date = pd.Timestamp('2023-12-31')
# Boolean condition
boolean_condition = (df['Sponsor'] == 'Belarus') & (df['Date'] >= start_date) & (df['Date'] <= end_date)
# Filtered DataFrame
df_filtered = df[boolean_condition]
df_filtered.shape[0] # Number of rows in the filtered DataFrame
[103]:
2
Aggregation
Aggregation allows us to summarize data by groups, which is useful to identify patterns and trends in cyber operations.
For example, we can count the number of cyber operations conducted by each sponsor
[104]:
# Count the number of cyber operations per sponsor
df.groupby("Sponsor", as_index=False).size().sort_values(by='size', ascending=False)
[104]:
| Sponsor | size | |
|---|---|---|
| 6 | China | 263 |
| 37 | Russian Federation | 155 |
| 20 | Iran (Islamic Republic of) | 98 |
| 26 | Korea (Democratic People's Republic of) | 79 |
| 53 | United States | 12 |
| ... | ... | ... |
| 25 | Kazakhstan | 1 |
| 28 | Lebanon | 1 |
| 31 | Myanmar | 1 |
| 32 | Netherlands | 1 |
| 30 | Morocco | 1 |
61 rows × 2 columns
Explanation:
groupby("Sponsor")groups the data by theSponsorcolumn..size()counts the number of rows in each group, representing the number of operations..sort_values(by='size', ascending=False)sorts the sponsors from most to least active.
Sum the number of victims per sponsor to identify which sponsors have caused the greatest impact:
[105]:
# Sum the number of victims per sponsor
df.groupby("Sponsor", as_index=False)["Victims"].sum().sort_values(by="Victims", ascending=False)
[105]:
| Sponsor | Victims | |
|---|---|---|
| 55 | United States, Australia, Canada, New Zealand,... | Yandex |
| 27 | Korea (Republic of) | World Health OrganizationNorth Koreans and Nor... |
| 59 | Uzbekistan | Uzbek activists and dissidents |
| 28 | Lebanon | United States, Canada, Germany, Lebanon, France |
| 3 | Belarus | Ukrainian state entitiesOver seventy Ukrainian... |
| ... | ... | ... |
| 21 | Israel | An activist in IsraelMayors and other local of... |
| 52 | United Kingdom, United States | Algeria, Afghanistan, Belgium, Brazil, Fiji, G... |
| 20 | Iran (Islamic Republic of) | Albanian government networksVMware Horizon ser... |
| 39 | Saudi Arabia, United Arab Emirates | Al Jazeera employees |
| 0 | United Kingdom" | Israel |
61 rows × 2 columns
Individual Variables Exploration
Display the number of records per year by extracting the year component from the Date column and counting its occurrences, sorted in chronological order:
[106]:
df['Date'].dt.year.value_counts().sort_index()
[106]:
Date
2005.0 1
2006.0 3
2007.0 10
2008.0 7
2009.0 4
2010.0 9
2011.0 15
2012.0 12
2013.0 18
2014.0 32
2015.0 34
2016.0 33
2017.0 46
2018.0 80
2019.0 74
2020.0 120
2021.0 108
2022.0 147
Name: count, dtype: int64
Visualize the trend of cyber operations over time by plotting the number of recorded cyber events per year:
[107]:
# Extract number of cyber operations per year
cyber_per_year = df['Date'].dt.year.value_counts().sort_index()
# Set plot style
sns.set(style="whitegrid")
# Create the plot
plt.figure(figsize=(10,6))
sns.lineplot(x=cyber_per_year.index.astype(int), y=cyber_per_year.values, marker="o")
plt.title("Number of Cyber Events per Year", fontsize=16)
plt.xlabel("Year", fontsize=12)
plt.ylabel("Cyber Operations", fontsize=12)
plt.xticks(cyber_per_year.index.astype(int)) # Ensure x-axis shows integer ticks
plt.grid(True)
plt.show()
Display the 15 most frequent affiliations involved in cyber operations, excluding entries labeled as Unknown, to identify the primary actors appearing in the dataset:
[108]:
aff_counts = df[df.Affiliations!='Unknown'].Affiliations.value_counts()
aff_counts.nlargest(15)
[108]:
Affiliations
Lazarus Group 26
APT 28 15
Believed to be the work of APT 28 15
Sandworm 15
Winnti Umbrella 10
Mustang Panda 10
Gamaredon 10
Ocean Lotus 9
MuddyWater 9
Kimsuky 9
Charming Kitten 8
Believed to be the work of?APT 28 7
Turla 7
Believed to be the work of Russia's government 6
APT 10 6
Name: count, dtype: int64
Visualize the top 15 affiliations involved in cyber operations using a bar chart:
[109]:
top_10_aff = aff_counts.nlargest(15)
plt.figure()
ax = sns.barplot(x=top_10_aff.values, y=top_10_aff.index)
ax.bar_label(ax.containers[0])
ax.set_title('Top 10 Affiliations')
plt.ylabel('Affiliation')
plt.show()
Display the top 15 state sponsors associated with cyber operations in the dataset:
[110]:
top_15_sponsors = df[df.Sponsor!='Unknown'].Sponsor.value_counts().nlargest(15)
plt.figure()
ax = sns.barplot(x=top_15_sponsors.values, y=top_15_sponsors.index)
ax.bar_label(ax.containers[0])
ax.set_title('Top 15 Sponsors')
plt.ylabel('Sponsor')
plt.show()
Exercises
Answer the following Questions about the Dataset
Exercise 1: Distribution of Operation Types
Investigate the distribution of Type of state-sponsored cyber operations in the dataset.
[111]:
# TODO: Step 1 - Remove invalid or unknown Type entries
# Hint: Filter out invalid rows
[112]:
# TODO: Step 2 - Create the countplot
# Hint: Use sns.countplot() with the `order` parameter set to value_counts().index
[113]:
# TODO: Step 3 - Annotate counts above each bar
# Hint: Use ax.bar_label(ax.containers[0])
[114]:
# TODO: Step 4 - Set title and rotate x-axis labels
# show the plot
Exercise 2: Top 10 Affiliations
Some affiliations refer to the same group in different ways (e.g., “APT 28”). Normalize all variations containing “APT 28” into a single value, then display the top 10 affiliations using a barplot.
[115]:
# TODO: Step 1 - Normalize "APT 28" variations
# Hint: Some entries refer to the same group differently, e.g.,
# 'APT 28', 'Believed to be the work of APT 28', 'Believed to be the work of?APT 28'
[116]:
# TODO: Step 2 - Filter out 'Unknown' affiliations
[117]:
# TODO: Step 3 - Count affiliations and get top 10
[118]:
# TODO: Step 4 - Create a barplot of the top 10 affiliations
Exercise 3: APT 28 Cyber Attacks Over Time
Analyze the activity of APT 28 by plotting the number of cyber operations attributed to this group over the years 2010–2020.
[119]:
# TODO: Step 1 - Filter the dataset for all rows where 'Affiliations' contains "APT 28"
# Hint: Use str.contains() with case=False and na=False
[120]:
# TODO: Step 2 - Extract the year from the 'Date' column
# Hint: Use .dt.year
[121]:
# TODO: Step 3 - Sort the counts by year
[122]:
# TODO: Step 4 - Create a line plot showing the number of attacks per year
Conclusion
In this tutorial, we learned how to load, clean, and visualize data using Python. Through Exploratory Data Analysis (EDA), we explored trends in cyber operations over time, identified the most active affiliations, sponsors, and targeted sectors, and handled inconsistent or missing data. Using plots and charts, we visualized insights from the dataset, providing a solid foundation for more advanced analyses in machine learning and cybersecurity.