Getting Started 2: How to Load and Visualize Data for Cyber Threat Intelligence Analysis

Open notebook on:

Author: Christoph R. Landolt

In this tutorial, we will teach you how to conduct an Exploratory Data Analysis (EDA) in Python to gain a detailed understanding of a dataset.

EDA is a fundamental step in any data science workflow. It involves systematically examining the structure, quality, and relationships within the data through data loading, cleaning, statistical analysis, and visualization.

To make this step more engaging, we’ll use a dataset on State-Sponsored Cyber Operations. This dataset provides an opportunity to apply EDA techniques to real-world, security-related data, helping to uncover patterns and relationships that may inform further modeling or research.

Tutorial Objectives

By the end of this tutorial, you will be able to:

Load, clean, and manipulate data using Pandas DataFrames
Perform numerical operations and transformations using NumPy
Create informative visualizations using Matplotlib and Seaborn

Notebook Setup and Visualization Configuration

This section contains standard setup code that will likely be common across all notebooks in this course.

It imports the required Python packages for data analysis and visualization, and configures the notebook for high-quality plotting.

Key Configuration Commands

%matplotlib inline

Ensures that all matplotlib plots are displayed directly within the notebook cells.
%config InlineBackend.figure_format

Controls the output format of plots. Different formats are suited for different use cases:

Format	Description	When to Use
`'retina'`	High-resolution (2× DPI) PNG output	Ideal for high-quality visuals.
`'png'`	Standard-resolution PNG output	Suitable for general use; faster rendering and smaller output size.
`'svg'`	Scalable Vector Graphics (vector-based)	Best for publication-quality plots that scale without losing quality; ideal for LaTeX or web embedding.
`'pdf'`	Portable Document Format (vector-based)	Useful for exporting plots to reports; less common in notebooks.
`'jpg'` / `'jpeg'`	Compressed raster format	Rarely used for plots due to lossy compression; generally not recommended for analysis.

[83]:

# Render our plots inline
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

[84]:

# Import requiered packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Modifying the Style of Graphs

Matplotlib provides several built-in styles that can be applied to improve the appearance of plots.

In this tutorial, we will use a combination of Seaborn-inspired styles to produce clear, professional, and visually appealing figures.

[85]:

# Modifying the style of the graphs
plt.style.use(['seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-whitegrid'])

Load the Data

We can load a CSV file directly from a URL using pandas, or upload it from a local directory.

The pd.read_csv function has many options; you can explore them in the official documentation.

In this tutorial, we will use the State-Sponsored Cyber Operations (2005-Present) dataset.

This dataset tracks publicly known state-sponsored cyber operations since 2005 and has been compiled from official sources such as the Council on Foreign Relations.

Features in the Dataset

Title: Title/Summary of the cyber operation.
Date: Date when the cyber operation was conducted.
Affiliations: Affiliations behind the cyber operation, if known.
Descriptions: Description of the cyber operation conducted.
Response: Response of the suspects behind the cyber operation, if any.
Victims: Victims that were targeted by the cyber operation, if known.
Sponsor: The suspected state sponsor of the cyber operation.
Type: Type of cyber operation conducted.
Category: Demographic category of the cyber operation’s victims.
Sources_1: Source that reported the cyber operation.
Sources_2: Source #2 that reported the cyber operation, if any.
Sources_3: Source #3 that reported the cyber operation, if any.

[86]:

df = pd.read_csv(
    'cyber-operations-incidents.csv',
    delimiter=',',      # Define a custom separator (default is ',')
    header=0,           # Row to use as column names
    index_col=None,     # Column to use as row labels
    na_values=['NA'],   # Values to treat as missing
    encoding='utf-8'    # File encoding
)

Notes:

delimiter lets you specify the character separating fields (default is comma ,).
header indicates which row to treat as column names.
index_col allows using a column as the DataFrame index.
na_values defines strings to interpret as missing data (NaN).
Always ensure the encoding matches your file to avoid read errors.

Data Inspection and Basic Statistics

Before performing detailed analysis, it is important to inspect the dataset and understand its structure.

Preview the first 5 rows of the dataset:

[87]:

df.head()

[87]:

	Title	Date	Affiliations	Description	Response	Victims	Sponsor	Type	Category	Sources_1	Sources_2	Sources_3
0	Targeting of dissidents and opposition politic...	1/20/2022	Believed to be the work of Ghana's government	The Israeli spyware firm NSO Group sold Pegasu...	Denial https://www.haaretz.com/israel-news/t...	Political dissidents and opposition leaders in...	Ghana	Espionage	Civil society	https://www.haaretz.com/israel-news/tech-news/...	NaN	NaN
1	Targeting of Ukrainian state entities	3/7/2022	UNC1151	The Belarusian threat actor UNC1151 targeted m...	NaN	Ukrainian state entities	Belarus	Espionage	Government	https://cert.gov.ua/article/37626	NaN	NaN
2	Targeting of Russian and Belarusian websites	5/4/2022	Ukrainian IT Army	Ukraine?? IT Army compromised Docker Engine ho...	NaN	Russian and Belarusian websites were targeted,...	Ukraine	Denial of service	Government, Private sector, Civil society	https://www.crowdstrike.com/blog/compromised-d...	NaN	NaN
3	Targeting of the Belgian Federal Public Servic...	7/18/2022	Zirconium	Chinese threat actors targeted the Belgian Fed...	Denouncement https://diplomatie.belgium.be/e...	Belgian Federal Public Service Interior	China	Espionage	Government	https://diplomatie.belgium.be/en/news/declarat...	https://www.bleepingcomputer.com/news/security...	NaN
4	Targeting of Ukrainian government agencies	8/11/2022	Gamaredon	The Computer Emergency Response Team of Ukrain...	Confirmation	Ukrainian government agencies	Russian Federation	Espionage	Government	https://cert.gov.ua/article/2681855	NaN	NaN

Preview the last 2 rows of the dataset:

[88]:

df.tail(2)

[88]:

	Title	Date	Affiliations	Description	Response	Victims	Sponsor	Type	Category	Sources_1	Sources_2	Sources_3
766	Trisis	NaN	Also known as Triton and Xenotime	This threat actor targets the Triconex safety ...	Confirmation https://www.justice.gov/opa/pr/...	Saudi Arabia	Russian Federation	Sabotage	Private sector	https://www.fireeye.com/blog/threat-research/2...	https://www.fireeye.com/blog/threat-research/2...	https://dragos.com/blog/trisis/TRISIS-01.pdf
767	Targeting of the Belgian Federal Public Servic...	NaN	APT 30	Chinese threat actors targeted the Belgian Fed...	Denouncement https://diplomatie.belgium.be/e...	Belgian Federal Public Service Interior	China	Espionage	Government	https://diplomatie.belgium.be/en/news/declarat...	https://www.bleepingcomputer.com/news/security...	NaN

Selecting specific rows: You can extract a subset of rows using slicing.

For example, to select rows 3 to 6 (inclusive of 3, exclusive of 7):

[89]:

df[3:7]

[89]:

	Title	Date	Affiliations	Description	Response	Victims	Sponsor	Type	Category	Sources_1	Sources_2	Sources_3
3	Targeting of the Belgian Federal Public Servic...	7/18/2022	Zirconium	Chinese threat actors targeted the Belgian Fed...	Denouncement https://diplomatie.belgium.be/e...	Belgian Federal Public Service Interior	China	Espionage	Government	https://diplomatie.belgium.be/en/news/declarat...	https://www.bleepingcomputer.com/news/security...	NaN
4	Targeting of Ukrainian government agencies	8/11/2022	Gamaredon	The Computer Emergency Response Team of Ukrain...	Confirmation	Ukrainian government agencies	Russian Federation	Espionage	Government	https://cert.gov.ua/article/2681855	NaN	NaN
5	Targeting of Palestinian entities and activists	2/2/2022	APT-C-23	The Hamas-linked threat actor APT-C-23 used ma...	NaN	NaN	NaN	Espionage	Civil society	https://www.cyberscoop.com/palestinian-targete...	https://blog.talosintelligence.com/2022/02/ari...	NaN
6	Targeting of government agencies in the Europe...	4/5/2022	Gamaredon	Russian threat actor Gamaredon targeted Ukrain...	Confirmation https://cert.gov.ua/article/39086	Government agencies of EU member states	Russian Federation	Espionage	Government	https://www.bleepingcomputer.com/news/security...	NaN	NaN

By label using ``df.loc``: Selects rows and columns by their names or index labels.

Example: Select rows 3 to 7 and the columns "Title", "Date", and "Sponsor":

[90]:

df.loc[3:7, ["Title", "Date", "Sponsor"]]

[90]:

	Title	Date	Sponsor
3	Targeting of the Belgian Federal Public Servic...	7/18/2022	China
4	Targeting of Ukrainian government agencies	8/11/2022	Russian Federation
5	Targeting of Palestinian entities and activists	2/2/2022	NaN
6	Targeting of government agencies in the Europe...	4/5/2022	Russian Federation
7	Targeting of Rostec	5/19/2022	China

By position using ``df.iloc``: Selects rows and columns by integer index positions.

Example: Select rows 0, 1, 3, 4, 6 and columns 0 to 2:

[91]:

df.iloc[[0,1,3,4,6], 0:3]

[91]:

	Title	Date	Affiliations
0	Targeting of dissidents and opposition politic...	1/20/2022	Believed to be the work of Ghana's government
1	Targeting of Ukrainian state entities	3/7/2022	UNC1151
3	Targeting of the Belgian Federal Public Servic...	7/18/2022	Zirconium
4	Targeting of Ukrainian government agencies	8/11/2022	Gamaredon
6	Targeting of government agencies in the Europe...	4/5/2022	Gamaredon

Obtain a concise summary of the dataset, including column names, non-null counts, and data types:

[92]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Title         768 non-null    object
 1   Date          753 non-null    object
 2   Affiliations  621 non-null    object
 3   Description   768 non-null    object
 4   Response      186 non-null    object
 5   Victims       741 non-null    object
 6   Sponsor       728 non-null    object
 7   Type          731 non-null    object
 8   Category      743 non-null    object
 9   Sources_1     763 non-null    object
 10  Sources_2     488 non-null    object
 11  Sources_3     190 non-null    object
dtypes: object(12)
memory usage: 72.1+ KB

Generate descriptive statistics for all numeric columns, such as mean, standard deviation, minimum, and maximum values:

[93]:

df.describe()

[93]:

	Title	Date	Affiliations	Description	Response	Victims	Sponsor	Type	Category	Sources_1	Sources_2	Sources_3
count	768	753	621	768	186	741	728	731	743	763	488	190
unique	764	603	377	767	128	657	61	12	35	715	470	186
top	Targeting of Southeast Asian telecommunication...	10/4/2018	Lazarus Group	Chinese threat actors targeted the Belgian Fed...	Unknown	United States	China	Espionage	Private sector	https://blogs.microsoft.com/on-the-issues/2020...	https://www.bleepingcomputer.com/news/security...	https://www.fireeye.com/content/dam/fireeye-ww...
freq	2	6	26	2	47	21	263	608	199	5	3	2

The dataset contains 768 rows and 12 columns:

[94]:

df.shape

[94]:

(768, 12)

[95]:

# Number of rows/observations - the data numerosity
df.shape[0]

[95]:

[96]:

# Number of features/column/attributes - the data dimensionality
df.shape[1]

[96]:

These commands provide a first overview of the data, helping identify missing values, data types, and key summary statistics.

Data Cleaning and Preparation

We can list all columns and check their data types using:

[97]:

df.columns

[97]:

Index(['Title', 'Date', 'Affiliations', 'Description', 'Response', 'Victims',
       'Sponsor', 'Type', 'Category', 'Sources_1', 'Sources_2', 'Sources_3'],
      dtype='object')

[98]:

df.dtypes

[98]:

Title           object
Date            object
Affiliations    object
Description     object
Response        object
Victims         object
Sponsor         object
Type            object
Category        object
Sources_1       object
Sources_2       object
Sources_3       object
dtype: object

Columns with the data type object are strings.
Some columns, such as dates, are better represented in a proper datetime format.
To convert strings to datetime, we can use pd.to_datetime. Before doing so, it is important to understand Python date parsing conventions. See the official documentation for reference.
Relevant formatting codes include:
- %m – Month as a zero-padded decimal number
- %d – Day of the month as a zero-padded decimal number
- %Y – Year with century as a decimal number

[99]:

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df.dtypes

[99]:

Title                   object
Date            datetime64[ns]
Affiliations            object
Description             object
Response                object
Victims                 object
Sponsor                 object
Type                    object
Category                object
Sources_1               object
Sources_2               object
Sources_3               object
dtype: object

[100]:

df.head(2)

[100]:

	Title	Date	Affiliations	Description	Response	Victims	Sponsor	Type	Category	Sources_1	Sources_2	Sources_3
0	Targeting of dissidents and opposition politic...	2022-01-20	Believed to be the work of Ghana's government	The Israeli spyware firm NSO Group sold Pegasu...	Denial https://www.haaretz.com/israel-news/t...	Political dissidents and opposition leaders in...	Ghana	Espionage	Civil society	https://www.haaretz.com/israel-news/tech-news/...	NaN	NaN
1	Targeting of Ukrainian state entities	2022-03-07	UNC1151	The Belarusian threat actor UNC1151 targeted m...	NaN	Ukrainian state entities	Belarus	Espionage	Government	https://cert.gov.ua/article/37626	NaN	NaN

Display all rows in the DataFrame that are identified as duplicates based on all columns:

[101]:

df[df.duplicated()==True]

[101]:

	Title	Date	Affiliations	Description	Response	Victims	Sponsor	Type	Category	Sources_1	Sources_2	Sources_3

Display all rows where duplicate values occur in the “Description” column:

[102]:

df.loc[df.duplicated(subset=['Description'], keep=False)]

[102]:

	Title	Date	Affiliations	Description	Response	Victims	Sponsor	Type	Category	Sources_1	Sources_2	Sources_3
3	Targeting of the Belgian Federal Public Servic...	2022-07-18	Zirconium	Chinese threat actors targeted the Belgian Fed...	Denouncement https://diplomatie.belgium.be/e...	Belgian Federal Public Service Interior	China	Espionage	Government	https://diplomatie.belgium.be/en/news/declarat...	https://www.bleepingcomputer.com/news/security...	NaN
767	Targeting of the Belgian Federal Public Servic...	NaT	APT 30	Chinese threat actors targeted the Belgian Fed...	Denouncement https://diplomatie.belgium.be/e...	Belgian Federal Public Service Interior	China	Espionage	Government	https://diplomatie.belgium.be/en/news/declarat...	https://www.bleepingcomputer.com/news/security...	NaN

Boolean Indexing

Boolean indexing allows us to filter rows based on one or more conditions.

For example, we can select operations conducted by a specific sponsor within a given date range.

[103]:

# Define date range for 2022-2023
start_date = pd.Timestamp('2022-01-01')
end_date = pd.Timestamp('2023-12-31')

# Boolean condition
boolean_condition = (df['Sponsor'] == 'Belarus') & (df['Date'] >= start_date) & (df['Date'] <= end_date)

# Filtered DataFrame
df_filtered = df[boolean_condition]

df_filtered.shape[0] # Number of rows in the filtered DataFrame

[103]:

Aggregation

Aggregation allows us to summarize data by groups, which is useful to identify patterns and trends in cyber operations.

For example, we can count the number of cyber operations conducted by each sponsor

[104]:

# Count the number of cyber operations per sponsor
df.groupby("Sponsor", as_index=False).size().sort_values(by='size', ascending=False)

[104]:

	Sponsor	size
6	China	263
37	Russian Federation	155
20	Iran (Islamic Republic of)	98
26	Korea (Democratic People's Republic of)	79
53	United States	12
...	...	...
25	Kazakhstan	1
28	Lebanon	1
31	Myanmar	1
32	Netherlands	1
30	Morocco	1

61 rows × 2 columns

Explanation:

groupby("Sponsor") groups the data by the Sponsor column.
.size() counts the number of rows in each group, representing the number of operations.
.sort_values(by='size', ascending=False) sorts the sponsors from most to least active.

Sum the number of victims per sponsor to identify which sponsors have caused the greatest impact:

[105]:

# Sum the number of victims per sponsor
df.groupby("Sponsor", as_index=False)["Victims"].sum().sort_values(by="Victims", ascending=False)

[105]:

	Sponsor	Victims
55	United States, Australia, Canada, New Zealand,...	Yandex
27	Korea (Republic of)	World Health OrganizationNorth Koreans and Nor...
59	Uzbekistan	Uzbek activists and dissidents
28	Lebanon	United States, Canada, Germany, Lebanon, France
3	Belarus	Ukrainian state entitiesOver seventy Ukrainian...
...	...	...
21	Israel	An activist in IsraelMayors and other local of...
52	United Kingdom, United States	Algeria, Afghanistan, Belgium, Brazil, Fiji, G...
20	Iran (Islamic Republic of)	Albanian government networksVMware Horizon ser...
39	Saudi Arabia, United Arab Emirates	Al Jazeera employees
0	United Kingdom"	Israel

61 rows × 2 columns

Individual Variables Exploration

Display the number of records per year by extracting the year component from the Date column and counting its occurrences, sorted in chronological order:

[106]:

df['Date'].dt.year.value_counts().sort_index()

[106]:

Date
2005.0      1
2006.0      3
2007.0     10
2008.0      7
2009.0      4
2010.0      9
2011.0     15
2012.0     12
2013.0     18
2014.0     32
2015.0     34
2016.0     33
2017.0     46
2018.0     80
2019.0     74
2020.0    120
2021.0    108
2022.0    147
Name: count, dtype: int64

Visualize the trend of cyber operations over time by plotting the number of recorded cyber events per year:

[107]:

# Extract number of cyber operations per year
cyber_per_year = df['Date'].dt.year.value_counts().sort_index()

# Set plot style
sns.set(style="whitegrid")

# Create the plot
plt.figure(figsize=(10,6))
sns.lineplot(x=cyber_per_year.index.astype(int), y=cyber_per_year.values, marker="o")
plt.title("Number of Cyber Events per Year", fontsize=16)
plt.xlabel("Year", fontsize=12)
plt.ylabel("Cyber Operations", fontsize=12)
plt.xticks(cyber_per_year.index.astype(int))  # Ensure x-axis shows integer ticks
plt.grid(True)
plt.show()

../../_images/tutorial_notebooks_discover_visualize_gain_insights_discover_visualize_gain_insights_49_0.svg

Display the 15 most frequent affiliations involved in cyber operations, excluding entries labeled as Unknown, to identify the primary actors appearing in the dataset:

[108]:

aff_counts = df[df.Affiliations!='Unknown'].Affiliations.value_counts()
aff_counts.nlargest(15)

[108]:

Affiliations
Lazarus Group                                     26
APT 28                                            15
Believed to be the work of APT 28                 15
Sandworm                                          15
Winnti Umbrella                                   10
Mustang Panda                                     10
Gamaredon                                         10
Ocean Lotus                                        9
MuddyWater                                         9
Kimsuky                                            9
Charming Kitten                                    8
Believed to be the work of?APT 28                  7
Turla                                              7
Believed to be the work of Russia's government     6
APT 10                                             6
Name: count, dtype: int64

Visualize the top 15 affiliations involved in cyber operations using a bar chart:

[109]:

top_10_aff = aff_counts.nlargest(15)
plt.figure()
ax = sns.barplot(x=top_10_aff.values, y=top_10_aff.index)
ax.bar_label(ax.containers[0])
ax.set_title('Top 10 Affiliations')
plt.ylabel('Affiliation')
plt.show()

../../_images/tutorial_notebooks_discover_visualize_gain_insights_discover_visualize_gain_insights_53_0.svg

Display the top 15 state sponsors associated with cyber operations in the dataset:

[110]:

top_15_sponsors = df[df.Sponsor!='Unknown'].Sponsor.value_counts().nlargest(15)
plt.figure()
ax = sns.barplot(x=top_15_sponsors.values, y=top_15_sponsors.index)
ax.bar_label(ax.containers[0])
ax.set_title('Top 15 Sponsors')
plt.ylabel('Sponsor')
plt.show()

../../_images/tutorial_notebooks_discover_visualize_gain_insights_discover_visualize_gain_insights_55_0.svg

Exercises

Answer the following Questions about the Dataset

Exercise 1: Distribution of Operation Types

Investigate the distribution of Type of state-sponsored cyber operations in the dataset.

[111]:

# TODO: Step 1 - Remove invalid or unknown Type entries
# Hint: Filter out invalid rows

[112]:

# TODO: Step 2 - Create the countplot
# Hint: Use sns.countplot() with the `order` parameter set to value_counts().index

[113]:

# TODO: Step 3 - Annotate counts above each bar
# Hint: Use ax.bar_label(ax.containers[0])

[114]:

# TODO: Step 4 - Set title and rotate x-axis labels
# show the plot

Exercise 2: Top 10 Affiliations

Some affiliations refer to the same group in different ways (e.g., “APT 28”). Normalize all variations containing “APT 28” into a single value, then display the top 10 affiliations using a barplot.

[115]:

# TODO: Step 1 - Normalize "APT 28" variations
# Hint: Some entries refer to the same group differently, e.g.,
# 'APT 28', 'Believed to be the work of APT 28', 'Believed to be the work of?APT 28'

[116]:

# TODO: Step 2 - Filter out 'Unknown' affiliations

[117]:

# TODO: Step 3 - Count affiliations and get top 10

[118]:

# TODO: Step 4 - Create a barplot of the top 10 affiliations

Exercise 3: APT 28 Cyber Attacks Over Time

Analyze the activity of APT 28 by plotting the number of cyber operations attributed to this group over the years 2010–2020.

[119]:

# TODO: Step 1 - Filter the dataset for all rows where 'Affiliations' contains "APT 28"
# Hint: Use str.contains() with case=False and na=False

[120]:

# TODO: Step 2 - Extract the year from the 'Date' column
# Hint: Use .dt.year

[121]:

# TODO: Step 3 - Sort the counts by year

[122]:

# TODO: Step 4 - Create a line plot showing the number of attacks per year

Conclusion

In this tutorial, we learned how to load, clean, and visualize data using Python. Through Exploratory Data Analysis (EDA), we explored trends in cyber operations over time, identified the most active affiliations, sponsors, and targeted sectors, and handled inconsistent or missing data. Using plots and charts, we visualized insights from the dataset, providing a solid foundation for more advanced analyses in machine learning and cybersecurity.

If you found this tutorial helpful, please ⭐ star our repository to show your support.

For any questions, typos, or bugs, kindly open an issue on GitHub — we appreciate your feedback!