{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Spark GraphFrames and Sysmon\n",
"----------------------------------------------------------------------------\n",
"## Goals:\n",
"* Get familiar with the basics of Spark GraphFrames\n",
"* Confirm Jupyter can import the GraphFrames library\n",
"* Confirm Spark can read data from Elasticsearch\n",
"* Learn to create a GraphFrame from the Sysmon index\n",
"* Learn the basics of GraphFrames motifs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is GraphFrames?\n",
"GraphFrames is a package for Apache Spark that provides DataFrame-based graphs. \n",
"* It provides high-level APIs in Scala, Java, and Python. \n",
"* It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames.\n",
"* This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What is a graph?\n",
"GraphFrames represent graphs: \n",
"* Vertices (e.g., users)\n",
"* Edges (e.g., relationships between users)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import SparkSession Class"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import SparkSession"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a SparkSession instance\n",
"* Define a **spark** variable\n",
"* Pass values to the **appName** and **master** functions\n",
" * For the master function, we are going to use HELK's Spark Master container (helk-spark-master)\n",
"* This time, add the **config()** function to set the Elasticsearch connection information needed to read from it"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[**config(key=None, value=None, conf=None)**](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession.Builder.config)\n",
"* Sets a config option.\n",
"* Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"spark = SparkSession.builder \\\n",
" .appName(\"HELK Graphs\") \\\n",
" .master(\"spark://helk-spark-master:7077\") \\\n",
" .config(\"es.read.field.as.array.include\", \"tags\") \\\n",
" .config(\"es.nodes\",\"helk-elasticsearch:9200\") \\\n",
" .config(\"es.net.http.auth.user\",\"elastic\") \\\n",
" .config(\"es.net.http.auth.pass\",\"elasticpassword\") \\\n",
" .enableHiveSupport() \\\n",
" .getOrCreate()\n",
" # PLEASE REMEMBER!\n",
" # If you are using an Elastic TRIAL license, you need the es.net.http.auth.pass value\n",
" # If you are using an Elastic BASIC license, you can remove the es.net.http.auth.pass value"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check the SparkSession variable"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <div>\n",
" <p><b>SparkSession - hive</b></p>\n",
" \n",
" <div>\n",
" <p><b>SparkContext</b></p>\n",
"\n",
" <p><a href=\"http://403892d82956:4041\">Spark UI</a></p>\n",
"\n",
" <dl>\n",
" <dt>Version</dt>\n",
" <dd><code>v2.4.0</code></dd>\n",
" <dt>Master</dt>\n",
" <dd><code>spark://helk-spark-master:7077</code></dd>\n",
" <dt>AppName</dt>\n",
" <dd><code>HELK Graphs</code></dd>\n",
" </dl>\n",
" </div>\n",
" \n",
" </div>\n",
" "
],
"text/plain": [
"<pyspark.sql.session.SparkSession at 0x7f23dc07e898>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import GraphFrames & SQL Functions"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from graphframes import *\n",
"from pyspark.sql.functions import *"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---+--------+\n",
"| id|inDegree|\n",
"+---+--------+\n",
"| c| 1|\n",
"| b| 2|\n",
"+---+--------+\n",
"\n",
"CPU times: user 492 ms, sys: 1.2 s, total: 1.69 s\n",
"Wall time: 53.2 s\n"
]
}
],
"source": [
"%%time\n",
"# Create a Vertex DataFrame with a unique ID column \"id\"\n",
"v = spark.createDataFrame([\n",
" (\"a\", \"Alice\", 34),\n",
" (\"b\", \"Bob\", 36),\n",
" (\"c\", \"Charlie\", 30),\n",
"], [\"id\", \"name\", \"age\"])\n",
"# Create an Edge DataFrame with \"src\" and \"dst\" columns\n",
"e = spark.createDataFrame([\n",
" (\"a\", \"b\", \"friend\"),\n",
" (\"b\", \"c\", \"follow\"),\n",
" (\"c\", \"b\", \"follow\"),\n",
"], [\"src\", \"dst\", \"relationship\"])\n",
"# Create a GraphFrame\n",
"from graphframes import *\n",
"g = GraphFrame(v, e)\n",
"\n",
"# Query: Get in-degree of each vertex.\n",
"g.inDegrees.show()\n",
"\n",
"# Query: Count the number of \"follow\" connections in the graph.\n",
"g.edges.filter(\"relationship = 'follow'\").count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read data from HELK's Elasticsearch via Spark SQL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the DataFrame API to access an Elasticsearch index (the Sysmon index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.\n",
"* Elasticsearch becomes a native source for Spark SQL so that data can be indexed and queried from Spark SQL transparently.\n",
"* Spark SQL works with structured data - in other words, all entries are expected to have the same structure (same number of fields, of the same type and name).\n",
"* Using unstructured data (documents with different structures) is not supported and will cause problems.\n",
"* Through the **org.elasticsearch.spark.sql** package, esDF methods are available on the SQLContext API.\n",
"\n",
"Reference: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"es_reader = (spark\n",
" .read\n",
" .format(\"org.elasticsearch.spark.sql\")\n",
" .option(\"inferSchema\", \"true\")\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[**load(path=None, format=None, schema=None, \\*\\*options)**](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.load)\n",
"* Loads data from a data source and returns it as a `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 0 ns, sys: 4 ms, total: 4 ms\n",
"Wall time: 2.2 s\n"
]
}
],
"source": [
"%%time\n",
"sysmon_df = es_reader.load(\"logs-endpoint-winevent-sysmon-*/doc\")"
]
},
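{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, you can print the schema Spark inferred from the index. This is a minimal sketch; the exact field names available depend on how your Sysmon events were shipped and parsed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: inspect the schema Spark inferred from the Elasticsearch index.\n",
"# The available fields depend on your Sysmon/Logstash pipeline.\n",
"sysmon_df.printSchema()"
]
},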
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ProcessCreate & Motifs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create the Vertices DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to rename the **process_guid** column to **id** because that's the column name GraphFrames uses for vertex IDs.\n",
"\n",
"[**withColumn(colName, col)**](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn)\n",
"* Returns a new DataFrame by adding a column or replacing the existing column that has the same name.\n",
"* The column expression must be an expression over this DataFrame; attempting to add a column from some other DataFrame will raise an error."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"vertices = (sysmon_df.withColumn(\"id\", sysmon_df.process_guid)\n",
" .select(\"id\",\"user_name\",\"host_name\",\"process_parent_name\",\"process_name\",\"action\")\n",
" )\n",
"vertices = vertices.filter(vertices.action == \"processcreate\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------------------------+---------------+-------------------------------+-------------------+-------------+-------------+\n",
"|id |user_name |host_name |process_parent_name|process_name |action |\n",
"+------------------------------------+---------------+-------------------------------+-------------------+-------------+-------------+\n",
"|1C9FDC81-9806-5C6F-0000-00100CDDD000|system |DESKTOP-LFD11QP.RIVENDELL.local|cmd.exe |conhost.exe |processcreate|\n",
"|1C9FDC81-9806-5C6F-0000-001051DAD000|system |DESKTOP-LFD11QP.RIVENDELL.local|svchost.exe |taskhostw.exe|processcreate|\n",
"|1C9FDC81-9807-5C6F-0000-00100EEED000|system |DESKTOP-LFD11QP.RIVENDELL.local|svchost.exe |wsqmcons.exe |processcreate|\n",
"|1C9FDC81-9809-5C6F-0000-00100E28D100|network service|DESKTOP-LFD11QP.RIVENDELL.local|gpupdate.exe |conhost.exe |processcreate|\n",
"|1C9FDC81-980A-5C6F-0000-0010903BD100|cbrown |DESKTOP-LFD11QP.RIVENDELL.local|services.exe |svchost.exe |processcreate|\n",
"+------------------------------------+---------------+-------------------------------+-------------------+-------------+-------------+\n",
"only showing top 5 rows\n",
"\n",
"CPU times: user 0 ns, sys: 4 ms, total: 4 ms\n",
"Wall time: 2.55 s\n"
]
}
],
"source": [
"%%time\n",
"vertices.show(5,truncate=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create the Edges DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also rename **process_parent_guid** to **src** and **process_guid** to **dst**, since those are the column names GraphFrames expects for edge endpoints. This lets us look for the parent-child (spawned) relationship across our whole environment."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[**selectExpr(*expr)**](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.selectExpr)\n",
"* Projects a set of SQL expressions and returns a new DataFrame.\n",
"* This is a variant of select() that accepts SQL expressions.\n",
"* You can also combine selecting columns and renaming columns in a single step with selectExpr"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"edges = (sysmon_df\n",
" .filter(sysmon_df.action == \"processcreate\")\n",
" .selectExpr(\"process_parent_guid as src\",\"process_guid as dst\")\n",
" .withColumn(\"relationship\", lit(\"spawned\"))\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------------------------------+------------------------------------+------------+\n",
"|src |dst |relationship|\n",
"+------------------------------------+------------------------------------+------------+\n",
"|1C9FDC81-9806-5C6F-0000-001054D8D000|1C9FDC81-9806-5C6F-0000-00100CDDD000|spawned |\n",
"|1C9FDC81-84C9-5C6D-0000-001065210100|1C9FDC81-9806-5C6F-0000-001051DAD000|spawned |\n",
"|1C9FDC81-84C9-5C6D-0000-001065210100|1C9FDC81-9807-5C6F-0000-00100EEED000|spawned |\n",
"|1C9FDC81-9806-5C6F-0000-00102CEAD000|1C9FDC81-9809-5C6F-0000-00100E28D100|spawned |\n",
"|1C9FDC81-84C7-5C6D-0000-001025A90000|1C9FDC81-980A-5C6F-0000-0010903BD100|spawned |\n",
"+------------------------------------+------------------------------------+------------+\n",
"only showing top 5 rows\n",
"\n",
"CPU times: user 0 ns, sys: 4 ms, total: 4 ms\n",
"Wall time: 458 ms\n"
]
}
],
"source": [
"%%time\n",
"edges.show(5,truncate=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Graph (Vertices & Edges DataFrames)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"g = GraphFrame(vertices, edges)"
]
},
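{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick optional check, you can count the vertices and edges of the graph. GraphFrames exposes them back as the **vertices** and **edges** DataFrames; the numbers returned depend entirely on the data in your Sysmon index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: count processcreate vertices and spawned edges in the graph.\n",
"# Results depend on the data available in your Sysmon index.\n",
"print(g.vertices.count())\n",
"print(g.edges.count())"
]
},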
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Process A spawning Process B AND Process B spawning Process C"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 0 ns, sys: 4 ms, total: 4 ms\n",
"Wall time: 208 ms\n"
]
}
],
"source": [
"%%time\n",
"motifs = g.find(\"(a)-[]->(b);(b)-[]->(c)\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------------+------------+------------+------------+\n",
"|process_parent_name|process_name|process_name|process_name|\n",
"+-------------------+------------+------------+------------+\n",
"+-------------------+------------+------------+------------+\n",
"\n",
"CPU times: user 4 ms, sys: 4 ms, total: 8 ms\n",
"Wall time: 18.1 s\n"
]
}
],
"source": [
"%%time\n",
"(motifs\n",
" .select(\"a.process_parent_name\",\"a.process_name\",\"b.process_name\",\"c.process_name\")\n",
" .show(20,truncate=False)\n",
")"
]
},
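{
"cell_type": "markdown",
"metadata": {},
"source": [
"Motif results come back as a regular DataFrame, so you can filter them with normal DataFrame expressions. The sketch below keeps only chains where the middle process is **cmd.exe**; that process name is just an illustrative value, so adjust it to whatever you are hunting for in your own data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Filter the motif results like any other DataFrame.\n",
"# 'cmd.exe' is only an example value; replace it with a process you care about.\n",
"(motifs\n",
" .filter(\"b.process_name = 'cmd.exe'\")\n",
" .select(\"a.process_name\", \"b.process_name\", \"c.process_name\")\n",
" .show(10, truncate=False)\n",
")"
]
},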
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------------+-----+\n",
"|process_parent_name|count|\n",
"+-------------------+-----+\n",
"+-------------------+-----+\n",
"\n",
"CPU times: user 12 ms, sys: 0 ns, total: 12 ms\n",
"Wall time: 12.9 s\n"
]
}
],
"source": [
"%%time\n",
"motifs.groupby('a.process_parent_name').count().sort('count').show(10)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "PySpark_Python3",
"language": "python",
"name": "pyspark3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}