The Global Database of Events, Language and Tone (GDELT) is a project aimed to catalogue events and their reception in an exhaustive and unbiased fashion. To this end, coverage by media worldwide is classified with respect to involved parties, the nature of the event, location and the tone of the coverage itself. In total 58 features are used to characterized an event.
I’m really excited about the shear mass of information logged by GDELT: The database covers approximately a quarter billion events from 1979 to 2013. While this information content makes GDELT a great potential resource to answer political, economical and social questions, it shares the burden of all big data: How can we efficiently store the data, but make it fast and easily accessable?
Here I describe my MySQL implementation, designed for efficient database access without loosing too much of the intuitive feel for the GDELT data. While the scale of the data may suggest a NoSQL solution, a MySQL database may allow easier interaction with the data for non-programmers. In the first section I describe the design process. In the second section I introduce java code to easily populate the database with data. In the third section I will test the database by trying to retrieve some useful information. Specifically I will ask i) how does the number of events change over time and ii) is the feature QuadClass a good indicator for conflict?
The goal of the database implementation is to provide easy access to the data by allowing fast responses to intuitive database queries.
The data provided by GDELT is formatted as one very large table (relation) with one simple constraint: every field contains only one value. Thus, the input data satisfies first normal form (1NF). Importantly, the data is highly redundant. Redundancy is problematic for at least two reasons: It makes the database susceptible for anomalies and it significantly increases the database footprint in terms of storage and accessibility. While anomalies will play a minor role in the GDELT database (no data modification after insertion is expected), the storage and latency effects of redundant data will be large.
To eliminate these problems I want to transform the database to satisfy third normal form (3NF). While this can be done in an automated fashion I want to use this transformation to maximize the intuitive feeling for the data experienced by the user. To this end, I divided the GDELT fields into 6 subtables: Date, Actor, Event, Media, Geo and Source. All subtables except Source are further fragmented into subsubtables eliminating redundancy while keeping usability. The primary keys of subtable are foreign keys in the parent table enforcing database integrity. While the vast majority of tables satisfy 3NF, I deliberately decided to keep some redundancy to maximize user friendliness. For the database-design I used MySQL Workbench. Figure 1 visualizes the design and the design file can be downloaded here.
Figure 1: GDELT MySQL Database design. Boxes indicate tables. Primary key dependencies are visualized by arrows between tables. Tables are grouped by color coding.
The design removes redundancy and makes searches significantly faster: Take for instance Actors. In the original data each row contains all information for both actors involved in an event. In the new design we store unique sets of actor features in the Actor table and reference these entries in the main table. The resulting Actor table is significantly smaller than the main table and can be searched in milliseconds. Events for identified actors can then be quickly retrieved from the main table taking advantage of foreign keys stored in B-trees.
Having decided on a database design, we can now initiate the database and populate it with data. Initiation is very simple: MySQL Workbench allows forward engineering from the design to MySQL code to create all relations (download here). Pasting this code into the MySQL command line will initiate the relations.
Populating the database is a little bit more involved, mainly due to interconnected relations. To enable simple population, I wrote code to load the data taking advantage of the recursive nature of the data insertion process:
IF no subtable exist
store data in hashmap
Split the incoming data into subtable sets
FOR each subtable entrySet
IF exact entrySet exist
This function is implemented in the java class Table.java. All tables extend this class, implementing the recursion specific for the respective data and table subsets. To abolish tries to insert redundant data, inserted sets are stored in hashes.
Population of the database occurs in two steps: The first step loads tables which are not expected to change a lot. The majority of these static tables are provided by the GDELT project. I had to remove some duplications and added another table for the QuadClass. All tables can be downloaded here. These tables are loaded into the database via the class DatabaseInitiation.java. This function also loads an example set into the database, to test if everything runs smoothly. To initiate the database export DatabaseInitiation.java as an executable jar and run it from the command line:
java -jar databaseInitiation.jar 172.29.13.226:3306 myUser myPassword /PATH/TO/staticTables/
Substitute 172.29.13.226:3306 with your MySQL server IP:Port, myUser and myPassword with your user credentials and /PATH/TO/staticTables/ with the directory you stored the static tables in.
The second step, the actual data propagation is performed by DataImport.java. To load data into the database, export this class as an executable jar and run it from the command line:
java -jar -Xms2g -Xmx4g dataImport.jar 172.29.13.226:3306 myUser myPassword 20131107.export.CSV
Substitute 172.29.13.226:3306 with your MySQL server IP:Port, myUser and myPassword with your user credentials and 20131107.export.CSV with the GDELT data file you want to load.
The main disadvantage of the MySQL implementation lays in the database population: For every line of data the database checks for each subtable if it already contains the respective entries. I limit actual MySQL-queries by hashing unique data sets at each subtable level. Advantages of the normalized MySQL implementation are the small resulting footprint of 13 GB (dump w/o indices) and its fast query results (see below). Unfortunately, I don’t have space to host a database dump here, but I’m happy to upload it to a server. Just drop me a line.
Testing the database
Mean number of events per day
Now that the database is populated, let us test its performance by finding out some general features of the data. Let’s start with reporting the number of events for each day:
CREATE TABLE EventCountPerDate
( SELECT Date.*, SQLDate.SQLDate
FROM Date INNER JOIN SQLDate ON Date.SQLDateId=SQLDate.SQLDateId
( SELECT DateId, count(*) "Count"
FROM GdeltMain GROUP BY DateId
The first subtable d holds the mapping of DateId and Date information, the second subtable the mapping between DateId and EventCount. Joining of the subtables d and c on DateId results in a table holding the date information together with the number of events reported. Considering that the entire database had to be traversed the result was retrieved fairly quickly (2 min 33.73 sec). We can now use the newly created table EventCountPerDate to retrieve summary statistics for every year and save it to a file:
SELECT Year "year", AVG( count ) "mean", STDDEV( count ) "sd", MIN( count ) "min", MAX( count) "max"
FROM EventCountPerDate GROUP BY Year
INTO OUTFILE ‘/tmp/eventsPerYear.txt';
Visualization (script) of the data shows that reported events increase exponentially from ~1,000 events per day in 1980 to almost 100,000 events per day in the 2010s (Figure 2, note the logarithmic y-axis).
Figure 2: Number of events per day (Log10) plotted versus year. Mean (red line) and standard deviation (black dashed line) are shown.
QuadClass, an indicator for conflict?
For a little bit more sophisticated analysis, let us concentrate on the feature QuadClass (see GDELT data format). QuadClasses are used to classify events into four groups: verbal and material cooperation (1 and 2) as well as verbal and material conflict (3 and 4). In theory, events with certain QuadClasses should be enriched during wars (i.e. material conflict) and others during peace periods (i.e. material cooperation). To test this hypothesis, I used the Global Peace Index to identify the most and less peaceful countries within the last 10 years. According to this index, the most peaceful country is Iceland, the less peaceful one Somalia. If GDELT's QuadClass classification is an unbiased estimator of conflict, there should be a significant difference in QuadClass counts between these two countries. Using the MySQL database we can easily retrieve the number of events in each year for each QuadClass.
The MySQL queries below show how to perform this analysis using Iceland as an example. Let us first retrieve its CountryId:
SELECT CountryId FROM Country WHERE Label = “Iceland";
We can now create a new relation which reports DateId and QuadClass for all events in which Iceland (CountryId = 122) acted as Actor1:
CREATE TABLE tempISL1
SELECT DateId,QuadClassId FROM
( SELECT DateId, EventId FROM ( SELECT ActorId FROM Actor WHERE Actor.CountryId = 122 ) actor
( SELECT DateId, EventId, Actor1Id FROM GdeltMain ) main
ON actor.ActorId = main.Actor1Id ) de
( SELECT EventId, QuadClassId FROM Event ) ev
ON de.EventId = ev.EventId;
We can then summarize events by QuadClass and year and write the result to disk:
SELECT YearQuadClass, sum(Count)
FROM ( SELECT YEAR,Count,QuadClassId,CONCAT_WS(',', Year, QuadClassId) "YearQuadClass"
( SELECT DateId, count(*) "Count", QuadClassId FROM tempISL1 GROUP BY DateId ) a
( SELECT * FROM Date ) b
ON a.DateId = b.DateId ) a
GROUP BY YearQuadClass
INTO OUTFILE ‘/tmp/isl1.txt’;
In order not to bias the analysis towards aggressors or defenders the analogous analysis if performed for Actor2 and both datasets combined. The same procedure is followed for Somalia (CountryId 222) and QuadClass per year visualized (script for all remaining plots) for both countries (see Figure 3).
Figure 3A: Iceland, event count per QuadClass (Log10) versus year.
Figure 3B: Somalia, event count per QuadClass (Log10) versus year.
The difference in i) total number of events per year and ii) their distinct change over time makes a direct comparison difficult. Both problems can be taken care of by normalizing QuadClass counts by total number of events for each year and plotting the fraction of total events for each QuadClass (Figure 4).
Figure 4A: Iceland, fractions of events per QuadClass (Log10) per year.
Figure 4B: Somalia, fractions of events per QuadClass (Log10) per year.
Obviously one QuadClass, "Verbal Cooperation", dominates, characterizing almost 100% of all events. Interestingly, the other three QuadClasses have not only similar values, but are also highly correlated. While this behavior could be expected for the related QuadClasses "Verbal and Material Conflict", the correlation with "Material Cooperation" is very surprising. A simple, but unrewarding explanation would be that the three minor QuadClasses represent background noise and observed changes are the result of fluctuations in the major QuadClass "Verbal Cooperation". In agreement, minor QuadClasses of all countries I checked follow a similar pattern: approximately constant values until 2005, followed by a decrease. This behavior is anti-correlated with a sudden increase in total events (see Figure 1). This anti-correlation could suggest that the decrease in minor QuadClass could be explained by an increase in total event coverage, which would give more weight to less spectacular news.
Despite this worrying correlation, can we still use QuadClass as an indicator of conflict? To boost sensitivity, let us collapse both subclasses of conflict and cooperation into one. Figure 5 shows how QuadClasses representing conflict and cooperation vary over time.
Figure 5A: Iceland, fractions of events per simplified QuadClass (Log10) per year.
Figure 5B: Somalia, fractions of events per simplified QuadClass (Log10) per year.
While the fraction of conflict related events is visually similar (due to the Log-scale), they are significantly enriched in Somalia compared to Iceland (0.131 vs. 0.068). Assuming that the World Peace Index and GDELT agree on the identity of the most extreme countries in respect to conflict, this 2-fold difference represents the expected dynamic range. A low signal-to-noise ratio () makes the observed dynamic range too small to use the feature QuadClass as an independent, robust indicator for conflict.
The MySQL implementation described here provides an efficient and fast interface to interact with the GDELT data. Its disadvantages lay in the slow population of the database and the general problem of relational databases that the design anticipates future uses. These disadvantages would be abolished in using a non-relational database like Hadoop, but would complicate the interaction with the data.
A quick analysis of the GDELT data shows that it is indeed a large database of events with potential importance. Robust analysis of the data will be influenced by significant biases and will have to be taken care of with careful normalization. Due to the small dynamic range and small signal-to-noise ratio, the QuadClass feature alone does not seem to be a robust indicator for conflict.