Election Outlier Detection Report

Introduction

This project focuses on identifying potential voting irregularities and ensuring the transparency of the election results in Plateau State. As a resident and voter in Plateau State, I chose this region to leverage my familiarity with the area and provide a more informed analysis. This report documents the methodology, findings, and key insights from the outlier detection analysis conducted on the election data.

Objectives

Dataset Preparation: Ensuring that the dataset includes longitude and latitude values for each polling unit.
Neighbor Identification: Identifying neighboring polling units based on geographical proximity, using a defined radius to determine which units count as neighbors.
Outlier Score Calculation: Comparing the votes each party received with those of its neighboring units and calculating an outlier score for each party based on the deviation of votes.
Sorting and Reporting: Sorting the dataset by outlier scores for each party to identify the most significant outliers and providing a detailed report explaining the methodology and findings.

Dataset Preparation

Data Collection:
- The dataset for Plateau State was obtained from the provided Google Drive folder, specifically from the file named "Plateau_crosschecked.csv".
- The initial dataset included the following columns: [State, LGA, Ward, PU-Code, PU-Name, Accredited_Voters, Registered_Voters, Results_Found, Transcription_Count, Result_Sheet_Stamped, Result_Sheet_Corrected, Result_Sheet_Invalid, Result_Sheet_Unclear, Result_Sheet_Unsigned, APC, LP, PDP, NNPP, Results_File].
Adding Geospatial Data:
- To conduct geospatial analysis, longitude and latitude values were needed for every polling unit in the dataset, which had 4328 rows counting the header row.
- The "Geolocation by Awesome Table" extension in Google Sheets was used to generate longitude and latitude values for each polling unit. This extension could only process around 970 entries at a time, so the dataset was split into five parts to complete the geolocation process.
- After using the extension, 93 entries were still missing geolocation values. These missing values were manually retrieved from the INEC Polling Unit Locator to ensure accuracy.
Data Verification and Cleaning:
- The dataset was sorted by latitude values to identify any anomalies. 17 entries were found to have incorrect geolocation values.
- These values were cross-checked and corrected using the INEC Polling Unit Locator.
- Two polling units were removed:
  - LGEA PRIMARY SCHOOL, DUNGKUK: Could not be found and had zero vote counts for all parties, rendering it inconsequential.
  - MAMPYEM PRI SCH: Appeared twice, with the second entry having a PU-Code that did not match the INEC database.
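
The verification step above was done by sorting on latitude and inspecting the results by hand. As a complementary, minimal sketch, the same kind of anomaly could be flagged with a simple range check. The function name `flag_suspect_coordinates`, the row layout, and the bounds used here are hypothetical; the bounds are loose placeholders and would need to be checked against an authoritative boundary for Plateau State.

```python
def flag_suspect_coordinates(rows, lat_range, lon_range):
    """Return the rows whose coordinates are missing or fall outside the given ranges."""
    suspects = []
    for row in rows:
        lat, lon = row.get("Latitude"), row.get("Longitude")
        missing = lat is None or lon is None
        # Short-circuit on missing values so None is never compared to a number.
        if (missing
                or not lat_range[0] <= lat <= lat_range[1]
                or not lon_range[0] <= lon <= lon_range[1]):
            suspects.append(row)
    return suspects


# Toy rows, made up for illustration (not taken from the dataset).
rows = [
    {"PU-Name": "Hypothetical PU A", "Latitude": 9.90, "Longitude": 8.89},
    {"PU-Name": "Hypothetical PU B", "Latitude": 6.45, "Longitude": 3.39},
    {"PU-Name": "Hypothetical PU C", "Latitude": None, "Longitude": None},
]

# Placeholder bounds (assumption): a loose box, not an official state boundary.
suspects = flag_suspect_coordinates(rows, lat_range=(8.0, 11.0), lon_range=(8.0, 11.0))
suspect_names = [row["PU-Name"] for row in suspects]
```

Rows flagged this way would still need to be cross-checked manually, as was done with the INEC Polling Unit Locator.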
Final Dataset Preparation:
- The cleaned dataset used for analysis included the following columns: [PU-Code, PU-Name, Full Address, Latitude, Longitude, APC, LP, PDP, NNPP] and 4325 rows (excluding headers).
- Each polling unit now had accurate geolocation data, verified and corrected where necessary, ensuring a reliable basis for subsequent analysis.

Neighbor Identification

The goal of identifying neighboring polling units is to determine which units are geographically close to each other. This allows us to compare voting patterns and detect any significant deviations that might indicate irregularities or influences.

Geospatial Techniques Used

Geospatial analysis here means calculating the distances between polling units to identify neighbors. For this, I used the Haversine formula, which gives the great-circle distance between two points on a sphere from their latitudes and longitudes and is therefore suitable for distances on the Earth's surface.
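
For illustration, the Haversine formula can be written as a short function. This is a minimal sketch rather than the code used in the project; the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; not stated in the report


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    # Convert degrees to radians, since the trigonometric functions expect radians.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude along a meridian is roughly 111 km.
one_degree_km = haversine_km(0.0, 0.0, 1.0, 0.0)
```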

Defining a Radius

A radius of 1 km was chosen to define neighboring polling units. This radius is large enough to capture nearby units, yet small enough to exclude units that are unlikely to share the same local voting pattern. It was treated as a reasonable distance for comparing voting patterns in both urban and rural areas.

Calculations
- Conversion to radians: Latitude and longitude values were converted to radians. This conversion is necessary because the Haversine distance calculation operates on radian values.
- BallTree Algorithm: The BallTree algorithm from the scikit-learn library was employed to find all neighboring polling units within the 1 km radius. BallTree is well suited to this purpose because it indexes the points in a tree, so a radius search does not have to compare every pair of polling units.

Code Snippet to Identify Neighboring PUs
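
The original snippet is not reproduced in this report. The following is a minimal sketch of how the steps described above could be implemented with scikit-learn's `BallTree`. The function name `find_neighbors`, the Earth radius constant, and the toy coordinates are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius
RADIUS_KM = 1.0           # neighbor radius stated in the report


def find_neighbors(latitudes, longitudes, radius_km=RADIUS_KM):
    """Return, for each polling unit, the indices of the other units within radius_km."""
    # The haversine metric expects [latitude, longitude] pairs in radians.
    coords = np.radians(np.column_stack([latitudes, longitudes]))
    tree = BallTree(coords, metric="haversine")

    # On the unit sphere the search radius is an angle: distance / Earth radius.
    indices = tree.query_radius(coords, r=radius_km / EARTH_RADIUS_KM)

    # Each unit is returned as its own neighbor (distance 0), so drop it.
    return [idx[idx != i] for i, idx in enumerate(indices)]


# Toy coordinates, made up for illustration (not taken from the dataset).
lats = [9.9000, 9.9050, 9.9500]
lons = [8.8900, 8.8900, 8.8900]
neighbors = find_neighbors(lats, lons)
```

In this toy input the first two points are about 0.56 km apart and so are neighbors of each other, while the third point is several kilometers away and has none.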

Outlier Score Calculation

Comparing the votes of a polling unit with those of its neighbors allows for the detection of anomalies. A significant deviation suggests that the polling unit's results are inconsistent with the local voting pattern, which may indicate irregularities and warrants closer review.

Step 1: Initialize outlier score columns for each party.
Step 2: For each polling unit, identify its neighbors.
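
The two steps above can be sketched as follows. The exact scoring formula is not given in this section, so the score used here (the absolute difference between a unit's votes and the mean of its neighbors' votes) is an assumption, as are the function name `outlier_scores`, the input layout, and the choice to leave units without neighbors at a score of 0.

```python
import numpy as np

PARTIES = ["APC", "LP", "PDP", "NNPP"]  # party columns listed in the report


def outlier_scores(votes, neighbors):
    """Compute a per-party outlier score for every polling unit.

    votes: dict mapping party -> list of vote counts, one per polling unit.
    neighbors: list where neighbors[i] holds the indices of unit i's neighbors.
    """
    n_units = len(neighbors)
    party_votes = {p: np.asarray(votes[p], dtype=float) for p in PARTIES}

    # Step 1: initialize an outlier score column for each party.
    scores = {p: np.zeros(n_units) for p in PARTIES}

    for i in range(n_units):
        # Step 2: look up the neighbors of polling unit i.
        nbrs = list(neighbors[i])
        if not nbrs:
            continue  # assumption: a unit with no neighbors keeps a score of 0
        for p in PARTIES:
            # Assumed score: absolute deviation from the neighbors' mean votes.
            scores[p][i] = abs(party_votes[p][i] - party_votes[p][nbrs].mean())
    return scores


# Toy data, made up for illustration (not taken from the dataset).
votes = {"APC": [10, 12, 300], "LP": [50, 48, 52], "PDP": [5, 7, 6], "NNPP": [0, 1, 0]}
neighbors = [[1, 2], [0, 2], [0, 1]]
scores = outlier_scores(votes, neighbors)

# Rank units by APC score, highest first, to surface the strongest outliers.
apc_ranking = np.argsort(-scores["APC"]).tolist()
```

Sorting each party's scores in descending order, as in the last line, corresponds to the sorting and reporting objective stated earlier.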
