Joshua Gladwell

Data Cleaning

Cleaning the Databases

Using R, I did some light cleaning of the Mass Shooter Database and the School Shooting Database. The primary goal was to make the two databases compatible with each other, which I did mainly by linking them through a common case/incident key.

On reading in the incident data from the School Shooting Database, I immediately noticed that there were more records than incidents, even though the unit of analysis in this table is purportedly the shooting incident. One incident ID had two rows assigned to it:

Row 285
Incident_ID: 20210902CASAL
Sources: News source
Number_News: 10
Media_Attention: Regional
Reliability: 4
Date: 2021-09-02
Quarter: Fall
School: Santee High School
City: Los Angeles
State: CA
School_Level: High
Location: Football Field/Track
Location_Type: Outside on School Property
During_School: Yes
Time_Period: Afternoon Classes
First_Shot: 14:00:00
Summary: Two students shot during fight on football field
Narrative: Two teens were shot during a fight on the football field of the school during afternoon classes. The high school and a neighboring primary school were locked down while police searched for the shooter. A teenage victim was found on the football field and a second victim was found on the street near the school. Shooter fled the scene and was arrested 12 days later. Police are still searching for multiple co-conspirators.
Situation: Escalation of Dispute
Targets: Both
Accomplice: Yes
Hostages: No
Barricade: No
Officer_Involved: No
Bullied: No
Domestic_Violence: No
Gang_Related: null
Preplanned: No
Shots_Fired: null
Active_Shooter_FBI: No

Row 286
Incident_ID: 20210902CASAL
Sources: News source
Number_News: 20
Media_Attention: Regional
Reliability: 4
Date: 2021-09-02
Quarter: Fall
School: Santee High School
City: Los Angeles
State: CA
School_Level: High
Location: Outside on School Property
Location_Type: Outside on School Property
During_School: Yes
Time_Period: Afternoon Classes
First_Shot: 14:00:00
Summary: Fight between students escalated into shooting
Narrative: During a gang related fight between students on the parameter of the campus, the loser of the fight pulled out a handgun and shot the other student involved in the leg. School was locked down. Shooter fled and was arrested the following day. Injured student was transported to the hospital. Police said the fight was part of an on-going gang rivalry at the school.
Situation: Escalation of Dispute
Targets: Victims Targeted
Accomplice: Yes
Hostages: No
Barricade: No
Officer_Involved: No
Bullied: No
Domestic_Violence: No
Gang_Related: Yes
Preplanned: No
Shots_Fired: null
Active_Shooter_FBI: No

On further analysis, I found that the second row was likely based on a misinformed article. The news source cited in that row is no longer available, so it was evidently taken down. The story given in the first row, however, is corroborated by many news sources [leave link]. Given this information, I decided to drop the second row shown above.
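The duplicate check and the row drop described above can be sketched in a few lines. The cleaning itself was done in R, so this Python version is only an illustration of the idea, not the code that was actually run. The `incidents` list is a toy stand-in for the incident table: rows 285 and 286 reuse the incident ID discussed above, while row 0 and its `PLACEHOLDER` ID are invented for the example.

```python
from collections import Counter

# Toy stand-in for the incident table. Rows 285 and 286 reuse the ID discussed
# in the text; row 0 and its "PLACEHOLDER" ID are invented for illustration.
incidents = [
    {"row": 0, "Incident_ID": "PLACEHOLDER"},
    {"row": 285, "Incident_ID": "20210902CASAL"},
    {"row": 286, "Incident_ID": "20210902CASAL"},
]

# Count how often each incident ID occurs and keep the IDs seen more than once.
id_counts = Counter(rec["Incident_ID"] for rec in incidents)
duplicated_ids = [iid for iid, n in id_counts.items() if n > 1]

# After reviewing the duplicates by hand, drop the row judged unreliable.
rows_to_drop = {286}
cleaned = [rec for rec in incidents if rec["row"] not in rows_to_drop]

# Sanity check: every remaining row now has a unique incident ID.
assert len({rec["Incident_ID"] for rec in cleaned}) == len(cleaned)
```

Dropping by row index rather than by incident ID keeps the decision explicit: both rows share the same ID, so the ID alone cannot say which one to keep.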

The next major step in cleaning this data was to connect the Mass Shooter Database (MSDB) and the School Shooting Database (SSDB). I decided to follow the ID protocol established in the SSDB for two reasons: (1) the construction of each ID is very intuitive (a concatenation of the year, month, day, state abbreviation, first two letters of the school name, and first letter of the city name for a given incident), and (2) the SSDB turned out to contain far more school shooting incidents than the MSDB, so the SSDB key covers more cases. I matched the incident IDs by constructing partial IDs from the data given in the MSDB and then completing them by matching them against all the IDs in the SSDB.
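As a rough illustration of the ID scheme and the partial-ID matching, here is a minimal Python sketch (the actual matching was done in R). The function names `build_incident_id` and `complete_partial_id` are hypothetical, and so is the assumption that the MSDB supplies only the date and state for the partial ID. The second entry in `ssdb_ids` is an invented placeholder.

```python
from datetime import date

def build_incident_id(day, state, school, city):
    """Build an SSDB-style ID: YYYYMMDD + state abbreviation
    + first two letters of the school name + first letter of the city."""
    return f"{day:%Y%m%d}{state.upper()}{school[:2].upper()}{city[:1].upper()}"

def complete_partial_id(partial_id, known_ids):
    """Return every known ID that starts with the partial ID."""
    return [known for known in known_ids if known.startswith(partial_id)]

# The first ID comes from the text above; the second is an invented placeholder.
ssdb_ids = ["20210902CASAL", "19990101ZZZZZ"]

full_id = build_incident_id(date(2021, 9, 2), "CA", "Santee High School", "Los Angeles")

# Assumption: only the date and state are available on the MSDB side.
partial_id = f"{date(2021, 9, 2):%Y%m%d}CA"
candidates = complete_partial_id(partial_id, ssdb_ids)
```

In a sketch like this, exactly one candidate means the partial ID can be completed automatically, no candidates means the incident is absent from the SSDB, and several candidates call for manual review.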

The Mass Shooter Database sacrifices quantity of data for quality of data, and for that reason it has data on only 24 school shooters (out of 351 mass shooters, at schools or otherwise). Of those 24 school shooters in the MSDB, only 14 are found in the SSDB. Due to the MSDB's terms of use I cannot publicly show the table for these incidents, but I will post them on my private GitHub repository here. The reasons the remaining 10 incidents are absent from the SSDB will need to be investigated row by row, and this page will be updated when they are found.

Cleaning the Narrative Text

In the SSDB, each incident includes a "narrative," a passage of text describing the specifics of the incident. Using scikit-learn's CountVectorizer, I converted each incident's narrative into a count vector. I also included the corresponding level of news coverage, which I will treat as a label. The resulting count matrix can be found here.
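A minimal sketch of this vectorization step follows. The two strings are excerpts from the narrative shown earlier on this page, used here as stand-ins for full narratives. Taking the label from the Media_Attention column and using CountVectorizer's default settings are both assumptions; the parameters actually used are not stated here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two excerpts from the narrative shown earlier, standing in for full narratives.
narratives = [
    "Two teens were shot during a fight on the football field of the school "
    "during afternoon classes.",
    "Shooter fled the scene and was arrested 12 days later.",
]
# Assumption: the news-coverage label comes from the Media_Attention column.
labels = ["Regional", "Regional"]

# Default settings (lowercasing, word tokens of two or more characters)
# are an assumption, not necessarily the configuration used for the real matrix.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(narratives)  # sparse matrix: documents x vocabulary

# Look up the column for one word and read its count in each document.
during_col = vectorizer.vocabulary_["during"]
during_counts = [int(counts[i, during_col]) for i in range(len(narratives))]

# Pair each label with its dense count vector.
labeled_rows = list(zip(labels, counts.toarray().tolist()))
```

Each row of `counts` corresponds to one narrative and each column to one vocabulary word, so the label can be kept alongside the matrix as a separate column.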