<strong>1 Classification Dataset</strong>

The dataset for the classification model contains 3,804 tweets: 1,902 are related to traffic accident reports (TA, positive class) and 1,902 are unrelated (NTA, negative class). To train the tweet classification model, a collaborative labeling strategy was designed in which 30 people labeled data according to given instructions. Each participant manually classified a tweet into one of three categories: traffic accident related, unrelated, or don't know/no response. Each tweet was evaluated by 3 participants. The final label was selected by voting: all 3 participants had to agree on the label, otherwise the tweet was excluded from training. This process took a month and required the development and deployment of a web application.

<strong>2 NER Dataset (Named Entity Recognition)</strong>

To train the entity recognition model, a sample was taken from the filtered tweets produced by the previous classification phase. In total, 1,340 tweets were extracted, of which 800 (almost 60% of the sample) come from "unofficial" users. These tweets are user reports of traffic incidents that occurred in Bogota from October 2018 to July 2019. The sample also includes other tweets that contain location references, such as reports on the state of road infrastructure, as well as some tweets from 2016 and 2017. Although these posts are not related to accidents per se, they were selected because they contain location information. The purpose was to train a model that recognizes these entities; identifying accident-related tweets was already covered by the previously created classifier. The dataset was split into 1,072 tweets for training and 268 for evaluation. It was manually labeled in the IOB (Inside-Outside-Beginning) format using the Brat annotation tool.
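As an illustration of the IOB format, the following minimal sketch turns token-level entity spans into the five tags this dataset defines (B-loc, I-loc, B-time, I-time, O). The example tweet, the span boundaries, and the helper name `to_iob` are hypothetical and are not taken from the dataset or from the authors' tooling.

```python
from typing import List, Tuple

# Entity types used in this dataset description: location and time.
ENTITY_TYPES = {"loc", "time"}


def to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, type) token spans into IOB tags."""
    tags = ["O"] * len(tokens)  # every token starts as Outside
    for start, end, entity_type in spans:
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type}")
        tags[start] = f"B-{entity_type}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"       # remaining tokens of the entity
    return tags


# Hypothetical tweet, invented for illustration only.
tokens = "Accidente en la Calle 80 con Avenida Boyaca a las 7 am".split()
# Assumed annotation: tokens 3..7 form a location, tokens 10..11 a time.
spans = [(3, 8, "loc"), (10, 12, "time")]
tags = to_iob(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each token receives exactly one tag, so a tagged tweet is simply two aligned sequences of equal length.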
Two entity types are defined: Location, which refers to the location of the report, and Time, which refers to the time or date of the incident. Accordingly, 5 labels were generated: B-loc, I-loc, B-time, I-time and O. The O label (Other) marks tokens that belong to neither entity type.

<strong>3 Traffic accident Twitter geolocation</strong>

This dataset contains 26,362 traffic accident tweets, each with the coordinates of the incident and the date of publication.
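The unanimous-agreement rule used to build the classification dataset (section 1) can be sketched as follows. This is a minimal illustration, not the authors' web application: the label strings and the function name `unanimous_label` are hypothetical, and it assumes that tweets unanimously marked "don't know/no response" are also excluded, since the published dataset contains only TA and NTA tweets.

```python
from typing import List, Optional

# Hypothetical label codes for the three annotation categories.
TA = "TA"     # traffic accident related
NTA = "NTA"   # unrelated
DK = "DK"     # don't know / no response


def unanimous_label(votes: List[str]) -> Optional[str]:
    """Return the agreed label, or None if the tweet should be excluded."""
    if len(votes) != 3:
        raise ValueError("each tweet is evaluated by exactly 3 participants")
    first = votes[0]
    # Keep the tweet only when all three votes match and the label is usable.
    # Assumption: unanimous "don't know" votes do not produce a training label.
    if all(vote == first for vote in votes) and first in (TA, NTA):
        return first
    return None


print(unanimous_label([TA, TA, TA]))     # kept as TA
print(unanimous_label([TA, NTA, TA]))    # excluded: no full agreement
print(unanimous_label([DK, DK, DK]))     # excluded under the stated assumption
```

Requiring full agreement is stricter than a simple majority vote: a 2-to-1 split discards the tweet instead of keeping the majority label.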
Topic:
Sentiment Analysis and Opinion Mining
Source information:
Source: Zenodo (CERN European Organization for Nuclear Research)