It is well known that the usefulness of a machine learning model depends on its ability to generalize to unseen data. This study uses three popular cyberbullying datasets to explore the effects of data, how it’s collected, and how it’s labeled, on the resulting machine learning models. The bias introduced from differing definitions of cyberbullying and from data collection is discussed in detail. Emphasis is placed on the impact of dataset expansion methods, which utilize current data points to fetch and label new ones. Furthermore, explicit testing is performed to evaluate the ability of a model to generalize to unseen datasets through cross-dataset evaluation. As hypothesized, the models show a significant drop in Macro F1 Score, with an average drop of 0.222. As such, this study highlights the importance of dataset curation and cross-dataset testing for creating models with real-world applicability. The experiments and other code can be found at https://github.com/rootdrew27/cyberbullying-ml.
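The cross-dataset comparison described above rests on two small computations: a macro-averaged F1 score and the difference between in-dataset and cross-dataset scores. The following is a minimal sketch of both, not the study's code; the function names `macro_f1` and `average_drop` and the toy labels are assumptions made for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def average_drop(in_dataset_f1, cross_dataset_f1):
    """Mean of (in-dataset score - cross-dataset score) over paired runs."""
    drops = [a - b for a, b in zip(in_dataset_f1, cross_dataset_f1)]
    return sum(drops) / len(drops)


# Toy labels only (1 = bullying, 0 = not); these are not results from the study.
same_source = macro_f1([1, 1, 0, 0], [1, 1, 0, 0])
other_source = macro_f1([1, 1, 0, 0], [1, 0, 0, 0])
print(same_source - other_source)
```

Macro averaging weights every class equally, so a model that fails on the rarer class is penalized even when overall accuracy stays high.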
Curling is a strategic ice sport that presents unique challenges for AI research due to its combination of complex decision-making and intricate physical dynamics. This project aims to develop a physics-based curling simulator to address these challenges, enabling accurate modeling of stone movement, ice conditions, and sweeping effects. Our approach uses an existing physics engine, MuJoCo, to simulate realistic curling interactions. We implemented physics models based on leading theories for basic curling shot selections. The simulator initially focuses on stone dynamics and shot selection, with more complex features such as sweeping effects to be added in later iterations. A visualization web app displays shot outcomes and will eventually support AI training and data analysis. In addition to the simulation application for curling research, we developed a training module covering both the physics of curling and the use of the MuJoCo library. This module is designed to help new students learn the complicated physics of curling and how to implement and maintain MuJoCo-based features in the simulator.
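As a rough illustration of the simplest piece of stone dynamics, the sketch below computes how far and how long a stone slides under a constant kinetic friction coefficient. This is a textbook simplification and an assumption of this example: it ignores curl, sweeping, and velocity-dependent friction, it does not use MuJoCo, and the speed and friction values are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2


def stopping_distance(v0, mu):
    """Slide distance in meters under constant deceleration mu * G."""
    return v0 ** 2 / (2 * mu * G)


def stopping_time(v0, mu):
    """Time in seconds until the stone comes to rest."""
    return v0 / (mu * G)


# Hypothetical release speed (m/s) and ice friction coefficient.
v0, mu = 2.0, 0.01
print(stopping_distance(v0, mu), stopping_time(v0, mu))
```

A closed-form baseline like this can serve as a sanity check for a simulated straight shot before curl and sweeping effects are introduced.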
Curling is a strategic team sport that presents unique challenges for artificial intelligence (AI) research, particularly in decision-making and physical simulation. However, a significant barrier to AI development in curling is the lack of structured and accessible datasets. This project aims to address this gap by leveraging standardized video feeds from Curling Stadium to generate datasets suitable for AI research. Our approach involves developing software that uses two computer vision models, YOLO (You Only Look Once) for object detection and SAM (Segment Anything Model) for segmentation, to analyze YouTube videos of curling matches, tracking objects such as rocks and players to gather data on their positions and movements. The expected outcome of the larger project is a structured and scalable dataset that can be used for AI-based curling research, including game strategy analysis and predictive modeling. This project lays the foundation for broader AI applications in curling by automating data collection, enabling machine learning models to analyze strategic decision-making, and fostering human-AI collaboration in sports analytics.
The Internet of Things (IoT) encompasses a variety of systems and devices that enable data exchange across networks. With this interleaved connectivity comes an inherent vulnerability to attacks. Traditional intrusion detection in IoT environments has been primarily human-reliant, but modern malicious methods surpass manual approaches. Machine Learning (ML)-based Intrusion Detection Systems (IDS) show promise but require refinement to match human-monitored IDS effectiveness. This study involved a literature review of research involving the NetFlow dataset NF-ToN-IoT-v2, created in 2022 to enable ML-based IDS development. With balancing, the dataset includes approximately 16 million net-flows, with 63.99% attack and 36.01% benign. The data’s imbalanced nature was addressed through methods like downsampling to reduce training bias. A hyper-parameter tuning pipeline was used to optimize algorithm testing and cross-validation, especially for different data balancing methods. The algorithms tested, chosen based on previous research identified during the literature review, include Naïve Bayes, Random Forest, K-Nearest Neighbor (KNN), Support Vector Machines (SVM), and XGBoost. Comparative analysis using confusion matrices and bar plots enabled the evaluation of algorithm effectiveness. Overall, this research highlights the potential of ML approaches in IoT IDS development by leveraging NF-ToN-IoT-v2 to enhance detection accuracy and bridge the gap between human-monitored and ML-driven solutions.
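Downsampling, mentioned above as one way to reduce training bias, can be sketched as randomly discarding rows from the larger classes until every class matches the smallest one. This is a minimal illustration under assumptions: the helper name `downsample`, the `label_of` accessor, and the toy rows are hypothetical and do not come from the study's pipeline.

```python
import random


def downsample(rows, label_of, seed=0):
    """Randomly keep an equal number of rows per class (the smallest class size)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # sample without replacement, so no row is duplicated
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy rows shaped like (flow_id, label); the 64/36 split only mimics the ratio above.
rows = [(i, "attack") for i in range(64)] + [(i, "benign") for i in range(64, 100)]
balanced = downsample(rows, label_of=lambda row: row[1])
print(len(balanced))
```

The trade-off is that downsampling throws away majority-class data, which is why it is usually compared against other balancing methods during tuning.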
Pancreatic ductal adenocarcinoma (PDAC) is the most common form of pancreatic cancer, accounting for over 90% of cases, and is characterized by aggressive growth, early metastasis, and resistance to therapy. A comprehensive understanding of the molecular mechanisms driving PDAC is essential for improving diagnosis, prognosis, and treatment. In this study, a multiomics approach was applied by analyzing both DNA methylation and RNA-sequencing datasets obtained from The Cancer Genome Atlas Pancreatic Adenocarcinoma project. The methylation dataset included significantly more tumor samples than normal samples, and a similar imbalance was observed in the RNA-seq dataset. This disparity posed a challenge for direct feature selection, as it could lead to a model biased toward tumor-associated features. To address this issue, six data imbalance correction techniques were evaluated and compared: Random Oversampling, Synthetic Minority Over-sampling Technique (SMOTE), and Adaptive Synthetic (ADASYN) for oversampling, along with Random Undersampling, Cluster Centroids, and AllKNN for undersampling. Identifying the most effective imbalance correction method is essential for improving feature selection accuracy and facilitating the discovery of novel genes associated with PDAC. A deeper understanding of these oncogenes could contribute to the development of non-invasive diagnostic tests and personalized treatment strategies for PDAC.
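Of the oversampling techniques listed above, SMOTE creates new minority-class samples by interpolating between an existing sample and one of its nearest minority-class neighbors. The sketch below is a simplified, SMOTE-style illustration in plain Python, not the implementation used in the study or in any library; the function name `smote_like`, its parameters, and the toy feature vectors are assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating toward near neighbors.

    Requires at least two minority samples, each a tuple of floats.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority-class neighbors of the base point (excluding itself)
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy 2-feature "normal" samples; real inputs would be methylation or expression values.
normal = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85), (0.3, 0.7)]
print(smote_like(normal, n_new=5))
```

Because every synthetic point lies on a segment between two real minority samples, the method adds variety without leaving the region the minority class already occupies, unlike random oversampling, which only duplicates rows.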
As the use of car dashboard cameras (dashcams) has increased, the availability of dashcam imagery has also increased. In recent years, dashcam imagery has been predominantly used in conjunction with computer vision techniques for autonomous vehicle systems. However, this research explores an alternative application of these technologies in the domain of public safety and security. Specifically, we apply object detection to dashcam imagery to address the challenge of identifying vehicles associated with active Amber Alerts. With the goal of aiding law enforcement in locating abducted children more efficiently, we employ the YOLO (You Only Look Once) object detection model, a state-of-the-art deep learning framework known for its real-time performance and accuracy. Our methodology involves training and fine-tuning the YOLO model on a custom dataset of dashcam footage, incorporating diverse environmental conditions such as varying lighting, weather, and traffic scenarios. Experimental results demonstrate that the model achieves high precision and recall rates in detecting target vehicles, validating its effectiveness for real-world deployment. This research highlights the potential of leveraging deep learning and computer vision techniques to address critical public safety challenges, offering a novel application of these technologies beyond their traditional use in autonomous driving. Our findings contribute to the growing body of work in computer science that seeks to harness AI for societal benefit.
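Precision and recall for an object detector are typically computed by matching predicted boxes to ground-truth boxes using intersection over union (IoU). The sketch below shows one common way to do this with greedy matching; it is an illustration under assumptions, not the evaluation code behind the reported results. The 0.5 threshold, the function names, and the toy boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predicted, truth, threshold=0.5):
    """Greedily match each prediction to its best unmatched ground-truth box."""
    unmatched = list(truth)
    tp = 0
    for box in predicted:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            tp += 1
            unmatched.remove(best)  # each ground-truth box can be matched once
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Toy boxes: one correct detection, one false alarm, one missed vehicle.
truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(0, 0, 10, 10), (50, 50, 60, 60)]
print(precision_recall(predicted, truth))
```

In an alert-matching setting, precision reflects how many flagged vehicles are real matches, while recall reflects how many target vehicles in view were actually flagged.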