Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces (2025)

Zhiling Chen
School of Mechanical, Aerospace, and Manufacturing Engineering
University of Connecticut
Zhiling.chen@uconn.edu
&Hanning Chen
Department of Computer Science
University of California Irvine
hanningc@uci.edu
&Mohsen Imani
Department of Computer Science
University of California Irvine
m.imani@uci.edu
&Ruimin Chen
School of Mechanical, Aerospace, and Manufacturing Engineering
University of Connecticut
ruimin.chen@uconn.edu
&Farhad Imani
School of Mechanical, Aerospace, and Manufacturing Engineering
University of Connecticut
farhad.imani@uconn.edu

Abstract

Workplace accidents due to personal protective equipment (PPE) non-compliance raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models have shown the capability to address this issue by identifying safety items, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising solution to the limitations of traditional object detection in PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes due to the complexity and variability of workplace environments, which require them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance, which comprises four main modules: scene recognition, the visual prompt, safety items detection, and fine-grained verification. The scene recognition module identifies the current scenario to determine the necessary safety gear. The visual prompt module formulates the specific visual prompts needed for the detection process. The safety items detection module identifies whether the required safety gear is being worn according to the specified scenario. Lastly, the fine-grained verification module assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only demonstrates an accuracy improvement over state-of-the-art question-answering based VLMs but also achieves inference times two hundred times faster.

1 Introduction

Workplace safety remains a critical concern across various industries, including construction, manufacturing, and healthcare, leading to significant efforts in safety training programs, the implementation of protective equipment, and the enforcement of strict safety regulations to reduce accidents and injuries (Margaret et al., 2015; Albert and Routh, 2021). Although these measures have improved safety awareness and reduced overall incidents, serious workplace accidents and worker injuries continue to persist (U.S. Bureau of Labor Statistics, 2023a). According to the U.S. Bureau of Labor Statistics, the fatal work injury rate for all workers was 3.7 per 100,000 full-time equivalent workers in 2022, reflecting an increase from 3.6 in 2021 and 3.4 in 2020 (U.S. Bureau of Labor Statistics, 2023b). Effective accident prevention in high-risk workplaces typically involves several critical measures, including hazard screening, providing appropriate personal protective equipment (PPE), maintaining on-the-job vigilance, and promoting safety awareness and education (Foulis, 2021). Among these accident prevention approaches, the provision and proper use of PPE, such as hard hats, safety goggles, gloves, and high-visibility vests, stand out as particularly vital (Occupational Safety and Health Administration, 2005). The U.S. Occupational Safety and Health Administration (OSHA) and similar agencies in other countries mandate that all personnel working in close proximity to site hazards wear proper PPE to minimize the risk of injury. In fact, OSHA reports that the proper use of PPE can prevent nearly 40% of occupational injuries and diseases, while 15% of injuries resulting in total disability are caused by the failure to wear proper PPE.

Traditionally, safety inspections have been performed manually, such as supervisors conducting walk-throughs on construction sites to check for hard hats and safety harnesses, or factory managers inspecting assembly lines to ensure workers are wearing gloves and protective eyewear. However, such an approach is labor-intensive, time-consuming, and susceptible to human error. Consequently, there has been a progressive shift towards utilizing sensor-based and vision-based systems to enhance the accuracy and efficiency of safety inspections. Sensor-based systems utilize a variety of technologies to monitor PPE usage in real time. For example, radio frequency identification (RFID) in conjunction with Zigbee technologies is commonly employed to detect the presence of PPE on workers and send compliance reports to central units (Barro-Torres et al., 2012). Vision-based systems, leveraging advancements in computer vision and machine learning, analyze images captured by cameras to automatically detect PPE. Techniques such as deep learning algorithms can identify PPE, including hard hats, safety goggles, and high-visibility vests, by comparing images to predefined models (Nath et al., 2020; Abouelyazid, 2022).

[Figure 1]

However, safety requirements vary significantly across diverse and complex work environments, as illustrated in Figure 1 (a). For example, construction sites mandate hard hats, high-visibility vests, gloves, long pants, and goggles, while chemical factories require safety goggles, face shields, hairnets, gloves, and aprons. Even when similar items are required, their specific attributes can differ; for instance, construction sites demand heavy-duty gloves for mechanical protection, whereas chemical plants require latex gloves resistant to chemical exposure (Occupational Safety and Health Administration, 2023). As illustrated in Figure 1 (b), current models often struggle to adequately address variability in safety equipment and its nuanced attributes across different contexts. This challenge primarily arises because these models are typically trained on limited datasets that focus on specific types of safety gear. Consequently, models developed for one industry, such as manufacturing, may fail to accurately recognize PPE requirements in other settings, where the types and usage of PPE can differ significantly. This can lead to inaccurate results and potentially dangerous oversights. Despite efforts to address these issues, many models remain constrained by their reliance on narrow datasets, lacking the generalizability needed for diverse scenarios and tasks. Additionally, sensor-based systems may not capture the full range of environmental variables or worker behaviors, leading to data gaps that compromise safety assessments. Similarly, vision-based systems can suffer from biases in training data, often failing to account for all variations in PPE usage or workplace conditions. A recent study (Mohona et al., 2024) demonstrates progress in detecting general PPE worn on construction sites; however, it lacks the capability to differentiate between specific scenarios and assess the fine-grained attributes of safety equipment. This highlights the critical need for advanced models that can adapt to the specific safety requirements of different industries to enhance overall workplace safety. Furthermore, as highlighted by Johnson and Khoshgoftaar (2019) and shown in Figure 1 (c), relying solely on sensors or other automated machines for detection can result in insufficient and imbalanced data.

Current deep learning-based object detection models follow either one-stage or two-stage detectors to tackle these challenges (Lu et al., 2020). The former simultaneously performs classification and region proposal to obtain results, while two-stage detectors perform classification and region proposal sequentially. However, both approaches struggle to effectively address the issues of scene and task diversity, as well as data scarcity and imbalance. As transfer learning continues to advance, there is a growing emphasis on improving generalization capabilities and addressing data-related challenges (Torrey and Shavlik, 2010). Techniques such as vision language models (VLMs) are being explored to enhance model performance in diverse scenarios by leveraging models pre-trained on large datasets and adapting them to specific tasks with limited data (Bordes et al., 2024). Moreover, integrating additional context-awareness and reasoning capabilities into vision language models helps ensure that detected objects meet the specific standards required for particular tasks. For instance, in a chemical plant, a system may not only detect that a worker is wearing gloves but also verify that the gloves meet the necessary chemical resistance standards.

However, as shown in Figure 1 (d), complex environments lead to poor alignment between image and text embeddings, posing new challenges for the accurate detection of safety items. To address this issue, we leverage an object detection model to isolate the individuals or items, thereby improving the alignment of visual and textual embeddings. Building on this improved alignment, vision language models form the backbone for matching image and text embeddings in object detection, yet they often lack the reasoning capabilities needed to identify the correct objects in safety detection. Our framework draws inspiration from human reasoning processes, which are naturally multi-staged. Initially, humans identify potential instances of objects and extract their visual features. Simultaneously, they parse the task at hand, discerning the necessary visual or functional attributes of items that workers need to wear in specific scenarios. Subsequently, they evaluate the instances based on task-relevant attributes to determine whether the individuals or items meet safety requirements. Humans do not merely judge whether an item meets a standard; instead, they engage in a step-by-step reasoning process, using the attributes required by the specific task scenario as clues.

In particular, we propose a zero-shot learning framework for fine-grained safety violation detection across various workplaces. Our framework adopts vision language models and large language models to align workplace image data with textual information and consists of two steps. First, we implement an object detection model, such as YOLO, to identify and extract bounding boxes of individuals and their respective safety gear within the image. This step ensures that each person and piece of safety equipment is isolated for further analysis. Second, we employ a vision language model to compare the cropped images with textual descriptions, such as "a person wearing [item]" for initial detection and "a [feature] [item]" for attribute verification. This step ensures that the detected safety equipment not only matches the required items but also meets the specified attributes essential for safety compliance. The framework has been implemented on real-world safety compliance datasets from six distinct environments, each with limited data, to evaluate its capability across diverse workplaces. The contributions of this work are as follows:

  • Instead of training a multi-modal model from scratch, we propose Clip2Safety, a VLM-based, multi-module, and dynamic safety compliance framework that leverages the semantic information from vision-language pretraining and the capabilities of object detection models to support a calibrated vision-text embedding space.

  • We design a scene recognition module to identify scenarios and determine the necessary safety gear, ensuring adequate matching between scene-specific requirements and visual prompts.

  • We introduce a visual prompt module that formulates visual prompts needed for the detection process, enhancing the model’s ability to adapt to varying safety compliance requirements.

  • Our framework not only demonstrates significant improvements in accuracy over state-of-the-art question-answering based VLMs but also achieves inference times that are two orders of magnitude lower, showing both high efficiency and effectiveness in dynamic safety compliance scenarios.

The remainder of this paper is organized as follows: Section 2 reviews PPE detection and vision language models. Section 3 introduces the Clip2Safety framework and its implementation. Section 4 discusses the dataset and experiments. Finally, the conclusion of this study is presented in Section 5.

2 Research Background

2.1 Personal Protective Equipment Detection

The detection of PPE has been significantly advanced through the application of deep learning and computer vision technologies. Early methods for PPE detection primarily relied on sensor-based systems, such as RFID technology (Barro-Torres et al., 2012). For example, one approach involved equipping each PPE component with RFID tags and using scanners at job site entrances to ensure workers were wearing the appropriate gear (Kelm et al., 2013). Another method investigated the use of short-range transponders embedded in PPE items, with a wireless system verifying compliance with safety regulations (Naticchia et al., 2013). Similarly, Barro-Torres et al. (2012) investigated a local area network to monitor RFID tags on PPE continuously, ensuring compliance throughout the workday. Additionally, GPS technology has been utilized to enhance PPE detection. For instance, Zhang et al. (2015) employed GPS devices attached to workers' safety helmets to provide supplementary safety monitoring on construction sites. In a more recent advancement, Pisu et al. (2024) introduced an operator area network system that leverages machine learning and received signal strength indicator (RSSI) analysis to detect improper PPE usage robustly. This system uses an SVM model with a customizable post-processing algorithm, reducing false positives by 80% and detecting issues within seven seconds. Furthermore, Yang et al. (2020) designed an automated PPE-tool pairing detection system based on a wireless sensor network, which effectively monitors the wearability status of PPE. However, despite their effectiveness in certain scenarios, these sensor-based approaches are limited by their dependency on physical sensors attached to PPE items and necessitate significant investment in purchasing, installing, and maintaining complex sensor networks, which can hinder their practical implementation (Nath et al., 2020).

With the advancement of computer vision technology, vision-based systems have revolutionized PPE detection due to their lower cost, simpler configuration, and broader range of application scenarios compared to sensor-based technologies. These systems utilize cameras to capture real-time images and videos of workers, which are then analyzed using computer vision techniques. The extensive deployment of cameras, combined with significant progress in computer vision technology, has established a foundation for effective PPE detection. For instance, Wu and Zhao (2018) proposed a color-based hybrid descriptor that combines local binary patterns, human movement invariants, and color histograms to extract features of hardhats in various colors, which are then classified using a hierarchical support vector machine. Li et al. (2018) developed a method for detecting workers utilizing ViBe and the C4 framework, leveraging the HSV color space for hardhat classification. Mneymneh et al. (2019) introduced a framework that first detects moving objects using a standard deviation matrix and then classifies them with an aggregate channel feature-based object detector. This approach integrates histogram of oriented gradients-based cascade object detectors to identify hardhats in the upper regions of detected personnel, which are subsequently processed by a color-based classification component. Despite their effectiveness, these multi-stage methods heavily depend on hand-crafted features and encounter difficulties in complex scenes with varying conditions, different viewpoints, and occlusions.

Thus, convolutional neural networks (CNNs) (LeCun et al., 2015) have become the backbone of such systems due to their robust image recognition capabilities. Recent studies employing CNNs for object detection have primarily utilized faster region-based CNNs (Faster R-CNN) (Ren et al., 2015) and you only look once (YOLO) (Redmon et al., 2016). These models facilitate the recognition of target objects, such as persons and helmets. When trained with sufficiently large datasets, they exhibit robust and improved performance, significantly enhancing the reliability of vision-based PPE detection systems. Faster R-CNN incorporates a region proposal network to improve detection speed and accuracy. For example, Saudi et al. (2020) leveraged Faster R-CNN to check workers' safety conditions based on PPE compliance, and Chen et al. (2020) introduced retinex image enhancement on top of Faster R-CNN to improve image quality for complex outdoor scenes in substations. On the other hand, YOLO is known for its speed and efficiency, making it suitable for real-time applications. YOLO divides the image into a grid and predicts bounding boxes and class probabilities directly from full images in one evaluation. Many researchers have utilized variants of YOLO for PPE detection. For instance, Wu et al. (2019) replaced the backbone of the YOLO v3 network with DenseNet, exploiting its advantages in model parameters and computational cost, to form a YOLO-Dense backbone convolutional neural network that improves helmet detection and ensures safe construction. Chen et al. (2023) introduced a lightweight detection model called WHU-YOLO, which enhances YOLOv5s with a Ghost module and Bi-FPN. Additionally, Benyang et al. (2020) used a multi-scale training strategy to enhance the adaptability of YOLO v4 to different detection scales. To enhance detection accuracy, researchers have increasingly integrated object detection models with classification models, resulting in more precise and reliable safety assessments. For instance, Lee et al. (2023) proposed a method that combines the strengths of YOLO and CNN-based approaches. In this two-stage process, YOLO is first employed to detect whether a helmet is being worn. Subsequently, a CNN-based classification model is used to distinguish between helmets, heads, and hats, allowing for a more nuanced evaluation of PPE compliance.

Despite these advancements, most research has focused on specific, highly supervised environments, such as construction sites, industrial plants, and nuclear facilities, where clear safety protocols and structured activity schedules are in place (Chen and Demachi, 2020; Önal and Dandıl, 2021). In dynamic and less supervised environments, such as manufacturing laboratories or outdoor construction sites, the requirements for PPE usage can vary significantly depending on the time of day, the specific tasks being performed, and changing environmental conditions. Consequently, there is a pressing need for interpretable detection methods that can adapt to diverse scenarios and provide insights into the specific safety requirements of each context. This necessitates a system capable of not only detecting PPE but also understanding the contextual factors influencing its usage, thereby ensuring comprehensive safety compliance across different settings (Fang et al., 2018; Hung and Su, 2021).

2.2 Vision Language Model

Recent years have witnessed substantial success in extending pre-trained vision language models to support new applications. Among the most successful efforts, models like Flamingo (Alayrac et al., 2022), OpenAI CLIP (Radford et al., 2021), and OpenCLIP (Cherti et al., 2023) have exhibited impressive performance in image-text matching, owing to their semantic knowledge and understanding of content that spans both modalities. These models have been applied successfully in downstream applications such as object detection (Shi et al., 2022), image captioning (Mokady et al., 2021), action recognition (Wang et al., 2021), task-oriented object detection (Chen et al., 2024), anomaly segmentation (Jeong et al., 2023), semantic segmentation (Liang et al., 2023), and dense prediction (Zhou et al., 2023). However, existing CLIP-based algorithms primarily focus on matching image patches with nouns in text, which poses challenges in understanding different people and objects within the images. Therefore, additional modules are needed to facilitate the matching between the visual attributes of image patches and the adjective phrases describing different people and items.

To enhance the alignment between images and text, a recent application of VLMs for PPE detection introduced a three-step zero-shot learning-based monitoring method (Gil and Lee, 2024). First, it detects workers on-site from images and crops body parts using human-body key points. Next, the cropped body images are described with image captioning. Finally, the generated text is compared to prompts describing PPE-wearing body parts, determining safety based on cosine similarity. However, this method still has limitations in accurately distinguishing between different types of PPE and their specific features in diverse environments.

Alongside these advancements, significant progress has been made in question-answering based VLMs, usually called visual question answering (VQA) models. Bulian et al. (2022) pioneered the creation of the first free-form and open-ended VQA dataset, for which human workers were asked to create questions that a smart robot might not answer, with human answers then collected for each question. The follow-up work, VQAv2 (Goyal et al., 2017), enhanced the previous dataset to reduce statistical biases in the answer distribution. Instead of relying on human workers, the GQA dataset (Hudson and Manning, 2019) employed scene graph structures from Visual Genome (Krishna et al., 2017) to generate question-answer pairs. These graph structures allowed the authors to balance the answer distributions for each question, reducing the dependency on answer statistics. Both datasets have been widely used for various VQA tasks (Alayrac et al., 2022; Dai et al., 2023). Because of their question-answer format, VQA models align naturally with the process of inspecting safety equipment. Ding et al. (2022) formulated a "rule-question" transformation and annotation system, turning safety detection into a visual question answering task and leveraging the strengths of VQA models to enhance the accuracy and efficiency of PPE compliance checks.

Despite these advancements, a notable tradeoff exists between achieving high-performance VQA models and maintaining acceptable inference times. High-performing VQA models typically require significant computational resources, leading to longer inference times, whereas real-time safety detection applications impose strict requirements on inference time to ensure timely and effective responses. Thus, designing a new model for such applications necessitates a careful balance between performance and inference speed.

3 Research Methodology

The proposed Clip2Safety, an interpretable and fine-grained detection framework for diverse workplace safety compliance, comprises four main modules: scene recognition, the visual prompt, safety items detection, and fine-grained verification.

[Figure 2]

3.1 Scene Recognition Module

To address the challenges posed by the diversity of real-world scenes and the corresponding variety of safety gear requirements shown in Fig.1(a), we developed a scene recognition module to identify scenes in images. As illustrated in Fig.2 ❶, this module generates high-quality image captions that provide concise summaries of the visual content. However, this abstraction process can result in the loss of specific visual details, potentially affecting our framework’s performance. To tackle this issue, we conducted a comprehensive survey of existing image captioning models, prioritizing those capable of generating more accurate scene descriptions while also considering our computational resource constraints. Ultimately, we selected Salesforce’s BLIP2-OPT-2.7B model as our image captioning model due to its advanced multi-modal capabilities, which integrate both vision and language understanding, allowing it to generate more nuanced and context-aware descriptions.
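For illustration, the following minimal sketch shows how a scene caption could be generated with the BLIP2-OPT-2.7B checkpoint via Hugging Face Transformers; the checkpoint identifier, decoding settings, and device handling are assumptions for illustration rather than the exact configuration used in this work.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the BLIP2-OPT-2.7B captioning model (assumed Hugging Face checkpoint name).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption_scene(image_path: str) -> str:
    """Generate a short caption describing the workplace scene."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()

# The resulting caption (e.g., "workers in a seafood factory") feeds the visual prompt module.
```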

To further improve the performance of the scene recognition module while adhering to computational resource constraints, we employed Low-Rank Adaptation (LoRA) (Hu et al., 2021) to provide accurate scene recognition by fine-tuning a minimal number of weights. For this purpose, we used the text prompt "in the [scene]" paired with the respective image as the textual input. LoRA works by introducing small additive weights into existing linear weight matrices. These weights retain the original weight matrix's dimensions and provide a parallel trainable pathway for the incoming feature map when multiplied. The core idea is that during adaptation the updates to the weight matrices have low intrinsic ranks.

The modification involves adding a rank-constrained product of matrices $\bm{A}$ and $\bm{B}$. For a weight matrix $\bm{W} \in \mathbb{R}^{d \times k}$, the update is represented as $\bm{W}^{\prime} = \bm{W} + \bm{B}\bm{A}$, where $\bm{B} \in \mathbb{R}^{d \times r}$, $\bm{A} \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Here, $d$ and $k$ are the dimensions of $\bm{W}$, and $r$ is the rank of the adapters. During fine-tuning, $\bm{W}$ remains unchanged, while the weights of $\bm{A}$ and $\bm{B}$ are updated. Following the initialization method in Hu et al. (2021), $\bm{B}$ is initialized to zeros, and $\bm{A}$ with small random values from a Gaussian distribution. The updated matrix $\bm{W}^{\prime}$ is given by:

$\bm{W}^{\prime} = \bm{W} + \frac{\alpha}{r}\,\bm{B}\bm{A} \qquad (1)$

In this equation, $\alpha$ is a scaling parameter that adjusts the influence of the new weights on the original weights. We apply this approach to the two weight matrices $\bm{W}_q$ and $\bm{W}_k$, corresponding to the query and key projections in the multi-head self-attention module of the transformer architecture. The fine-tuning process aims to minimize the negative log-likelihood loss $\mathcal{L}$, defined as:

$\mathcal{L} = -\sum_{i=1}^{K} \log P(C_i \mid \bm{I}_i; \boldsymbol{\theta}) \qquad (2)$

where $P(C_i \mid \bm{I}_i; \boldsymbol{\theta})$ is the probability of generating the correct caption $C_i$ given the data sample $\bm{I}_i$ and model parameters $\boldsymbol{\theta}$, which include the adapted weights $\bm{W}^{\prime}$, and $K$ is the total number of data samples.
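For concreteness, a minimal PyTorch sketch of the low-rank update in Eq. (1) is given below. It is an illustrative re-implementation, not the exact fine-tuning code: the rank, scaling value, and the attribute names of the wrapped projections are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a low-rank update: W' = W + (alpha / r) * B A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # W stays frozen during fine-tuning
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A: small Gaussian initialization
        self.B = nn.Parameter(torch.zeros(d, r))         # B: zeros, so W' = W at the start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path W x plus the parallel trainable low-rank path (alpha / r) * B A x.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Example (hypothetical attribute names): wrap the query/key projections of an attention block.
# attn.q_proj = LoRALinear(attn.q_proj, r=8, alpha=16.0)
# attn.k_proj = LoRALinear(attn.k_proj, r=8, alpha=16.0)
```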

3.2 Visual Prompt Module

[Figure 3]

In the visual prompt module, we address the diversity of safety requirements by employing a large language model to generate the necessary safety items based on the scene, as depicted in Fig.2 ❷. To bridge the semantic and reasoning gap between the fine-grained requirements and the safety items, we also employ the LLM to generate scene-relevant visual attributes, as illustrated in Fig.2 ❺.In Fig.3, we provide a detailed description of the prompts utilized to guide the LLM in generating the appropriate safety items and their visual attributes. To illustrate this process, we take a seafood factory as an example. The specific procedure unfolds as follows:

  • (1) First, we receive the scene information from the image caption model. Based on our compiled and synthesized dataset, the scenes are as follows: hospital, construction site, chemical factory, seafood factory, and manufacturing zone.

  • (2) Then we use the scene information generated by the image caption model to query the LLM: "List the items people should wear in a seafood factory." The response generated by the LLM encompasses several safety items, such as "hairnet", "face mask", "gloves", "aprons", and "boots".

  • (3) Building upon the LLM's response, we proceed to the second prompt: "Summarize the required visual features of the ["hairnet", "face mask", "gloves", "aprons", "boots"] in a seafood factory." This prompt is designed to guide the LLM in summarizing the specific visual attributes required for the requested safety items, as the same item may have different fine-grained requirements in different scenarios, as shown in Fig.1(b).

  • (4) Finally, we obtain from the LLM responses the visual attributes that each safety item worn in the seafood factory should have. For instance, the "boots" should have visual attributes such as "high-visibility color", "non-slip soles", and "waterproof". In total, we generate three attributes for each item, corresponding to color, material, and functionality, respectively. A code sketch of this two-prompt procedure is given after this list.
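The sketch below illustrates the two-prompt procedure, assuming the OpenAI chat-completions client as the LLM interface; the exact prompt wording, response parsing, and model name are illustrative assumptions rather than the precise prompts used in this work.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    """Send a single prompt to the LLM (GPT-4o here) and return its text response."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Prompt 1: required safety items for the recognized scene.
scene = "seafood factory"
items_text = ask(f"List the items people should wear in a {scene}. "
                 "Answer with a comma-separated list only.")
items = [s.strip() for s in items_text.split(",")]

# Prompt 2: scene-specific visual attributes (color, material, functionality) per item.
attrs_text = ask(f"Summarize the required visual features of the {items} in a {scene}. "
                 "For each item give one color, one material, and one functionality.")
print(items)
print(attrs_text)
```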

3.3 Safety Items Detection Module

In the safety items detection module, we generate $\bm{N}_{\text{pbbox}}$ bounding boxes for all persons in the scene, where pbbox denotes person bounding box, as illustrated in Fig.2 ❸. As mentioned in Fig.1 (c), obtaining a multi-scene dataset suitable for the safety compliance task is extremely challenging. Additionally, most single-scene safety detection datasets suffer from severe class imbalance, as only a few workers in the images fail to wear the required items, resulting in the majority being marked as positive samples. To deal with the data scarcity and imbalance problem, we leverage an open vocabulary object detection model, specifically YOLO-World, which is pre-trained on large-scale datasets and has strong open vocabulary detection capability. Note that, in the basic setting of our framework during inference, the open vocabulary object detection model is only responsible for generating bounding boxes for all the persons in the image. Based on the bounding box coordinates, we extract $\bm{N}_{\text{pbbox}}$ person image patches. After generating $\bm{N}_{\text{pbbox}}$ person image patches and $\bm{N}_{\text{item}}$ prompts, we pass them into pre-trained VLMs, such as OpenAI CLIP (Radford et al., 2021), to generate text and image embeddings, as shown in Fig.2 ❹. Here we use $\bm{L}_{\text{pbbox}}$ and $\bm{L}_{\text{item}}$ to represent the lists of person image patches and item prompt texts; the lengths of the lists are $\bm{N}_{\text{pbbox}}$ and $\bm{N}_{\text{item}}$, respectively. The computation process can be summarized as follows:

$\bm{E}_{\text{person}} = \text{CLIP}_{\text{image}}(\bm{L}_{\text{pbbox}}) \qquad (3)$
$\bm{E}_{\text{item}} = \text{CLIP}_{\text{text}}(\bm{L}_{\text{item}}) \qquad (4)$

Suppose the embedding dimension is $d$; the shapes of the generated vision embedding matrix and text embedding matrix are then $\bm{N}_{\text{pbbox}} \times d$ and $\bm{N}_{\text{item}} \times d$, respectively. After we obtain the person vision embedding $\bm{E}_{\text{person}}$ and the item text embedding $\bm{E}_{\text{item}}$, we perform a matrix multiplication between them to generate the predicted affinity matrix $\bm{A}_{\text{predict}}$, as shown in Fig.2 ❺. The computation can be summarized as:

$\bm{A}_{\text{predict}} = \bm{E}_{\text{person}} \, \bm{E}_{\text{item}}^{\top} \qquad (5)$

The shape of the affinity matrix $\bm{A}_{\text{predict}}$ is $\bm{N}_{\text{pbbox}} \times \bm{N}_{\text{item}}$. In traditional CLIP, to perform zero-shot image classification, we would directly apply the softmax function over the affinity matrix to obtain the prediction. In this step, rather than performing image classification, we need to verify whether the person is wearing the required items. For inference, we set a threshold value $\delta$ and proceed as follows:

$\bm{P}_i = \begin{cases} 1 & \text{if } \bm{A}_{\text{predict},\,i} \geq \delta, \\ 0 & \text{otherwise}. \end{cases} \qquad (6)$

Here, $\bm{P}_i$ represents the prediction of whether the person is wearing the $i$-th item, and $\bm{A}_{\text{predict},\,i}$ is the corresponding affinity score. To optimize our results, we also adopt an LLM-driven approach, in which the affinity matrix is provided to the LLM, allowing it to make the decision for us. As shown in Table 3, using GPT-4o as the decision-making LLM yields better results.
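The following sketch outlines the person cropping, CLIP encoding, affinity computation (Eq. (5)), and thresholding (Eq. (6)) described above. The detector weights file name, the item prompts, and the threshold value are illustrative assumptions rather than the exact experimental configuration.

```python
import torch
import open_clip
from PIL import Image
from ultralytics import YOLO

# Open-vocabulary detector, used here only to crop person patches (assumed weights name).
detector = YOLO("yolov8l-worldv2.pt")
detector.set_classes(["person"])

# Pre-trained CLIP encoders (OpenCLIP ViT-L/14).
clip_model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

items = ["hairnet", "face mask", "gloves", "apron", "boots"]        # from the visual prompt module
item_prompts = [f"a person wearing a {item}" for item in items]
delta = 0.6                                                         # illustrative threshold value

image = Image.open("workplace.jpg").convert("RGB")
boxes = detector(image)[0].boxes.xyxy.cpu().numpy()                 # N_pbbox person bounding boxes

with torch.no_grad():
    patches = [image.crop(tuple(map(int, b[:4]))) for b in boxes]   # L_pbbox: person image patches
    E_person = clip_model.encode_image(torch.stack([preprocess(p) for p in patches]))  # N_pbbox x d
    E_item = clip_model.encode_text(tokenizer(item_prompts))        # N_item x d
    E_person = E_person / E_person.norm(dim=-1, keepdim=True)
    E_item = E_item / E_item.norm(dim=-1, keepdim=True)
    A_predict = E_person @ E_item.T                                 # N_pbbox x N_item affinity matrix
    P = (A_predict >= delta).int()                                  # Eq. (6): 1 = person wears the item
```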

3.4 Fine-grained Verification Module

In the fine-grained verification module, we leverage the result from the safety items detection module. If the person in the image wears the item, we utilize the same open vocabulary object detection model to generate bounding boxes for the items and extract the image patches, as shown in Fig.2 ❼. As illustrated in Table 1, three distinct types of attributes are generated for each item within the visual prompt module.

After we generate the prompts for each attribute of each item and the corresponding item's image patch, we pass them into the same pre-trained VLMs to generate text and image embeddings. Here we use $\bm{L}_{\text{ibbox}}$ and $\bm{L}_{\text{attribute}}$ to represent the lists of item image patches and attribute prompt texts, where ibbox denotes item bounding box. The lengths of the lists are $\bm{N}_{\text{ibbox}}$ and $\bm{N}_{\text{attribute}}$, respectively. The computation process can be summarized as follows:

$\bm{E}_{\text{item}} = \text{CLIP}_{\text{image}}(\bm{L}_{\text{ibbox}}) \qquad (7)$
$\bm{E}_{\text{attribute}} = \text{CLIP}_{\text{text}}(\bm{L}_{\text{attribute}}) \qquad (8)$

where $\bm{E}_{\text{item}}$ and $\bm{E}_{\text{attribute}}$ are the item vision embedding matrix and the attribute text embedding matrix, respectively. We then extract the embedding vectors of each item image patch and its corresponding attribute prompt texts, denoted as $\bm{v}_{\text{item}}$ and $\bm{v}_{\text{attribute}}$. During the feature verification process, we compute the cosine similarity between the item image patch and attribute prompt text embedding vectors to measure their alignment. The cosine similarity $\bm{\omega}(\bm{v}_{\text{item}}, \bm{v}_{\text{attribute}})$ is calculated as:

$\bm{\omega}(\bm{v}_{\text{item}}, \bm{v}_{\text{attribute}}) = \dfrac{\bm{v}_{\text{item}} \cdot \bm{v}_{\text{attribute}}}{\|\bm{v}_{\text{item}}\| \, \|\bm{v}_{\text{attribute}}\|} \qquad (9)$

In this step, we need to check whether the item worn by the person meets the attribute requirements, so we set another threshold value $\tau$ and have:

$\bm{P}_j = \begin{cases} 1 & \text{if } \bm{\omega}(\bm{v}_{\text{item}}, \bm{v}_{\text{attribute}}) \geq \tau, \\ 0 & \text{otherwise}. \end{cases} \qquad (10)$

Here, $\bm{P}_j$ is the prediction of whether the item worn by the person meets the $j$-th attribute requirement. We can also leverage the LLM-driven approach to optimize the result. However, this step differs from Section 3.3: we generate a similarity list over all of an item's corresponding attributes and then give the list to GPT-4o to make the decision for us.
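A condensed sketch of the attribute check in Eqs. (9) and (10) is shown below, reusing the CLIP encoders and item crops from the previous module; the prompt wording and the value of τ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def verify_attributes(clip_model, preprocess, tokenizer, item_patch, attribute_prompts, tau=0.6):
    """Return 1/0 per attribute: does the cropped item patch match each attribute prompt?"""
    with torch.no_grad():
        v_item = clip_model.encode_image(preprocess(item_patch).unsqueeze(0))  # 1 x d
        v_attr = clip_model.encode_text(tokenizer(attribute_prompts))          # N_attribute x d
        omega = F.cosine_similarity(v_item, v_attr)                            # Eq. (9), one score per attribute
    return (omega >= tau).int().tolist()                                       # Eq. (10)

# Example with hypothetical "a [feature] [item]" prompts for a detected pair of boots:
# verify_attributes(clip_model, preprocess, tokenizer, boot_patch,
#                   ["a high-visibility boot", "a rubber boot", "a waterproof boot"])
```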

3.5 Evaluation

Ensuring compliance with safety item requirements involves detecting whether a person is wearing the necessary safety items and verifying that these items meet specified attribute standards. We structure this task into three stages: safety items detection, feature verification, and overall evaluation. For each stage, we define different metrics to comprehensively evaluate model performance.

The first stage, safety items detection, focuses on identifying the presence of required safety items, such as hard hats, safety goggles, and high-visibility vests, on individuals. We choose five required safety items for each environment, generated by the LLM as described in Section 3.2. This stage leverages advanced object detection algorithms to accurately pinpoint and label each safety item within the visual input. Detection accuracy is measured as the proportion of safety items correctly identified by the model relative to the total number of items that should be present. More precisely, the total number of individuals is considered along with the actual number of safety items each person is wearing, and detection accuracy is then calculated by comparing the number of safety items the model correctly identifies with the expected number of safety items across all individuals.

The second stage, feature verification, involves validating that the detected safety items adhere to specific attribute requirements. For example, it is not only important to detect the presence of gloves but also to ensure that they comply with standards for chemical resistance or thermal protection, depending on the operational context. As detailed in Table 1, the attributes are classified into three distinct categories: Directly Observable (DO), Situationally Observable (SO), and Inferentially Observable (IO), with each attribute being assessed individually. To assess feature verification accuracy for each attribute type, we compare the number of correctly detected attributes within each category (DO, SO, and IO) against the total number of detected items. For Directly Observable attributes, accuracy is the proportion of correctly identified attributes out of the total items detected; the same approach is applied to Situationally Observable and Inferentially Observable attributes.
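Both accuracy measures reduce to simple ratios. The sketch below assumes, purely for illustration, that ground truth and predictions are stored as per-person dictionaries of item labels and per-item dictionaries of DO/SO/IO attribute labels.

```python
def detection_accuracy(pred_items, true_items):
    """Stage 1: fraction of required (person, item) pairs the model identifies correctly.

    pred_items / true_items: one {item_name: 0/1} dictionary per person.
    """
    correct = total = 0
    for pred, true in zip(pred_items, true_items):
        for item, label in true.items():
            total += 1
            correct += int(pred.get(item, 0) == label)
    return correct / max(total, 1)

def attribute_accuracy(pred_attrs, true_attrs):
    """Stage 2: per-category (DO / SO / IO) fraction of correctly verified attributes."""
    acc = {}
    for category in ("DO", "SO", "IO"):
        correct = total = 0
        for pred, true in zip(pred_attrs, true_attrs):
            if category in true:
                total += 1
                correct += int(pred.get(category) == true[category])
        acc[category] = correct / max(total, 1)
    return acc
```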

4 Experiment

4.1 Dataset

[Figure 4]

To meet the diverse scene requirements of our model, we collected and integrated data from four sources due to the lack of a suitable comprehensive public dataset. From these sources, we selected five distinct scenes, and each dataset is described in detail as follows:

  • Safety Detection Dataset (Roboflow, 2023): The safety detection dataset, available on Roboflow, targets various safety detection applications. It features images of PPEs in industrial environments such as construction sites and factories. The dataset annotates different types of safety equipment, including hard hats and reflective vests, facilitating effective detection in diverse working conditions.

  • PPEs Dataset (Equipment, 2022): The PPEs Dataset, provided by Roboflow, contains images of PPEs from various work environments. This dataset is used to detect and classify PPEs such as helmets, goggles, and gloves. The images are diverse, covering different workplace scenarios, ensuring a comprehensive dataset for PPE detection tasks.

  • CPPE-5: Medical Personal Protective Equipment Dataset (Dagli and Shaikh, 2021): The CPPE-5 dataset focuses on medical personal protective equipment and includes images from hospitals and other healthcare settings. This dataset annotates medical masks, face shields, gloves, and other equipment to enhance safety detection in medical environments. It is particularly useful for improving the detection and classification of medical PPE in clinical settings.

  • Pictor-v3 (Nath et al., 2020): The Pictor-v3 dataset extends Pictor-v2 by adding classes for hat and vest. The relevant classes (worker, hat, and vest) are used in this study. This dataset focuses on real-time PPE detection on construction sites.

Table 1: Attribute types, observable levels, and possible values.

Type | Observable level | Possible values
Colors | Directly Observable | black, dark blue, dark green, dark purple, yellow, light blue, light green, light purple, white, blue, brown, green, grey, purple, red
Materials | Situationally Observable | plastic, polycarbonate, leather, rubber, latex, nitrile, fabric
Functionalities | Inferentially Observable | shock-absorbing, impact-resistant, insulated, highly-visible, reflective, anti-slip, cut-resistant, puncture-resistant, dust-proof, fragment-proof, UV-protected, splash-proof, flame-retardant, chemical-protective, acid-resistant, alkali-resistant, face-protective, eye-protective, virus-proof, bacteria-proof, liquid-resistant, hair-covering, waterproof, contamination-preventive, stain-resistant

As shown in Fig.4, the proposed dataset is an attribute-based detection dataset covering 6 scenes and 9 object categories, encompassing 49 attributes across the image data. The attributes used to describe objects are reported in Table 1. To address the varying levels of complexity in visual attribute recognition, we classify attributes into three categories based on the level of context and reasoning required to identify them accurately. Directly observable attributes are those that can be immediately identified from visual features without the need for additional context or interpretation. Situationally observable attributes require some understanding of the scene's context but still have a significant visual component that aids in their identification. Finally, inferentially observable attributes are those that cannot be directly seen but must be inferred through context, usage, or additional knowledge about the object. These three categories correspond to color, material, and functionality, respectively.

4.2 Implementation Details

For the scene recognition module, we use BLIP-2 with the blip2-opt-2.7b configuration. For the object detection model, we experiment with YOLO-World; the implementation utilizes the yolov8l-worldv2 configuration in Ultralytics. Regarding the VLM, we employ OpenCLIP: the vision transformer encoder configuration is ViT-L/14@224px and the text encoder uses OpenAI's tokenizer. We generate the required safety items and visual feature attributes for each item using OpenAI GPT-4o. All experiments are conducted on a single NVIDIA RTX 4090 24GB GPU.

4.3 Comparison with State-of-the-Art Methods

[Figure 5]

To align with our model's process of first verifying whether the required items are worn and then checking whether the items have the required attributes, we chose VQA models as our baseline. In Fig.5, we use LLaVA-1.6-7b to show an example of how our task is performed with VQA models. Similar to our model, we divide the questions into two steps. In the first step, we use yes/no questions to check whether the person is wearing the items. In the second step, we use wh-questions to check whether the item has the required attributes. Evaluating the correctness of an answer requires comparing a VQA model's output with a reference answer. We begin with the simplest evaluation metric, called Exact Match (EM). It involves preprocessing the model's output p, a free-format text that resembles natural language with human-like fluency, and the corresponding answer c by removing non-alphanumeric characters, applying lowercase conversion, and normalizing spaces. For EM, a prediction is considered correct only if the model's output exactly matches the correct answer. This strict criterion may not accurately evaluate the model's performance, as it does not account for semantically correct but syntactically different answers. A less restrictive option is to consider a response correct if the prediction contains the true answer after preprocessing, named Contains (Cont) (Xu et al., 2023). As shown in Fig.5, the ground truth answers are highlighted in yellow. If the output of the VQA model contains the true answer, it is considered correct. For example, in response to the question, "What functionality does the mask worn by the person in the image have?" LLaVA-1.6-7b correctly identifies that the mask can prevent airborne particles, bacteria, and viruses, and the spread of infections. These functionalities are included in the ground truth answer, so the response is marked as correct.
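A minimal sketch of the two answer-matching criteria, Exact Match and Contains, under the normalization described above; the regular expressions are an assumed concrete implementation of that preprocessing.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip non-alphanumeric characters, and collapse spaces."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction: str, answer: str) -> bool:
    """EM: the normalized prediction must equal the normalized answer."""
    return normalize(prediction) == normalize(answer)

def contains(prediction: str, answer: str) -> bool:
    """Cont: the normalized prediction must contain the normalized answer."""
    return normalize(answer) in normalize(prediction)

# Example: contains("The mask can prevent airborne particles and viruses.", "virus") -> True
```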

Table 2: Configurations of the compared models.

Method | Image Res. (px) | Prompt | Visual Encoder | Text Encoder | Model Size
BLIPvqa | 480 | {} | ViT-B/16 | BERT | 0.361B
BLIP-2 OPT | 224 | QA | ViT-g EVA-CLIP | unsup. OPT | 3.745B
BLIP-2 T5 | 224 | QA | ViT-g EVA-CLIP | inst. FlanT5 XL | 3.942B
InstructBLIP T5 | 224 | QA | ViT-g EVA-CLIP | inst. FlanT5 XL | 4.023B
InstructBLIP V | 224 | QA | ViT-g EVA-CLIP | Vicuna-7B | 7.913B
LLaVA-1.5-7B | 224 | * | ViT-L/14 CLIP | LLaMA 7B | 6.743B
LLaVA-1.6-7B | 224 | * | ViT-L/14 CLIP | Vicuna-7B | 7.06B
Clip2Safety | 224 | / | ViT-L/14 CLIP | OpenAI's tokenizer | 0.428B

In Table 3, we present a comparison of Clip2Safety's performance with other state-of-the-art VQA models, highlighting results for Step 1 and the three observable levels in Step 2. We selected one model fine-tuned on VQAv2, BLIP (Li et al., 2022); one multi-purpose model evaluated in a zero-shot manner, BLIP-2 (Li et al., 2023); and two conversational models, InstructBLIP (Dai et al., 2023) and LLaVA (Liu et al., 2024). All models consist of an image encoder and a text encoder based on the transformer architecture, along with a fusion model. These models leverage pretrained unimodal encoders and train the fusion model using image-text data. BLIP fine-tunes the encoders together with the fusion module using common losses such as image-text contrastive loss (ITC), image-text matching loss (ITM), and image-conditional language modeling loss (LM). In contrast, BLIP-2, InstructBLIP, and LLaVA keep the unimodal encoders frozen and train only the fusion module. Notably, part of the training data for InstructBLIP includes the VQAv2 dataset. This comparative analysis underscores the architectural differences and training strategies employed by each model, providing context for their performance metrics. In Table 2, we detail the configurations of the models: each model's image resolution, prompt type, visual encoder, text encoder, and model size. The BLIPvqa model uses a 480-pixel image resolution with a ViT-B/16 visual encoder and a BERT text encoder, totaling 0.361 billion parameters. BLIP-2 OPT employs a 224-pixel resolution, a ViT-g EVA-CLIP visual encoder, and an unsupervised OPT text encoder, with a model size of 3.745 billion parameters. Similarly, BLIP-2 T5 uses the same visual encoder but with a FlanT5 XL text encoder, resulting in a 3.942 billion parameter model. The InstructBLIP T5 and InstructBLIP V models also utilize the ViT-g EVA-CLIP visual encoder but differ in their text encoders, using FlanT5 XL and Vicuna-7B, with sizes of 4.023 billion and 7.913 billion parameters, respectively. The LLaVA-1.5-7B and LLaVA-1.6-7B models feature the ViT-L/14 CLIP visual encoder and the LLaMA 7B and Vicuna-7B text encoders, with sizes of 6.743 billion and 7.06 billion parameters, respectively, both at a 224-pixel resolution. Finally, Clip2Safety combines the ViT-L/14 CLIP visual encoder with OpenAI's tokenizer, resulting in a model size of 0.428 billion parameters at a 224-pixel resolution.

Table 3: Accuracy (%) comparison with state-of-the-art VQA models.

Method | Step 1 | Step 2 DO | Step 2 SO | Step 2 IO | Mean
BLIP vqa | 76.4 | 61.6 | 53.2 | 3.2 | 48.6
BLIP-2 OPT | 72.0 | 39.6 | 58.0 | 19.2 | 38.2
BLIP-2 T5 | 78.4 | 48.4 | 32.4 | 39.2 | 49.6
InstructBLIP T5 | 76.8 | 52.8 | 39.2 | 29.2 | 49.5
InstructBLIP V | 81.6 | 46.4 | 60.0 | 5.2 | 48.3
LLaVA-1.5-7b | 74.4 | 54.8 | 62.8 | 82.0 | 68.5
LLaVA-1.6-7b | 91.2 | 48.4 | 62.0 | 78.8 | 70.1
Clip2Safety | 76.8 | 76.9 | 61.4 | 65.8 | 70.2
Clip2Safety+GPT4o | 77.2 | 79.7 | 66.6 | 65.8 | 72.3

Table 3 shows the results of comparing our model with the other VQA models. In Step 1, LLaVA-1.5-7b and LLaVA-1.6-7b performed notably well, achieving accuracies of 74.4% and 91.2%, respectively. Our Clip2Safety matched InstructBLIP T5 with an accuracy of 76.8%, while Clip2Safety with GPT-4o decision-making scored 77.2%. In Step 2, focusing on directly observable attributes, the powerful LLaVA models, unlike in Step 1, do not perform as well as BLIPvqa. Clip2Safety achieves an accuracy of 76.9%, and using GPT-4o to make decisions raises it to 79.7%. For situationally observable attributes, our model is 1.4% lower than the best-performing LLaVA-1.5-7b; when we use GPT-4o to make decisions, the accuracy improves to 66.6%. In the inferentially observable (IO) category, LLaVA-1.5-7b and LLaVA-1.6-7b excelled with 82.0% and 78.8%. In terms of average accuracy, LLaVA-1.6-7b is ahead of the other VQA models, yet it is still 0.1% lower than our Clip2Safety, and Clip2Safety+GPT4o achieves the highest average accuracy of 72.3%. We also compare the inference times in Table 4: in both Step 1 and Step 2, the inference time of our model is much lower than that of the other VQA models.

Table 4: Inference time comparison.

Method | Step 1 | Step 2 DO | Step 2 SO | Step 2 IO | Mean
BLIP vqa | 1.3 | 1.3 | 1.3 | 1.6 | 1.4
BLIP-2 OPT | 7.9 | 7.0 | 8.3 | 12.0 | 8.8
BLIP-2 T5 | 6.1 | 6.0 | 7.5 | 8.2 | 7.0
InstructBLIP T5 | 13.9 | 9.9 | 12.7 | 9.9 | 11.6
InstructBLIP V | 48.6 | 49.7 | 43.2 | 46.6 | 47.0
LLaVA-1.5-7b | 51.9 | 60.5 | 107.0 | 123.2 | 85.7
LLaVA-1.6-7b | 140.9 | 144.1 | 202.3 | 237.6 | 181.2
Clip2Safety | 0.4 | 2.8 | 2.1 | 2.9 | 2.1

4.4 Ablation Study

4.4.1 Comparison of Performance Across Different Object Detection Models

In this section, we focus on the contribution of different object detection models to the final accuracy. We compare performance against a baseline that detects directly on the original images. To assess the impact of the object detection models, we report the accuracy for each step. The original OpenAI CLIP and OpenCLIP models were primarily designed to match images with nouns, which makes it challenging for them to accurately associate each person and item with its corresponding description. As shown in Table 5 (c), (e), and (g), the accuracy improves by approximately 16%.

Additionally, Table 5 (b-d), (e-g), and (h-j) illustrate the effect of various optimizations on Clip2Safety's performance. Specifically, Table 5 (b-d) utilizes the Single Shot MultiBox Detector (SSD) (Liu et al., 2016), Table 5 (e-g) employs OWL-ViT (Minderer et al., 2022), and Table 5 (h-j) uses YOLO-World (Cheng et al., 2024). We conduct three pairs of comparisons, namely (b) and (c), (e) and (f), and (h) and (i), to illustrate the effect of different threshold strategies (different δ). In (b), (e), and (h), we utilize an average threshold of 0.6, while in (c), (f), and (i), we apply different per-step thresholds selected via g-means (Jain et al., 2009). The accuracy improvement achieved with different δ values is 8.1% and 0.2% when using OWL-ViT and YOLO-World as the object detection model, respectively.
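The per-step thresholds are chosen by maximizing the geometric mean of sensitivity and specificity along the ROC curve; a short sketch using scikit-learn follows, with placeholder validation labels and similarity scores.

```python
import numpy as np
from sklearn.metrics import roc_curve

def gmeans_threshold(y_true, scores):
    """Pick the similarity threshold that maximizes sqrt(TPR * (1 - FPR))."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    gmeans = np.sqrt(tpr * (1.0 - fpr))
    return thresholds[np.argmax(gmeans)]

# Example with placeholder validation data (0/1 labels and CLIP similarity scores):
# delta = gmeans_threshold(val_labels, val_similarities)
```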

In Table 5 (d), (g), and (j), we present the accuracy improvement when using GPT-4o to make decisions, as described in Section 3.3. Compared to the standard settings, the LLM-decision mechanism provides accuracy improvements of 4.7%, 11.7%, and 2.1% when using SSD, OWL-ViT, and YOLO-World as the object detection model, respectively.

Table 5: Ablation on object detection models and decision strategies (accuracy, %).

Method | Step 1 | Step 2 DO | Step 2 SO | Step 2 IO | Mean
(a) without OD | 60.8 | 60.4 | 65.1 | 51.2 | 59.4
(b) SSD | 72.8 | 73.0 | 53.1 | 54.5 | 63.4
(c) SSD + diff δ | 72.8 | 62.8 | 63.1 | 53.5 | 63.1
(d) SSD + GPT4o | 76.8 | 77.6 | 64.3 | 53.5 | 68.1
(e) OWL-ViT | 72.8 | 52.9 | 54.3 | 58.2 | 59.6
(f) OWL-ViT + diff δ | 72.8 | 73.6 | 64.7 | 59.8 | 67.7
(g) OWL-ViT + GPT4o | 79.2 | 79.3 | 64.7 | 61.8 | 71.3
(h) YOLO-World | 76.8 | 76.9 | 61.4 | 65.8 | 70.2
(i) YOLO-World + diff δ | 76.8 | 78.8 | 63.8 | 62.2 | 70.4
(j) YOLO-World + GPT4o | 77.2 | 79.7 | 66.6 | 65.8 | 72.3

4.4.2 Comparison of Performance Across Different Pretrained VLMs

In Table 6, we present a comprehensive comparison of various pre-trained vision language models, highlighting their performance across different configurations and optimization strategies. The table reports accuracy for Step 1, for the three Step 2 categories (DO, SO, and IO), and on average.

Table 6 (a-f) shows the performance of different pretrained CLIP models. Among these, CLIP-ViT-L-patch14 achieves a mean accuracy of 70.2%, which improves to 70.4% with the per-step thresholds δ. CLIP-ViT-B-patch16 shows a lower mean accuracy of 44.4%, which the threshold strategy raises to 57.1%. CLIP-ViT-B-patch32 has a mean accuracy of 44.3%, which improves to 63.5% with the per-step thresholds.

Table 6 (g-j) shows the performance of different pre-trained SigLIP models, which also vary notably. SigLIP-base-patch16-384 achieves a mean accuracy of 55.6%, which drops to 45.5% with the threshold strategy. SigLIP-so400m-patch14-384 and its threshold-tuned counterpart reach mean accuracies of 55.6% and 57.1%, respectively.
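For context on how the two families in Table 6 are typically queried through Hugging Face transformers, the sketch below scores prompts against an image: CLIP applies a softmax over the image-text logits, while SigLIP applies an independent sigmoid per prompt. The checkpoint names follow the Hugging Face hub; the image path and prompts are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

image = Image.open("worker_crop.jpg")   # hypothetical cropped detection
prompts = ["a worker wearing a hard hat", "a worker without a hard hat"]

def score(model_name, sigmoid=False):
    model = AutoModel.from_pretrained(model_name)
    processor = AutoProcessor.from_pretrained(model_name)
    padding = "max_length" if sigmoid else True   # SigLIP checkpoints expect max_length padding
    inputs = processor(text=prompts, images=image, padding=padding, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits_per_image
    return torch.sigmoid(logits) if sigmoid else logits.softmax(dim=-1)

clip_scores = score("openai/clip-vit-large-patch14")
siglip_scores = score("google/siglip-base-patch16-384", sigmoid=True)
```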

We also observe that the CLIP models with larger patch sizes, such as CLIP-ViT-B-patch32, tend to benefit more from threshold adjustments, whereas smaller-patch models, like SigLIP-base-patch16-384, are more stable but show less pronounced gains, suggesting that they are less sensitive to threshold variations.

Moreover, the comparison highlights the strengths of different VLM architectures. The CLIP models generally outperform SigLIP models in terms of mean accuracy, suggesting that the architecture and pretraining strategies of CLIP models may be more effective for our tasks. However, the specific context and nature of the observable information (DO, CO, and IO) also play a crucial role in determining model performance, as seen in the varying accuracy scores across these categories.

Table 6. Ablation on different pretrained VLMs.

Method                                   Step 1    Step 2 (DO)    Step 2 (SO)    Step 2 (IO)    Mean
(a) CLIP-ViT-L-patch14                   76.8      76.9           61.4           65.8           70.2
(b) CLIP-ViT-L-patch14 + diff δ          76.8      78.8           63.8           62.2           70.4
(c) CLIP-ViT-B-patch16                   73.2      26.0           29.5           49.0           44.4
(d) CLIP-ViT-B-patch16 + diff δ          61.6      61.3           58.0           47.6           57.1
(e) CLIP-ViT-B-patch32                   73.2      26.0           29.6           48.5           44.3
(f) CLIP-ViT-B-patch32 + diff δ          52.8      74.1           64.2           62.9           63.5
(g) SigLIP-base-patch16-384              26.8      74.0           70.4           51.0           55.6
(h) SigLIP-base-patch16-384 + diff δ     36.4      61.3           48.8           35.3           45.5
(i) SigLIP-so400m-patch14-384            26.8      74.0           70.4           51.0           55.6
(j) SigLIP-so400m-patch14-384 + diff δ   52.8      74.2           57.8           43.6           57.1

4.4.3 Comparison of Performance Across Different Threshold Values

[Figure 6: ROC curves for Step 1 and for the Step 2 directly, situationally, and inferentially observable categories under varying threshold values δ.]

In this section, we investigate the effect of different threshold values on our model’s performance. Viewing detection as a binary classification problem, Fig. 6 presents the ROC curves for the different steps of our evaluation, emphasizing the impact of varying thresholds on the true positive rate (TPR) and false positive rate (FPR) across the observable levels. We explore threshold values δ from 0.58 to 0.65, as all similarity scores fall within this interval. The color gradient in each plot represents different threshold values within this range, showing their significant impact on model performance. Higher thresholds generally result in fewer false positives but may also reduce the TPR, as indicated by the tighter clustering of points near the lower end of the FPR axis. Conversely, lower thresholds increase the TPR but also raise the FPR, illustrating the trade-off between sensitivity and specificity.
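Curves such as those in Fig. 6 can be computed per step from the binary ground-truth labels and the corresponding similarity scores; a minimal sketch with scikit-learn, where the labels and scores are hypothetical placeholders for one observable category:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_for_step(labels, scores, delta_range=(0.58, 0.65)):
    """ROC points and AUC for one step, restricted to the swept threshold range."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    step_auc = auc(fpr, tpr)
    in_range = (thresholds >= delta_range[0]) & (thresholds <= delta_range[1])
    return fpr[in_range], tpr[in_range], thresholds[in_range], step_auc

# Hypothetical labels/scores for one observable category:
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = 0.58 + 0.07 * rng.random(200)
fpr, tpr, deltas, step_auc = roc_for_step(labels, scores)
print(f"AUC = {step_auc:.2f}")
```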

The upper left plot presents the ROC curve for Step 1, which involves detecting whether a person is wearing the required item. The model achieves an area under the curve (AUC) of 0.76, indicating a strong ability to distinguish between true positives and false positives at this initial stage. The curve shows that the model maintains a high TPR while keeping the FPR relatively low, especially at moderate threshold values, which suggests that the model is effective in detecting the presence of required safety items with fewer errors.

The upper right and lower left plots show the ROC curves for the directly observable and situationally observable steps, with AUCs of 0.64 and 0.76, respectively. The decrease in AUC for the directly observable step relative to Step 1 indicates that as the attributes become more challenging to detect, the model’s performance drops; its curve shows a wider spread of points along the FPR axis, indicating more false positives at certain thresholds. Despite involving attributes that are harder to observe than the directly observable ones, the situationally observable step maintains an AUC similar to Step 1. This indicates the model’s robustness in handling situational contexts where the observable cues vary with the scenario: the curve shows that the model can still achieve a high TPR with a manageable FPR at optimal thresholds, suggesting that it adapts well even when the context of the observables is less straightforward.

The lower right plot presents the ROC curve for the inferentially observable step, which involves assessing the functionality of the item. This step has the lowest AUC of 0.55. This significant drop in performance indicates that the model struggles to accurately classify true positives and false positives when the observables require inference rather than direct detection. The curve shows a broad distribution of points, reflecting high FPRs even at higher threshold values, which illustrates that the model’s sensitivity is greatly challenged in inferential scenarios.

4.5 Visualization and Discussion

[Figure 7: Clip2Safety detection results on sample images; red bounding boxes mark persons and blue bounding boxes mark items.]

Fig. 7 presents our model’s detection results on various image samples. Red bounding boxes label the persons and blue bounding boxes label the items. Specifically, Fig. 7 (a) showcases an example where our model performs well, with correct results in both Step 1 (safety items detection) and Step 2 (feature verification).

However, Fig. 7 (b) highlights an instance where our model’s performance is suboptimal. While the results are accurate in Step 1, identifying the required safety items, the model fails in Step 2, incorrectly verifying the attributes of the shoes. This example illustrates a common challenge faced by our model: accurately verifying specific attributes of safety items in diverse and complex environments.

Additionally, Fig. 7 (c) demonstrates a scenario where the model underperforms in both steps, failing to detect the required safety items and to verify their attributes. This case underscores the difficulty of ensuring comprehensive safety compliance in highly variable settings, highlighting the need for further refinement and enhancement of the model to improve its robustness and accuracy across different scenarios.

5 Conclusions

In this study, we introduce Clip2Safety, a novel and comprehensive framework designed to enhance safety detection across various environments. Clip2Safety efficiently leverages the pre-trained knowledge and vision-language associations of a frozen CLIP model, distinguishing it from previous research in this domain. Our approach addresses the challenge of diverse safety requirements by integrating an image captioning model to interpret scene information and a large language model to bridge the semantic gap between visual and textual data. Furthermore, we employ an open-vocabulary object detection model to refine the alignment of the image-text embeddings. The efficacy of Clip2Safety is validated through empirical experiments, where it demonstrates state-of-the-art performance on our integrated dataset. Comparative analysis with existing VQA models highlights Clip2Safety’s superior accuracy and inference efficiency: the framework not only achieves higher accuracy but also operates with far lower computational cost, making it a significant advancement in the field of safety detection.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the DARPA Young Faculty Award, the National Science Foundation (NSF) under Grants #2127780, #2319198, #2321840, #2312517, and #2235472, the Semiconductor Research Corporation (SRC), the Office of Naval Research through the Young Investigator Program Award, and Grants #N00014-21-1-2225 and #N00014-22-1-2067. Additionally, support was provided by the Air Force Office of Scientific Research under Award #FA9550-22-1-0253, along with generous gifts from Xilinx and Cisco.

References

  • Margaret etal. (2015)A.Margaret, J.R. Edwards, K.Allen-Bridson, C.Gross, P.J. Malpiedi, K.D. Peterson, D.A. Pollock, L.M. Weiner, D.M. Sievert,National healthcare safety network (nhsn) report, data summary for 2013, device-associated module,American journal of infection control 43 (2015) 206.
  • Albert and Routh (2021)L.Albert, C.Routh,Designing impactful construction safety training interventions,Safety 7 (2021) 42.
  • U.S. Bureau of Labor Statistics (2023a)U.S. Bureau of Labor Statistics, Incidence rates(1)of nonfatal occupational injuries and illnesses by industry and case types, 2022, 2023a. URL: https://www.bls.gov/web/osh/table-1-industry-rates-national.htm, accessed: 2024-08-05.
  • U.S. Bureau of Labor Statistics (2023b)U.S. Bureau of Labor Statistics, Census of fatal occupational injuries summary, 2023, https://www.bls.gov/news.release/cfoi.t04.htm, 2023b. Accessed July 16, 2024.
  • Foulis (2021)M.Foulis,7 ways to prevent workplace accidents,Canadian Occupational Safety (2021). https://www.thesafetymag.com/ca/topics/safety-and-ppe/7-ways-to-prevent-workplace-accidents/310921, Accessed July 16, 2024.
  • Occupational Safety and Health Administration (2005)Occupational Safety and Health Administration, Worker safety series: Construction, Occupational Safety and Health Administration, https://www.osha.gov/sites/default/files/publications/osha3252.pdf, 2005. Accessed July 16, 2024.
  • Barro-Torres etal. (2012)S.Barro-Torres, T.M. Fernández-Caramés, H.J. Pérez-Iglesias, C.J. Escudero,Real-time personal protective equipment monitoring system,Computer Communications 36 (2012) 42–50.
  • Nath etal. (2020)N.D. Nath, A.H. Behzadan, S.G. Paal,Deep learning for site safety: Real-time detection of personal protective equipment,Automation in construction 112 (2020) 103085.
  • Abouelyazid (2022)M.Abouelyazid,Yolov4-based deep learning approach for personal protective equipment detection,Journal of Sustainable Urban Futures 12 (2022) 1–12.
  • Safety and Administration (2023)O.Safety, H.Administration, Personal protective equipment, 2023. URL: https://www.osha.gov/sites/default/files/publications/OSHA3151.pdf, accessed: 2024-08-05.
  • Mohona etal. (2024)R.T. Mohona, S.Nawer, M.S.I. Sakib, M.N. Uddin,A yolov8 approach for personal protective equipment (ppe) detection to ensure workers’ safety,in: 2024 3rd International Conference on Advancement in Electrical and Electronic Engineering (ICAEEE), IEEE, 2024, pp. 1–6.
  • Johnson and Khoshgoftaar (2019)J.M. Johnson, T.M. Khoshgoftaar,Survey on deep learning with class imbalance,Journal of big data 6 (2019) 1–54.
  • Lu etal. (2020)X.Lu, Q.Li, B.Li, J.Yan,Mimicdet: Bridging the gap between one-stage and two-stage object detection,in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, Springer, 2020, pp. 541–557.
  • Torrey and Shavlik (2010)L.Torrey, J.Shavlik,Transfer learning,in: Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, IGI global, 2010, pp. 242–264.
  • Bordes etal. (2024)F.Bordes, R.Y. Pang, A.Ajay, A.C. Li, A.Bardes, S.Petryk, O.Mañas, Z.Lin, A.Mahmoud, B.Jayaraman, etal.,An introduction to vision-language modeling,arXiv preprint arXiv:2405.17247 (2024).
  • Kelm etal. (2013)A.Kelm, L.Laußat, A.Meins-Becker, D.Platz, M.J. Khazaee, A.M. Costin, M.Helmus, J.Teizer,Mobile passive radio frequency identification (rfid) portal for automated and rapid control of personal protective equipment (ppe) on construction sites,Automation in construction 36 (2013) 38–52.
  • Naticchia etal. (2013)B.Naticchia, M.Vaccarini, A.Carbonari,A monitoring system for real-time interference control on large construction sites,Automation in Construction 29 (2013) 148–160.
  • Zhang etal. (2015)S.Zhang, J.Teizer, N.Pradhanang,Global positioning system data to model and visualize workspace density in construction safety planning,in: ISARC. Proceedings of the International Symposium on Automation and Robotics in Construction, volume32, IAARC Publications, 2015, p.1.
  • Pisu etal. (2024)A.Pisu, N.Elia, L.Pompianu, F.Barchi, A.Acquaviva, S.Carta,Enhancing workplace safety: A flexible approach for personal protective equipment monitoring,Expert Systems with Applications 238 (2024) 122285.
  • Yang etal. (2020)X.Yang, Y.Yu, S.Shirowzhan, H.Li, etal.,Automated ppe-tool pair check system for construction safety using smart iot,Journal of Building Engineering 32 (2020) 101721.
  • Wu and Zhao (2018)H.Wu, J.Zhao,An intelligent vision-based approach for helmet identification for work safety,Computers in Industry 100 (2018) 267–277.
  • Li etal. (2018)K.Li, X.Zhao, J.Bian, M.Tan,Automatic safety helmet wearing detection,arXiv preprint arXiv:1802.00264 (2018).
  • Mneymneh etal. (2019)B.E. Mneymneh, M.Abbas, H.Khoury,Vision-based framework for intelligent monitoring of hardhat wearing on construction sites,Journal of Computing in Civil Engineering 33 (2019) 04018066.
  • LeCun etal. (2015)Y.LeCun, Y.Bengio, G.Hinton,Deep learning,nature 521 (2015) 436–444.
  • Ren etal. (2015)S.Ren, K.He, R.Girshick, J.Sun,Faster r-cnn: Towards real-time object detection with region proposal networks,Advances in neural information processing systems 28 (2015).
  • Redmon etal. (2016)J.Redmon, S.Divvala, R.Girshick, A.Farhadi,You only look once: Unified, real-time object detection,in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • Saudi etal. (2020)M.M. Saudi, A.H. Ma’arof, A.Ahmad, A.S.M. Saudi, M.H. Ali, A.Narzullaev, M.I.M. Ghazali,Image detection model for construction worker safety conditions using faster r-cnn,International Journal of Advanced Computer Science and Applications 11 (2020).
  • Chen etal. (2020)S.Chen, W.Tang, T.Ji, H.Zhu, Y.Ouyang, W.Wang,Detection of safety helmet wearing based on improved faster r-cnn,in: 2020 International joint conference on neural networks (IJCNN), IEEE, 2020, pp. 1–7.
  • Wu etal. (2019)F.Wu, G.Jin, M.Gao, H.Zhiwei, Y.Yang,Helmet detection based on improved yolo v3 deep model,in: 2019 IEEE 16th International conference on networking, sensing and control (ICNSC), IEEE, 2019, pp. 363–368.
  • Chen etal. (2023)W.Chen, C.Li, H.Guo,A lightweight face-assisted object detection model for welding helmet use,Expert Systems with Applications 221 (2023) 119764.
  • Benyang etal. (2020)D.Benyang, L.Xiaochun, Y.Miao,Safety helmet detection method based on yolo v4,in: 2020 16th International conference on computational intelligence and security (CIS), IEEE, 2020, pp. 155–158.
  • Lee etal. (2023)J.-Y. Lee, W.-S. Choi, S.-H. Choi,Verification and performance comparison of cnn-based algorithms for two-step helmet-wearing detection,Expert Systems with Applications 225 (2023) 120096.
  • Chen and Demachi (2020)S.Chen, K.Demachi,A vision-based approach for ensuring proper use of personal protective equipment (ppe) in decommissioning of fukushima daiichi nuclear power station,Applied Sciences 10 (2020) 5129.
  • Önal and Dandıl (2021)O.Önal, E.Dandıl,Object detection for safe working environments using yolov4 deep learning model,Avrupa Bilim ve Teknoloji Dergisi (2021) 343–351.
  • Fang etal. (2018)Q.Fang, H.Li, X.Luo, L.Ding, H.Luo, T.M. Rose, W.An,Detecting non-hardhat-use by a deep learning method from far-field surveillance videos,Automation in construction 85 (2018) 1–9.
  • Hung and Su (2021)P.Hung, N.Su,Unsafe construction behavior classification using deep convolutional neural network,Pattern Recognition and Image Analysis 31 (2021) 271–284.
  • Alayrac etal. (2022)J.-B. Alayrac, J.Donahue, P.Luc, A.Miech, I.Barr, Y.Hasson, K.Lenc, A.Mensch, K.Millican, M.Reynolds, etal.,Flamingo: a visual language model for few-shot learning,Advances in neural information processing systems 35 (2022) 23716–23736.
  • Radford etal. (2021)A.Radford, J.W. Kim, C.Hallacy, A.Ramesh, G.Goh, S.Agarwal, G.Sastry, A.Askell, P.Mishkin, J.Clark, etal.,Learning transferable visual models from natural language supervision,in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.
  • Cherti etal. (2023)M.Cherti, R.Beaumont, R.Wightman, M.Wortsman, G.Ilharco, C.Gordon, C.Schuhmann, L.Schmidt, J.Jitsev,Reproducible scaling laws for contrastive language-image learning,in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2818–2829.
  • Shi etal. (2022)H.Shi, M.Hayat, Y.Wu, J.Cai,Proposalclip: Unsupervised open-category object proposal generation via exploiting clip cues,in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9611–9620.
  • Mokady etal. (2021)R.Mokady, A.Hertz, A.H. Bermano,Clipcap: Clip prefix for image captioning,arXiv preprint arXiv:2111.09734 (2021).
  • Wang etal. (2021)M.Wang, J.Xing, Y.Liu,Actionclip: A new paradigm for video action recognition,arXiv preprint arXiv:2109.08472 (2021).
  • Chen etal. (2024)H.Chen, W.Huang, Y.Ni, S.Yun, F.Wen, H.Latapie, M.Imani,Taskclip: Extend large vision-language model for task oriented object detection,arXiv preprint arXiv:2403.08108 (2024).
  • Jeong etal. (2023)J.Jeong, Y.Zou, T.Kim, D.Zhang, A.Ravichandran, O.Dabeer,Winclip: Zero-/few-shot anomaly classification and segmentation,in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19606–19616.
  • Liang etal. (2023)F.Liang, B.Wu, X.Dai, K.Li, Y.Zhao, H.Zhang, P.Zhang, P.Vajda, D.Marculescu,Open-vocabulary semantic segmentation with mask-adapted clip,in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7061–7070.
  • Zhou etal. (2023)Z.Zhou, Y.Lei, B.Zhang, L.Liu, Y.Liu,Zegclip: Towards adapting clip for zero-shot semantic segmentation,in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11175–11185.
  • Gil and Lee (2024)D.Gil, G.Lee,Zero-shot monitoring of construction workers’ personal protective equipment based on image captioning,Automation in Construction 164 (2024) 105470.
  • Bulian etal. (2022)J.Bulian, C.Buck, W.Gajewski, B.Boerschinger, T.Schuster,Tomayto, tomahto. beyond token-level answer equivalence for question answering evaluation,arXiv preprint arXiv:2202.07654 (2022).
  • Goyal etal. (2017)Y.Goyal, T.Khot, D.Summers-Stay, D.Batra, D.Parikh,Making the v in vqa matter: Elevating the role of image understanding in visual question answering,in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6904–6913.
  • Hudson and Manning (2019)D.A. Hudson, C.D. Manning,Gqa: A new dataset for real-world visual reasoning and compositional question answering,in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6700–6709.
  • Krishna etal. (2017)R.Krishna, Y.Zhu, O.Groth, J.Johnson, K.Hata, J.Kravitz, S.Chen, Y.Kalantidis, L.-J. Li, D.A. Shamma, etal.,Visual genome: Connecting language and vision using crowdsourced dense image annotations,International journal of computer vision 123 (2017) 32–73.
  • Dai etal. (2023)W.Dai, J.Li, D.Li, A.M.H. Tiong, J.Zhao, W.Wang, B.Li, P.Fung, S.Hoi, Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. URL: https://arxiv.org/abs/2305.06500. arXiv:2305.06500.
  • Ding etal. (2022)Y.Ding, M.Liu, X.Luo,Safety compliance checking of construction behaviors using visual question answering,Automation in Construction 144 (2022) 104580.
  • Hu etal. (2021)E.J. Hu, Y.Shen, P.Wallis, Z.Allen-Zhu, Y.Li, S.Wang, L.Wang, W.Chen,Lora: Low-rank adaptation of large language models,arXiv preprint arXiv:2106.09685 (2021).
  • Roboflow (2023)Roboflow, Safety detection dataset, https://universe.roboflow.com/roboflow-ngkro/safety-detection-ge25h, 2023. URL: https://universe.roboflow.com/roboflow-ngkro/safety-detection-ge25h.
  • Equipment (2022)P.P. Equipment, Ppes dataset, https://universe.roboflow.com/personal-protective-equipment/ppes-kaxsi, 2022. URL: https://universe.roboflow.com/personal-protective-equipment/ppes-kaxsi.
  • Dagli and Shaikh (2021)R.Dagli, A.M. Shaikh, Cppe-5: Medical personal protective equipment dataset, 2021. arXiv:2112.09569.
  • Xu etal. (2023)P.Xu, W.Shao, K.Zhang, P.Gao, S.Liu, M.Lei, F.Meng, S.Huang, Y.Qiao, P.Luo,Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models,arXiv preprint arXiv:2306.09265 (2023).
  • Li etal. (2022)J.Li, D.Li, C.Xiong, S.Hoi,Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,in: International conference on machine learning, PMLR, 2022, pp. 12888–12900.
  • Li etal. (2023)J.Li, D.Li, S.Savarese, S.Hoi,Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,in: International conference on machine learning, PMLR, 2023, pp. 19730–19742.
  • Liu etal. (2024)H.Liu, C.Li, Q.Wu, Y.J. Lee,Visual instruction tuning,Advances in neural information processing systems 36 (2024).
  • Liu etal. (2016)W.Liu, D.Anguelov, D.Erhan, C.Szegedy, S.Reed, C.-Y. Fu, A.C. Berg,Ssd: Single shot multibox detector,in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, Springer, 2016, pp. 21–37.
  • Minderer etal. (2022)M.Minderer, A.Gritsenko, A.Stone, M.Neumann, D.Weissenborn, A.Dosovitskiy, A.Mahendran, A.Arnab, M.Dehghani, Z.Shen, etal.,Simple open-vocabulary object detection,in: European Conference on Computer Vision, Springer, 2022, pp. 728–755.
  • Cheng etal. (2024)T.Cheng, L.Song, Y.Ge, W.Liu, X.Wang, Y.Shan,Yolo-world: Real-time open-vocabulary object detection,in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16901–16911.
  • Jain etal. (2009)P.Jain, J.M. Garibaldi, J.D. Hirst,Supervised machine learning algorithms for protein structure classification,Computational biology and chemistry 33 (2009) 216–223.