Existing camouflaged object detection (COD) methods rely heavily on large-scale datasets with pixel-wise annotations. However, because object boundaries are ambiguous, annotating camouflaged objects pixel by pixel is very time-consuming and labor-intensive, taking about 60 minutes to label one image.
Weakly-supervised COD
We propose the first weakly-supervised COD method, using scribble annotations as supervision.
We propose a novel consistency loss composed of two parts: a cross-view loss to attain reliable consistency over different images, and an inside-view loss to maintain consistency inside a single prediction map.
We further propose a feature-guided loss, which includes visual features directly extracted from images and semantically significant features captured by the model.
Feature-guided Loss
We design a feature-guided loss based on both simple visual features (context affinity loss) and complex semantic features (semantic significance loss).
Context Affinity Loss
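Nearby pixels with similar features tend to belong to the same class. We adopt a kernel method to measure visual feature similarity (colors and positions):
$$K_{vis}(i,j) = \exp\left(-\frac{\lVert S(i)-S(j)\rVert^2}{2\sigma_S^2} - \frac{\lVert C(i)-C(j)\rVert^2}{2\sigma_C^2}\right), \qquad D(i,j) = 1 - P_i P_j - (1-P_i)(1-P_j),$$
where $S(i)$ and $C(i)$ are the position and color of pixel $i$, and $\sigma_S$, $\sigma_C$ are hyperparameters. $D(i,j)$ calculates the probability that pixels $i$ and $j$ have different classes ($P_i$ is the probability of the positive label for pixel $i$).
The context affinity loss encourages visually dissimilar pixels to take different labels, and vice versa:
$$\mathcal{L}_{ca} = \frac{1}{M}\sum_{i}\frac{1}{|K_d(i)|}\sum_{j\in K_d(i)} K_{vis}(i,j)\,D(i,j),$$
where $M$ is the total number of pixels and $K_d(i)$ is the neighbor region ($d$ is set to 5 in our experiments) of center pixel $i$. Through the context affinity loss, the model can quickly learn from the unlabeled pixels.
In other words, pixel affinity is used to supervise class similarity. This step introduces a hand-crafted prior, and it also limits how far the model's accuracy can improve.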
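The computation described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian kernel over squared position and color distances, a square window of side `d` clipped at the image border, and averaging over the neighbors in each window. The function name and the default values of `sigma_s` and `sigma_c` are hypothetical.

```python
import math

def context_affinity_loss(prob, image, d=5, sigma_s=6.0, sigma_c=0.1):
    """Sketch of a context affinity loss.

    prob:  H x W nested list of foreground probabilities in [0, 1].
    image: H x W nested list of color tuples with channels in [0, 1].
    sigma_s and sigma_c are illustrative values, not taken from the paper.
    """
    h, w = len(prob), len(prob[0])
    r = d // 2
    total = 0.0
    for y in range(h):
        for x in range(w):
            acc, count = 0.0, 0
            # Visit the d x d window around (y, x), clipped to the image.
            for yy in range(max(0, y - r), min(h, y + r + 1)):
                for xx in range(max(0, x - r), min(w, x + r + 1)):
                    if (yy, xx) == (y, x):
                        continue
                    pos = (y - yy) ** 2 + (x - xx) ** 2
                    col = sum((a - b) ** 2
                              for a, b in zip(image[y][x], image[yy][xx]))
                    # Kernel: large when the pixels are close and look alike.
                    k = math.exp(-pos / (2 * sigma_s ** 2)
                                 - col / (2 * sigma_c ** 2))
                    p, q = prob[y][x], prob[yy][xx]
                    # Probability that the two pixels get different classes.
                    dis = 1 - p * q - (1 - p) * (1 - q)
                    acc += k * dis
                    count += 1
            total += acc / count if count else 0.0
    return total / (h * w)
```

On an image split into a dark half and a bright half, a prediction that follows the color boundary yields a loss near zero, while a prediction that cuts across a uniformly colored region is penalized.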
Semantic Significance Loss
The semantic significance loss has a formulation similar to the context affinity loss, but it measures similarity on the semantic features captured by the model rather than on colors. It is evaluated over $\Omega$, the valid boundary regions (confidently classified pixels), and its weight $\beta$ is set to increase with the epoch number (exponential ramp-up to 0.15 in practice), since the model has not yet learned well-represented features at the beginning.
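One common way to realize an exponential ramp-up is the schedule `max * exp(-5 * (1 - t)^2)`, where `t` is the fraction of the ramp-up period that has elapsed. The sketch below assumes that schedule; the notes only state that the value ramps up exponentially to 0.15, so the exact curve, the function name, and the `rampup_epochs` parameter are assumptions.

```python
import math

def exp_rampup(epoch, rampup_epochs, max_weight=0.15):
    """Ramp a weight from near zero up to max_weight over rampup_epochs.

    Uses max_weight * exp(-5 * (1 - t)^2) with t clipped to [0, 1].
    The curve shape is an assumption; only the 0.15 ceiling comes from the notes.
    """
    if rampup_epochs <= 0:
        return max_weight
    t = min(max(epoch / rampup_epochs, 0.0), 1.0)
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)

# Example: inspect the weight at a few epochs of a hypothetical 40-epoch ramp.
schedule = [exp_rampup(e, 40) for e in (0, 10, 20, 30, 40)]
```

The weight stays small while the features are still unreliable and reaches its ceiling at the end of the ramp-up period.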
In conclusion, the feature-guided loss can be written as the sum of the two losses above.
This is similar to the Local Saliency Coherence Loss in (AAAI2021-SCWSSOD) Structure-Consistent Weakly Supervised Salient Object Detection with Local Saliency Coherence, with one term replaced by another.
We propose the cross-view (CV) consistency loss to alleviate the problem by minimizing the difference between the predictions of the input and its transform.
Cross-View Consistency Loss
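$P^1$ and $P^2$ are the prediction maps of the input and its transform. $M$ is the total number of pixels and $i$ is a pixel index.
We want the prediction of the transform, $P^2$, to be pushed more than that of the normal input, $P^1$. The key is to weight their backward gradients differently, so the proposed cross-view consistency loss can be written as
$$\mathcal{L}_{cv} = (1-\gamma)\,\mathcal{R}(\bar{P}^1, P^2) + \gamma\,\mathcal{R}(P^1, \bar{P}^2),$$
where $\mathcal{R}$ is the original consistency loss between the two predictions and the bar denotes detach, i.e. no gradient flows through that argument.
If $\gamma = 0.5$, it is the original loss $\mathcal{R}$; if $\gamma < 0.5$, the backward gradient that pushes $P^2$ to $P^1$ is greater than the other way around, and thus the goal is reached. In practice, $\gamma$ is set to 0.3.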
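The effect of the asymmetric weighting can be checked with a small sketch. It assumes the loss has the form `(1 - gamma) * R(detach(p1), p2) + gamma * R(p1, detach(p2))`, where `p1` is the prediction for the input and `p2` the prediction for its transform. A mean squared difference serves as a stand-in for the consistency measure `R` (not the measure used in the paper), and the gradients are written out by hand rather than computed with an autograd library. All names are hypothetical.

```python
def cross_view_loss_and_grads(p1, p2, gamma=0.3):
    """Asymmetrically weighted consistency loss with hand-derived gradients.

    p1, p2: flat lists of per-pixel probabilities (input, transformed input).
    R is a mean squared difference here, chosen only for illustration.
    Returns (loss, grad wrt p1, grad wrt p2).
    """
    n = len(p1)
    diffs = [a - b for a, b in zip(p1, p2)]
    r = sum(d * d for d in diffs) / n
    # Both terms have the same value; they differ only in where gradients flow.
    loss = (1 - gamma) * r + gamma * r
    # Term with p2 detached: gradient reaches p1 only, scaled by gamma.
    grad_p1 = [gamma * 2 * d / n for d in diffs]
    # Term with p1 detached: gradient reaches p2 only, scaled by (1 - gamma).
    grad_p2 = [-(1 - gamma) * 2 * d / n for d in diffs]
    return loss, grad_p1, grad_p2

# Example: with gamma = 0.3 the transform prediction receives the larger update.
loss, g1, g2 = cross_view_loss_and_grads([0.9, 0.2], [0.5, 0.6], gamma=0.3)
```

With `gamma = 0.5` both predictions receive gradients of equal magnitude; lowering `gamma` shifts the correction onto the prediction of the transform.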
Inside-view Consistency Loss
When the entropy is above a certain threshold, the prediction is uncertain, and it is harmful to push the model toward higher certainty in this case.
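A thresholded entropy term of this kind can be sketched as follows. The sketch assumes binary (foreground/background) predictions, entropy measured in nats, and a mean taken over the pixels that pass the threshold. The threshold value of 0.5 and the function names are hypothetical and not taken from the notes.

```python
import math

def binary_entropy(p, eps=1e-7):
    """Entropy (in nats) of a Bernoulli distribution with parameter p."""
    p = min(max(p, eps), 1 - eps)  # avoid log(0)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def inside_view_loss(probs, threshold=0.5):
    """Mean entropy over pixels whose entropy is below the threshold.

    Pixels above the threshold are considered too uncertain and are
    excluded, so the loss never sharpens them. Returns 0.0 if no pixel
    qualifies. The threshold value is an illustrative assumption.
    """
    entropies = [binary_entropy(p) for p in probs]
    kept = [e for e in entropies if e < threshold]
    return sum(kept) / len(kept) if kept else 0.0

# Example: the 0.5 pixel is skipped, the 0.9 pixel contributes its entropy.
value = inside_view_loss([0.5, 0.9])
```

Minimizing this quantity sharpens predictions that are already fairly confident while leaving highly uncertain pixels untouched.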