rowid,titles,summaries,terms
1,Survey on Semantic Stereo Matching / Semantic Depth Estimation,"Stereo matching is one of the widely used techniques for inferring depth from
stereo images owing to its robustness and speed. It has become one of the major
topics of research since it finds its applications in autonomous driving,
robotic navigation, 3D reconstruction, and many other fields. Finding pixel
correspondences in non-textured, occluded and reflective areas is the major
challenge in stereo matching. Recent developments have shown that semantic cues
from image segmentation can be used to improve the results of stereo matching.
Many deep neural network architectures have been proposed to leverage the
advantages of semantic segmentation in stereo matching. This paper aims to give
a comparison among the state-of-the-art networks in terms of both accuracy and
speed, which are of high importance in real-time applications.","['cs.CV', 'cs.LG']"
2,FUTURE-AI: Guiding Principles and Consensus Recommendations for Trustworthy Artificial Intelligence in Future Medical Imaging,"The recent advancements in artificial intelligence (AI) combined with the
extensive amount of data generated by today's clinical systems, have led to the
development of imaging AI solutions across the whole value chain of medical
imaging, including image reconstruction, medical image segmentation,
image-based diagnosis and treatment planning. Notwithstanding the successes and
future potential of AI in medical imaging, many stakeholders are concerned about
the potential risks and ethical implications of imaging AI solutions, which are
perceived as complex, opaque, and difficult to comprehend, utilise, and trust
in critical clinical applications. Despite these concerns and risks, there are
currently no concrete guidelines and best practices for guiding future AI
developments in medical imaging towards increased trust, safety and adoption.
To bridge this gap, this paper introduces a careful selection of guiding
principles drawn from the accumulated experiences, consensus, and best
practices from five large European projects on AI in Health Imaging. These
guiding principles are named FUTURE-AI and its building blocks consist of (i)
Fairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness
and (vi) Explainability. In a step-by-step approach, these guidelines are
further translated into a framework of concrete recommendations for specifying,
developing, evaluating, and deploying technically, clinically and ethically
trustworthy AI solutions into clinical practice.","['cs.CV', 'cs.AI', 'cs.LG']"
3,Enforcing Mutual Consistency of Hard Regions for Semi-supervised Medical Image Segmentation,"In this paper, we proposed a novel mutual consistency network (MC-Net+) to
effectively exploit the unlabeled hard regions for semi-supervised medical
image segmentation. The MC-Net+ model is motivated by the observation that deep
models trained with limited annotations are prone to output highly uncertain
and easily mis-classified predictions in the ambiguous regions (e.g. adhesive
edges or thin branches) for the image segmentation task. Leveraging these
region-level challenging samples can make the semi-supervised segmentation
model training more effective. Therefore, our proposed MC-Net+ model consists
of two new designs. First, the model contains one shared encoder and multiple
slightly different decoders (i.e. using different up-sampling strategies). The
statistical discrepancy of multiple decoders' outputs is computed to denote the
model's uncertainty, which indicates the unlabeled hard regions. Second, a new
mutual consistency constraint is enforced between one decoder's probability
output and other decoders' soft pseudo labels. In this way, we minimize the
model's uncertainty during training and force the model to generate invariant
and low-entropy results in such challenging areas of unlabeled data, in order
to learn a generalized feature representation. We compared the segmentation
results of the MC-Net+ with five state-of-the-art semi-supervised approaches on
three public medical datasets. Extension experiments with two common
semi-supervised settings demonstrate the superior performance of our model over
other existing methods, which sets a new state of the art for semi-supervised
medical image segmentation.","['cs.CV', 'cs.AI']"
4,Parameter Decoupling Strategy for Semi-supervised 3D Left Atrium Segmentation,"Consistency training has proven to be an advanced semi-supervised framework
and achieved promising results in medical image segmentation tasks through
enforcing an invariance of the predictions over different views of the inputs.
However, with the iterative updating of model parameters, the models would tend
to reach a coupled state and eventually lose the ability to exploit unlabeled
data. To address the issue, we present a novel semi-supervised segmentation
model based on parameter decoupling strategy to encourage consistent
predictions from diverse views. Specifically, we first adopt a two-branch
network to simultaneously produce predictions for each image. During the
training process, we decouple the two prediction branch parameters by quadratic
cosine distance to construct different views in latent space. Based on this,
the feature extractor is constrained to encourage the consistency of
probability maps generated by classifiers under diversified features. In the
overall training process, the parameters of feature extractor and classifiers
are updated alternately by consistency regularization operation and decoupling
operation to gradually improve the generalization performance of the model. Our
method has achieved a competitive result over the state-of-the-art
semi-supervised methods on the Atrial Segmentation Challenge dataset,
demonstrating the effectiveness of our framework. Code is available at
https://github.com/BX0903/PDC.",['cs.CV']
5,Background-Foreground Segmentation for Interior Sensing in Automotive Industry,"To ensure safety in automated driving, the correct perception of the
situation inside the car is as important as its environment. Thus, seat
occupancy detection and classification of detected instances play an important
role in interior sensing. Given knowledge of the seat occupancy status, it is
possible to, e.g., automate the airbag deployment control. Furthermore, the
presence of a driver, which is necessary for partially automated cars
at automation levels two to four, can be verified. In this work, we compare
different statistical methods from the field of image segmentation to approach
the problem of background-foreground segmentation in camera based interior
sensing. In recent years, several methods based on different techniques
have been developed and applied to images or videos from different
applications. The peculiarity of the given interior sensing scenarios is
that the foreground instances and the background both contain static as well as
dynamic elements. In the data considered in this work, even the camera position is
not completely fixed. We review and benchmark three different methods,
i.e., Gaussian Mixture Models (GMM), Morphological Snakes and a deep neural
network, namely a Mask R-CNN. In particular, the limitations of the classical
methods, GMM and Morphological Snakes, for interior sensing are shown.
Furthermore, it turns out that it is possible to overcome these limitations by
deep learning, e.g., using a Mask R-CNN. Although only a small amount of ground
truth data was available for training, we enabled the Mask R-CNN to produce
high quality background-foreground masks via transfer learning. Moreover, we
demonstrate that certain augmentation as well as pre- and post-processing
methods further enhance the performance of the investigated methods.","['cs.CV', 'cs.LG']"
6,EdgeFlow: Achieving Practical Interactive Segmentation with Edge-Guided Flow,"High-quality training data play a key role in image segmentation tasks.
Usually, pixel-level annotations are expensive, laborious and time-consuming
for the large volume of training data. To reduce labelling cost and improve
segmentation quality, interactive segmentation methods have been proposed,
which provide the result with just a few clicks. However, their performance
does not meet the requirements of practical segmentation tasks in terms of
speed and accuracy. In this work, we propose EdgeFlow, a novel architecture
that fully utilizes interactive information of user clicks with edge-guided
flow. Our method achieves state-of-the-art performance without any
post-processing or iterative optimization scheme. Comprehensive experiments on
benchmarks also demonstrate the superiority of our method. In addition, with
the proposed method, we develop an efficient interactive segmentation tool for
practical data annotation tasks. The source code and tool are available at
https://github.com/PaddlePaddle/PaddleSeg.","['cs.CV', 'cs.HC']"
7,Efficient Hybrid Transformer: Learning Global-local Context for Urban Scene Segmentation,"Semantic segmentation of fine-resolution urban scene images plays a vital
role in extensive practical applications, such as land cover mapping, urban
change detection, environmental protection and economic assessment. Driven by
rapid developments in deep learning technologies, convolutional neural networks
(CNNs) have dominated the semantic segmentation task for many years.
Convolutional neural networks adopt hierarchical feature representation and
have strong local context extraction. However, the local property of the
convolution layer limits the network from capturing global information that is
crucial for improving fine-resolution image segmentation. Recently, Transformers
have become a hot topic in the computer vision domain. Vision Transformer
demonstrates the great capability of global information modelling, boosting
many vision tasks, such as image classification, object detection and
especially semantic segmentation. In this paper, we propose an efficient hybrid
Transformer (EHT) for semantic segmentation of urban scene images. EHT takes
advantage of CNNs and Transformer, learning global-local context to strengthen
the feature representation. Extensive experiments demonstrate that EHT has
higher efficiency with competitive accuracy compared with state-of-the-art
benchmark methods. Specifically, the proposed EHT achieves a 67.0% mIoU on the
UAVid test set and outperforms other lightweight models significantly. The code
will be available soon.",['cs.CV']
8,Towards to Robust and Generalized Medical Image Segmentation Framework,"To mitigate the radiologist's workload, computer-aided diagnosis with the
capability to review and analyze medical images is gradually deployed. Deep
learning-based region of interest segmentation is among the most exciting use
cases. However, this paradigm is restricted in real-world clinical applications
due to poor robustness and generalization. The issue is more sinister with a
lack of training data. In this paper, we address the challenge from the
representation learning point of view. We observe that collapsed
representations, one of the main causes of poor robustness and
generalization, can be avoided through transfer learning. Therefore, we
propose a novel two-stage framework for robust generalized segmentation. In
particular, an unsupervised Tile-wise AutoEncoder (T-AE) pretraining
architecture is coined to learn meaningful representation for improving the
generalization and robustness of the downstream tasks. Furthermore, the learned
knowledge is transferred to the segmentation benchmark. Coupled with an image
reconstruction network, the representation continues to be decoded, encouraging the
model to capture more semantic features. Experiments on lung segmentation with
multiple chest X-ray datasets are conducted. Empirically, the related experimental
results demonstrate the superior generalization capability of the proposed
framework on unseen domains in terms of high performance and robustness to
corruption, especially in the scenario of limited training data.","['cs.CV', 'cs.AI']"
9,Semi-supervised Meta-learning with Disentanglement for Domain-generalised Medical Image Segmentation,"Generalising deep models to new data from new centres (termed here domains)
remains a challenge. This is largely attributed to shifts in data statistics
(domain shifts) between source and unseen domains. Recently, gradient-based
meta-learning approaches where the training data are split into meta-train and
meta-test sets to simulate and handle the domain shifts during training have
shown improved generalisation performance. However, the current fully
supervised meta-learning approaches are not scalable for medical image
segmentation, where large effort is required to create pixel-wise annotations.
Meanwhile, in a low data regime, the simulated domain shifts may not
approximate the true domain shifts well across source and unseen domains. To
address this problem, we propose a novel semi-supervised meta-learning
framework with disentanglement. We explicitly model the representations related
to domain shifts. Disentangling the representations and combining them to
reconstruct the input image allows unlabeled data to be used to better
approximate the true domain shifts for meta-learning. Hence, the model can
achieve better generalisation performance, especially when there is a limited
amount of labeled data. Experiments show that the proposed method is robust on
different segmentation tasks and achieves state-of-the-art generalisation
performance on two public benchmarks.",['cs.CV']
10,Semi-supervised Contrastive Learning for Label-efficient Medical Image Segmentation,"The success of deep learning methods in medical image segmentation tasks
heavily depends on a large amount of labeled data to supervise the training. On
the other hand, the annotation of biomedical images requires domain knowledge
and can be laborious. Recently, contrastive learning has demonstrated great
potential in learning latent representation of images even without any label.
Existing works have explored its application to biomedical image segmentation
where only a small portion of data is labeled, through a pre-training phase
based on self-supervised contrastive learning without using any labels followed
by a supervised fine-tuning phase on the labeled portion of data only. In this
paper, we establish that by including the limited label information in the
pre-training phase, it is possible to boost the performance of contrastive
learning. We propose a supervised local contrastive loss that leverages limited
pixel-wise annotation to force pixels with the same label to gather around in
the embedding space. Such loss needs pixel-wise computation which can be
expensive for large images, and we further propose two strategies, downsampling
and block division, to address the issue. We evaluate our methods on two public
biomedical image datasets of different modalities. With different amounts of
labeled data, our methods consistently outperform the state-of-the-art
contrast-based methods and other semi-supervised learning techniques.",['cs.CV']
11,Direct Estimation of Appearance Models for Segmentation,"Image segmentation algorithms often depend on appearance models that
characterize the distribution of pixel values in different image regions. We
describe a new approach for estimating appearance models directly from an
image, without explicit consideration of the pixels that make up each region.
Our approach is based on novel algebraic expressions that relate local image
statistics to the appearance of spatially coherent regions. We describe two
algorithms that can use the aforementioned algebraic expressions to estimate
appearance models directly from an image. The first algorithm solves a system
of linear and quadratic equations using a least squares formulation. The second
algorithm is a spectral method based on an eigenvector computation. We present
experimental results that demonstrate the proposed methods work well in
practice and lead to effective image segmentation algorithms.","['cs.CV', '68U10, 62M05, 62H30, 65C20']"
12,MISSFormer: An Effective Medical Image Segmentation Transformer,"The CNN-based methods have achieved impressive results in medical image
segmentation, but they fail to capture long-range dependencies due to the
inherent locality of the convolution operation. Transformer-based methods have recently become
popular in vision tasks because of their capacity to model long-range
dependencies, and they achieve promising performance. However, they are weak at modeling
local context. Although some works have attempted to embed convolutional layers to
overcome this problem and achieved some improvement, doing so makes the features
inconsistent and fails to leverage the natural multi-scale features of
hierarchical transformers, which limits the performance of models. In this paper,
taking medical image segmentation as an example, we present MISSFormer, an
effective and powerful Medical Image Segmentation tranSFormer. MISSFormer is a
hierarchical encoder-decoder network and has two appealing designs: 1) A feed
forward network is redesigned with the proposed Enhanced Transformer Block,
which makes features aligned adaptively and enhances the long-range
dependencies and local context. 2) We proposed Enhanced Transformer Context
Bridge, a context bridge with the enhanced transformer block to model the
long-range dependencies and local context of multi-scale features generated by
our hierarchical transformer encoder. Driven by these two designs, the
MISSFormer shows strong capacity to capture more valuable dependencies and
context in medical image segmentation. The experiments on multi-organ and
cardiac segmentation tasks demonstrate the superiority, effectiveness and
robustness of our MISSFormer. The experimental results of MISSFormer trained
from scratch even outperform state-of-the-art methods pretrained on ImageNet,
and the core designs can be generalized to other visual segmentation tasks. The
code will be released on GitHub.",['cs.CV']
13,Neural Architecture Search in operational context: a remote sensing case-study,"Deep learning has become in recent years a cornerstone tool fueling key
innovations in the industry, such as autonomous driving. To attain good
performances, the neural network architecture used for a given application must
be chosen with care. These architectures are often handcrafted and therefore
prone to human biases and sub-optimal selection. Neural Architecture Search
(NAS) is a framework introduced to mitigate such risks by jointly optimizing
the network architecture and its weights. Despite its novelty, it has been applied
to complex tasks with significant results - e.g. semantic image segmentation.
In this technical paper, we aim to evaluate its ability to tackle a challenging
operational task: semantic segmentation of objects of interest in satellite
imagery. Designing a NAS framework is not trivial and depends strongly
on hardware constraints. We therefore motivate our NAS approach selection and
provide corresponding implementation details. We also present novel ideas to
carry out other such use-case studies.","['cs.CV', 'cs.NE']"
14,Patch-based medical image segmentation using Quantum Tensor Networks,"Tensor networks are efficient factorisations of high dimensional tensors into
a network of lower order tensors. They have been most commonly used to model
entanglement in quantum many-body systems and more recently are witnessing
increased applications in supervised machine learning. In this work, we
formulate image segmentation in a supervised setting with tensor networks. The
key idea is to first lift the pixels in image patches to exponentially high
dimensional feature spaces and then use a linear decision hyper-plane to classify
the input pixels into foreground and background classes. The high dimensional
linear model itself is approximated using the matrix product state (MPS) tensor
network. The MPS is weight-shared between the non-overlapping image patches
resulting in our strided tensor network model. The performance of the proposed
model is evaluated on three 2D- and one 3D- biomedical imaging datasets. The
performance of the proposed tensor network segmentation model is compared with
relevant baseline methods. In the 2D experiments, the tensor network model
yields competitive performance compared to the baseline methods while being
more resource efficient.",['cs.CV']
15,Combo Loss: Handling Input and Output Imbalance in Multi-Organ Segmentation,"Simultaneous segmentation of multiple organs from different medical imaging
modalities is a crucial task as it can be utilized for computer-aided
diagnosis, computer-assisted surgery, and therapy planning. Thanks to the
recent advances in deep learning, several deep neural networks for medical
image segmentation have been introduced successfully for this purpose. In this
paper, we focus on learning a deep multi-organ segmentation network that labels
voxels. In particular, we examine the critical choice of a loss function in
order to handle the notorious imbalance problem that plagues both the input and
output of a learning model. The input imbalance refers to the class-imbalance
in the input training samples (i.e., small foreground objects embedded in an
abundance of background voxels, as well as organs of varying sizes). The output
imbalance refers to the imbalance between the false positives and false
negatives of the inference model. In order to tackle both types of imbalance
during training and inference, we introduce a new curriculum learning based
loss function. Specifically, we leverage Dice similarity coefficient to deter
model parameters from being held at bad local minima and at the same time
gradually learn better model parameters by penalizing for false
positives/negatives using a cross entropy term. We evaluated the proposed loss
function on three datasets: whole body positron emission tomography (PET) scans
with 5 target organs, magnetic resonance imaging (MRI) prostate scans, and
ultrasound echocardiography images with a single target organ, i.e., the left
ventricle. We show that a simple network architecture with the proposed
integrative loss function can outperform state-of-the-art methods and results
of the competing methods can be improved when our proposed loss is used.",['cs.CV']
16,POPCORN: Progressive Pseudo-labeling with Consistency Regularization and Neighboring,"Semi-supervised learning (SSL) uses unlabeled data to compensate for the
scarcity of annotated images and the lack of method generalization to unseen
domains, two usual problems in medical segmentation tasks. In this work, we
propose POPCORN, a novel method combining consistency regularization and
pseudo-labeling designed for image segmentation. The proposed framework uses
high-level regularization to constrain our segmentation model to use similar
latent features for images with similar segmentations. POPCORN estimates a
proximity graph to select data from easiest ones to more difficult ones, in
order to ensure accurate pseudo-labeling and to limit confirmation bias.
Applied to multiple sclerosis lesion segmentation, our method demonstrates
competitive results compared to other state-of-the-art SSL strategies.",['cs.CV']
17,HCDG: A Hierarchical Consistency Framework for Domain Generalization on Medical Image Segmentation,"Modern deep neural networks struggle to transfer knowledge and generalize
across domains when deployed to real-world applications. Domain generalization
(DG) aims to learn a universal representation from multiple source domains to
improve the network generalization ability on unseen target domains. Previous
DG methods mostly focus on the data-level consistency scheme to advance the
generalization capability of deep networks, without considering the synergistic
regularization of different consistency schemes. In this paper, we present a
novel Hierarchical Consistency framework for Domain Generalization (HCDG) by
ensembling Extrinsic Consistency and Intrinsic Consistency. Particularly, for
Extrinsic Consistency, we leverage the knowledge across multiple source domains
to enforce data-level consistency. Also, we design a novel Amplitude
Gaussian-mixing strategy for Fourier-based data augmentation to enhance such
consistency. For Intrinsic Consistency, we perform task-level consistency for
the same instance under the dual-task form. We evaluate the proposed HCDG
framework on two medical image segmentation tasks, i.e., optic cup/disc
segmentation on fundus images and prostate MRI segmentation. Extensive
experimental results manifest the effectiveness and versatility of our HCDG
framework. Code will be available once accepted.",['cs.CV']
18,Pyramid Medical Transformer for Medical Image Segmentation,"Deep neural networks have been a prevailing technique in the field of medical
image processing. However, the most popular convolutional neural networks
(CNNs) based methods for medical image segmentation are imperfect because they
model long-range dependencies by stacking layers or enlarging filters.
Transformers and the self-attention mechanism are recently proposed to
effectively learn long-range dependencies by modeling all pairs of word-to-word
attention regardless of their positions. The idea has also been extended to the
computer vision field by creating and treating image patches as embeddings.
Considering the computation complexity for whole image self-attention, current
transformer-based models settle for a rigid partitioning scheme that
potentially loses informative relations. Besides, current medical transformers
model global context on full resolution images, leading to unnecessary
computation costs. To address these issues, we developed a novel method to
integrate multi-scale attention and CNN feature extraction using a pyramidal
network architecture, namely Pyramid Medical Transformer (PMTrans). The PMTrans
captured multi-range relations by working on multi-resolution images. An
adaptive partitioning scheme was implemented to retain informative relations
and to access different receptive fields efficiently. Experimental results on
three medical image datasets (gland segmentation, MoNuSeg, and HECKTOR
datasets) showed that PMTrans outperformed the latest CNN-based and
transformer-based models for medical image segmentation.",['cs.CV']
19,Temporally Coherent Person Matting Trained on Fake-Motion Dataset,"We propose a novel neural-network-based method to perform matting of videos
depicting people that does not require additional user input such as trimaps.
Our architecture achieves temporal stability of the resulting alpha mattes by
using motion-estimation-based smoothing of image-segmentation algorithm
outputs, combined with convolutional-LSTM modules on U-Net skip connections.
  We also propose a fake-motion algorithm that generates training clips for the
video-matting network given photos with ground-truth alpha mattes and
background videos. We apply random motion to photos and their mattes to
simulate movement one would find in real videos and composite the result with
the background clips. It lets us train a deep neural network operating on
videos in the absence of a large annotated video dataset and provides
ground-truth training-clip foreground optical flow for use in loss functions.",['cs.CV']
20,MEAL: Manifold Embedding-based Active Learning,"Image segmentation is a common and challenging task in autonomous driving.
Availability of sufficient pixel-level annotations for the training data is a
hurdle. Active learning helps learning from small amounts of data by suggesting
the most promising samples for labeling. In this work, we propose a new
pool-based method for active learning, which proposes promising patches
extracted from full images in each acquisition step. The problem is framed in
an exploration-exploitation framework by combining an embedding based on
Uniform Manifold Approximation to model representativeness with entropy as
uncertainty measure to model informativeness. We applied our proposed method to
the autonomous driving datasets CamVid and Cityscapes and performed a
quantitative comparison with state-of-the-art baselines. We find that our
active learning method achieves better performance compared to previous
methods.","['cs.CV', 'cs.LG']"
21,UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer,"Most recent semantic segmentation methods adopt a U-Net framework with an
encoder-decoder architecture. It is still challenging for U-Net with a simple
skip connection scheme to model the global multi-scale context: 1) Not every
skip connection setting is effective, due to incompatible feature
sets of the encoder and decoder stages; some skip connections even negatively
influence the segmentation performance; 2) The original U-Net is worse than the
one without any skip connection on some datasets. Based on our findings, we
propose a new segmentation framework, named UCTransNet (with a proposed CTrans
module in U-Net), from the channel perspective with attention mechanism.
Specifically, the CTrans module is an alternative to the U-Net skip connections,
which consists of a sub-module to conduct the multi-scale Channel Cross fusion
with Transformer (named CCT) and a sub-module Channel-wise Cross-Attention
(named CCA) to guide the fused multi-scale channel-wise information to
effectively connect to the decoder features for eliminating the ambiguity.
Hence, the proposed connection consisting of the CCT and CCA is able to replace
the original skip connection to solve the semantic gaps for an accurate
automatic medical image segmentation. The experimental results suggest that our
UCTransNet produces more precise segmentation performance and achieves
consistent improvements over the state-of-the-art for semantic segmentation
across different datasets and conventional architectures involving transformer
or U-shaped framework. Code: https://github.com/McGregorWwww/UCTransNet.","['cs.CV', 'cs.LG', 'eess.IV']"
22,Attention-Based 3D Seismic Fault Segmentation Training by a Few 2D Slice Labels,"Detecting faults in seismic data is a crucial step for seismic structural
interpretation, reservoir characterization and well placement. Some recent
works regard it as an image segmentation task. The task of image segmentation
requires a huge number of labels, especially for 3D seismic data, which has a complex structure
and lots of noise. Therefore, its annotation requires expert experience and a
huge workload. In this study, we present lambda-BCE and lambda-smooth L1 loss to
effectively train 3D-CNN by some slices from 3D seismic data, so that the model
can learn the segmentation of 3D seismic data from a few 2D slices. In order to
fully extract information from limited data and suppress seismic noise, we
propose an attention module that can be used for active supervision training
and embedded in the network. The attention heatmap label is generated from the
original label and is used to supervise the attention module via the
lambda-smooth L1 loss. The experiment demonstrates the effectiveness of our loss
function: the method can extract 3D seismic features from a few 2D slice
labels. It also shows the advanced performance of the attention module,
which can significantly suppress the noise in the seismic data while increasing
the model's sensitivity to the foreground. Finally, on the public test set, we
train using only 2D slice labels that account for 3.3% of the 3D volume
label, and achieve similar performance to the 3D volume label training.","['cs.CV', 'physics.geo-ph']"
23,"A Survey on Machine Learning Techniques for Auto Labeling of Video, Audio, and Text Data","Machine learning has been utilized to perform tasks in many different domains
such as classification, object detection, image segmentation and natural
language analysis. Data labeling has always been one of the most important
tasks in machine learning. However, labeling large amounts of data increases
the monetary cost in machine learning. As a result, researchers started to
focus on reducing data annotation and labeling costs. Transfer learning was
designed and widely used as an efficient approach that can reasonably reduce
the negative impact of limited data, which in turn, reduces the data
preparation cost. Even transferring previous knowledge from a source domain
reduces the amount of data needed in a target domain. However, large amounts of
annotated data are still demanded to build robust models and improve the
prediction accuracy of the model. Therefore, researchers started to pay more
attention to auto annotation and labeling. In this survey paper, we provide a
review of previous techniques that focus on optimized data annotation and
labeling for video, audio, and text data.",['cs.LG']
24,From Contexts to Locality: Ultra-high Resolution Image Segmentation via Locality-aware Contextual Correlation,"Ultra-high resolution image segmentation has attracted increasing interest in
recent years due to its real-world applications. In this paper, we innovate the
widely used high-resolution image segmentation pipeline, in which an ultra-high
resolution image is partitioned into regular patches for local segmentation and
then the local results are merged into a high-resolution semantic mask. In
particular, we introduce a novel locality-aware contextual correlation based
segmentation model to process local patches, where the relevance between a local
patch and its various contexts is jointly and complementarily utilized to
handle the semantic regions with large variations. Additionally, we present a
contextual semantics refinement network that associates the local segmentation
result with its contextual semantics, and thus is endowed with the ability of
reducing boundary artifacts and refining mask contours during the generation of
final high-resolution mask. Furthermore, in comprehensive experiments, we
demonstrate that our model outperforms other state-of-the-art methods in public
benchmarks. Our released codes are available at
https://github.com/liqiokkk/FCtL.",['cs.CV']
25,"A Critical Connectivity Radius for Segmenting Randomly-Generated, High Dimensional Data Points","Motivated by a $2$-dimensional (unsupervised) image segmentation task whereby
local regions of pixels are clustered via edge detection methods, a more
general probabilistic mathematical framework is devised. Critical thresholds
are calculated that indicate strong correlation between randomly-generated,
high dimensional data points that have been projected into structures in a
partition of a bounded, $2$-dimensional area, of which, an image is a special
case. A neighbor concept for structures in the partition is defined and a
critical radius is uncovered. Measured from a central structure in localized
regions of the partition, the radius indicates strong, long and short range
correlation in the count of occupied structures. The size of a short interval
of radii is estimated upon which the transition from short-to-long range
correlation is virtually assured, which defines a demarcation of when an image
ceases to be ""interesting"".","['cs.LG', '60D05, 62C99']"
26,Personalized Image Semantic Segmentation,"Semantic segmentation models trained on public datasets have achieved great
success in recent years. However, these models do not consider the
personalization issue of segmentation, though it is important in practice. In
this paper, we address the problem of personalized image segmentation. The
objective is to generate more accurate segmentation results on unlabeled
personalized images by investigating the data's personalized traits. To open up
future research in this area, we collect a large dataset containing various
users' personalized images called PIS (Personalized Image Semantic
Segmentation). We also survey some recent research related to this problem
and report their performance on our dataset. Furthermore, by observing the
correlation among a user's personalized images, we propose a baseline method
that incorporates the inter-image context when segmenting certain images.
Extensive experiments show that our method outperforms the existing methods on
the proposed dataset. The code and the PIS dataset will be made publicly
available.",['cs.CV']
27,Segmenter: Transformer for Semantic Segmentation,"Image segmentation is often ambiguous at the level of individual image
patches and requires contextual information to reach label consensus. In this
paper we introduce Segmenter, a transformer model for semantic segmentation. In
contrast to convolution-based methods, our approach allows modeling global
context already at the first layer and throughout the network. We build on the
recent Vision Transformer (ViT) and extend it to semantic segmentation. To do
so, we rely on the output embeddings corresponding to image patches and obtain
class labels from these embeddings with a point-wise linear decoder or a mask
transformer decoder. We leverage models pre-trained for image classification
and show that we can fine-tune them on moderate sized datasets available for
semantic segmentation. The linear decoder already yields excellent results,
but the performance can be further improved by a mask transformer
generating class masks. We conduct an extensive ablation study to show the
impact of the different parameters; in particular, the performance is better for
large models and small patch sizes. Segmenter attains excellent results for
semantic segmentation. It outperforms the state of the art on both ADE20K and
Pascal Context datasets and is competitive on Cityscapes.","['cs.CV', 'cs.AI', 'cs.LG']"
28,Autonomous Removal of Perspective Distortion of Elevator Button Images based on Corner Detection,"Elevator button recognition is a critical function to realize the autonomous
operation of elevators. However, challenging image conditions and various image
distortions make it difficult to recognize buttons accurately. To fill this
gap, we propose a novel deep learning-based approach, which aims to
autonomously correct perspective distortions of elevator button images based on
button corner detection results. First, we leverage a novel image segmentation
model and the Hough Transform method to obtain button segmentation and button
corner detection results. Then, pixel coordinates of standard button corners
are used as reference features to estimate camera motions for correcting
perspective distortions. Fifteen elevator button images are captured from
different angles of view as the dataset. The experimental results demonstrate
that our proposed approach is capable of estimating camera motions and removing
perspective distortions of elevator button images with high accuracy.","['cs.CV', 'cs.RO']"
29,Box-Adapt: Domain-Adaptive Medical Image Segmentation using Bounding Box Supervision,"Deep learning has achieved remarkable success in medical image segmentation,
but it usually requires a large number of images labeled with fine-grained
segmentation masks, and the annotation of these masks can be very expensive
and time-consuming. Therefore, recent methods try to use unsupervised domain
adaptation (UDA) methods to borrow information from labeled data from other
datasets (source domains) to a new dataset (target domain). However, due to the
absence of labels in the target domain, the performance of UDA methods is much
worse than that of the fully supervised method. In this paper, we propose a
weakly supervised domain adaptation setting, in which we can partially label
new datasets with bounding boxes, which are easier and cheaper to obtain than
segmentation masks. Accordingly, we propose a new weakly-supervised domain
adaptation method called Box-Adapt, which fully explores the fine-grained
segmentation mask in the source domain and the weak bounding box in the target
domain. Our Box-Adapt is a two-stage method that first performs joint training
on the source and target domains, and then conducts self-training with the
pseudo-labels of the target domain. We demonstrate the effectiveness of
our method in the liver segmentation task. Weakly supervised domain adaptation",['cs.CV']
30,ISNet: Integrate Image-Level and Semantic-Level Context for Semantic Segmentation,"Co-occurrent visual pattern makes aggregating contextual information a common
paradigm to enhance the pixel representation for semantic image segmentation.
The existing approaches focus on modeling the context from the perspective of
the whole image, i.e., aggregating the image-level contextual information.
Although impressive, these methods weaken the significance of the pixel
representations of the same category, i.e., the semantic-level contextual
information. To address this, this paper proposes to augment the pixel
representations by aggregating the image-level and semantic-level contextual
information, respectively. First, an image-level context module is designed to
capture the contextual information for each pixel in the whole image. Second,
we aggregate the representations of the same category for each pixel where the
category regions are learned under the supervision of the ground-truth
segmentation. Third, we compute the similarities between each pixel
representation and the image-level and semantic-level
contextual information, respectively. Finally, a pixel representation is
augmented by a weighted aggregation of both the image-level contextual information
and the semantic-level contextual information, with the similarities as the
weights. Integrating the image-level and semantic-level context allows this
paper to report state-of-the-art accuracy on four benchmarks, i.e., ADE20K,
LIP, COCOStuff and Cityscapes.",['cs.CV']
31,Re-using Adversarial Mask Discriminators for Test-time Training under Distribution Shifts,"Thanks to their ability to learn flexible data-driven losses, Generative
Adversarial Networks (GANs) are an integral part of many semi- and
weakly-supervised methods for medical image segmentation. GANs jointly optimise
a generator and an adversarial discriminator on a set of training data. After
training has completed, the discriminator is usually discarded and only the
generator is used for inference. But should we discard discriminators? In this
work, we argue that training stable discriminators produces expressive loss
functions that we can re-use at inference to detect and correct segmentation
mistakes. First, we identify key challenges and suggest possible solutions to
make discriminators re-usable at inference. Then, we show that we can combine
discriminators with image reconstruction costs (via decoders) to further
improve the model. Our method is simple and improves the test-time performance
of pre-trained GANs. Moreover, we show that it is compatible with standard
post-processing techniques and has the potential to be used for Online
Continual Learning. With our work, we open new research avenues for re-using
adversarial discriminators at inference.","['cs.CV', 'eess.IV']"
32,Mining Contextual Information Beyond Image for Semantic Segmentation,"This paper studies the context aggregation problem in semantic image
segmentation. Existing research focuses on improving the pixel
representations by aggregating the contextual information within individual
images. Though impressive, these methods neglect the significance of the
representations of the pixels of the corresponding class beyond the input
image. To address this, this paper proposes to mine the contextual information
beyond individual images to further augment the pixel representations. We first
set up a feature memory module, which is updated dynamically during training,
to store the dataset-level representations of various categories. Then, we
learn class probability distribution of each pixel representation under the
supervision of the ground-truth segmentation. Finally, the representation of
each pixel is augmented by aggregating the dataset-level representations based
on the corresponding class probability distribution. Furthermore, by utilizing
the stored dataset-level representations, we also propose a representation
consistent learning strategy to make the classification head better address
intra-class compactness and inter-class dispersion. The proposed method could
be effortlessly incorporated into existing segmentation frameworks (e.g., FCN,
PSPNet, OCRNet and DeepLabV3) and brings consistent performance improvements.
Mining contextual information beyond image allows us to report state-of-the-art
performance on various benchmarks: ADE20K, LIP, Cityscapes and COCO-Stuff.",['cs.CV']
33,Fully Transformer Networks for Semantic Image Segmentation,"Transformers have shown impressive performance in various natural language
processing and computer vision tasks, due to their capability of modeling
long-range dependencies. Recent progress has demonstrated that combining such
transformers with CNN-based semantic image segmentation models is very
promising. However, it is not yet well studied how well a pure transformer-based
approach can perform for image segmentation. In this work, we explore a
novel framework for semantic image segmentation, which is encoder-decoder based
Fully Transformer Networks (FTN). Specifically, we first propose a Pyramid
Group Transformer (PGT) as the encoder for progressively learning hierarchical
features, while reducing the computational complexity of the standard visual
transformer (ViT). Then, we propose a Feature Pyramid Transformer (FPT) to fuse
semantic-level and spatial-level information from multiple levels of the PGT
encoder for semantic image segmentation. Surprisingly, this simple baseline can
achieve new state-of-the-art results on multiple challenging semantic
segmentation benchmarks, including PASCAL Context, ADE20K and COCO-Stuff. The
source code will be released upon the publication of this work.",['cs.CV']
34,PoissonSeg: Semi-Supervised Few-Shot Medical Image Segmentation via Poisson Learning,"The application of deep learning to medical image segmentation has been
hampered due to the lack of abundant pixel-level annotated data. Few-shot
Semantic Segmentation (FSS) is a promising strategy for breaking the deadlock.
However, a high-performing FSS model still requires sufficient pixel-level
annotated classes for training to avoid overfitting, which leads to its
performance bottleneck in medical image segmentation due to the unmet need for
annotations. Thus, semi-supervised FSS for medical images is accordingly
proposed to utilize unlabeled data for further performance improvement.
Nevertheless, existing semi-supervised FSS methods have two obvious defects: (1)
neglecting the relationship between the labeled and unlabeled data; (2) using
unlabeled data directly for end-to-end training leads to degenerated
representation learning. To address these problems, we propose a novel
semi-supervised FSS framework for medical image segmentation. The proposed
framework employs Poisson learning for modeling data relationship and
propagating supervision signals, and Spatial Consistency Calibration for
encouraging the model to learn more coherent representations. In this process,
unlabeled samples are not involved in end-to-end training, but provide
supervisory information for query image segmentation through graph-based
learning. We conduct extensive experiments on three medical image segmentation
datasets (i.e. ISIC skin lesion segmentation, abdominal organs segmentation for
MRI and abdominal organs segmentation for CT) to demonstrate the
state-of-the-art performance and broad applicability of the proposed framework.","['cs.CV', 'cs.LG']"
35,Deep Reinforcement Learning in Computer Vision: A Comprehensive Survey,"Deep reinforcement learning augments the reinforcement learning framework and
utilizes the powerful representation of deep neural networks. Recent works have
demonstrated the remarkable successes of deep reinforcement learning in various
domains including finance, medicine, healthcare, video games, robotics, and
computer vision. In this work, we provide a detailed review of recent and
state-of-the-art research advances of deep reinforcement learning in computer
vision. We start with comprehending the theories of deep learning,
reinforcement learning, and deep reinforcement learning. We then propose a
categorization of deep reinforcement learning methodologies and discuss their
advantages and limitations. In particular, we divide deep reinforcement
learning into seven main categories according to their applications in computer
vision, i.e., (i) landmark localization; (ii) object detection; (iii) object
tracking; (iv) registration on both 2D image and 3D image volumetric data; (v)
image segmentation; (vi) video analysis; and (vii) other applications. Each of
these categories is further analyzed with reinforcement learning techniques,
network design, and performance. Moreover, we provide a comprehensive analysis
of the existing publicly available datasets and examine source code
availability. Finally, we present some open issues and discuss future research
directions on deep reinforcement learning in computer vision.","['cs.CV', 'cs.AI']"
36,Duo-SegNet: Adversarial Dual-Views for Semi-Supervised Medical Image Segmentation,"Segmentation of images is a long-standing challenge in medical AI. This is
mainly due to the fact that training a neural network to perform image
segmentation requires a significant number of pixel-level annotated data, which
is often unavailable. To address this issue, we propose a semi-supervised image
segmentation technique based on the concept of multi-view learning. In contrast
to the previous art, we introduce an adversarial form of dual-view training and
employ a critic to formulate the learning problem in multi-view training as a
min-max problem. Thorough quantitative and qualitative evaluations on several
datasets indicate that our proposed method outperforms state-of-the-art medical
image segmentation algorithms consistently and comfortably. The code is
publicly available at https://github.com/himashi92/Duo-SegNet",['cs.CV']
37,Comprehensive Multi-Modal Interactions for Referring Image Segmentation,"We investigate Referring Image Segmentation (RIS), which outputs a
segmentation map corresponding to the given natural language description. To
solve RIS efficiently, we need to understand each word's relationship with
other words, each region in the image to other regions, and cross-modal
alignment between linguistic and visual domains. We argue that one of the
limiting factors in the recent methods is that they do not handle these
interactions simultaneously. To this end, we propose a novel architecture
called JRNet, which uses a Joint Reasoning Module (JRM) to concurrently capture
the inter-modal and intra-modal interactions. The output of JRM is passed
through a novel Cross-Modal Multi-Level Fusion (CMMLF) module which further
refines the segmentation masks by exchanging contextual information across
visual hierarchy through linguistic features acting as a bridge. We present
thorough ablation studies and validate our approach's performance on four
benchmark datasets, showing considerable performance gains over the existing
state-of-the-art methods.",['cs.CV']
38,Self-Paced Contrastive Learning for Semi-supervised Medical Image Segmentation with Meta-labels,"Pre-training a recognition model with contrastive learning on a large dataset
of unlabeled data has shown great potential to boost the performance of a
downstream task, e.g., image classification. However, in domains such as
medical imaging, collecting unlabeled data can be challenging and expensive. In
this work, we propose to adapt contrastive learning to work with meta-label
annotations, for improving the model's performance in medical image
segmentation even when no additional unlabeled data is available. Meta-labels
such as the location of a 2D slice in a 3D MRI scan or the type of device used,
often come for free during the acquisition process. We use the meta-labels for
pre-training the image encoder as well as to regularize a semi-supervised
training, in which a reduced set of annotated data is used for training.
Finally, to fully exploit the weak annotations, a self-paced learning approach
is used to help the learning and discriminate useful labels from noise. Results
on three different medical image segmentation datasets show that our approach:
i) highly boosts the performance of a model trained on a few scans, ii)
outperforms previous contrastive and semi-supervised approaches, and iii)
reaches close to the performance of a model trained on the full data.",['cs.CV']
39,Multi-task Federated Learning for Heterogeneous Pancreas Segmentation,"Federated learning (FL) for medical image segmentation becomes more
challenging in multi-task settings where clients might have different
categories of labels represented in their data. For example, one client might
have patient data with ""healthy"" pancreases only while datasets from other
clients may contain cases with pancreatic tumors. The vanilla federated
averaging algorithm makes it possible to obtain more generalizable deep
learning-based segmentation models representing the training data from multiple
institutions without centralizing datasets. However, it might be sub-optimal
for the aforementioned multi-task scenarios. In this paper, we investigate
heterogeneous optimization methods that show improvements for the automated
segmentation of pancreas and pancreatic tumors in abdominal CT images with FL
settings.","['cs.CV', 'I.4.6']"
40,Membership Inference Attacks are Easier on Difficult Problems,"Membership inference attacks (MIA) try to detect if data samples were used to
train a neural network model, e.g. to detect copyright abuses. We show that
models with higher dimensional input and output are more vulnerable to MIA, and
address in more detail models for image translation and semantic segmentation,
including medical image segmentation. We show that reconstruction-errors can
lead to very effective MIA attacks as they are indicative of memorization.
Unfortunately, reconstruction error alone is less effective at discriminating
between non-predictable images used in training and easy to predict images that
were never seen before. To overcome this, we propose using a novel
predictability error that can be computed for each sample, and its computation
does not require a training set. Our membership error, obtained by subtracting
the predictability error from the reconstruction error, is shown to achieve
high MIA accuracy on an extensive number of benchmarks.","['cs.LG', 'cs.CR']"
41,Cross-Image Region Mining with Region Prototypical Network for Weakly Supervised Segmentation,"Weakly supervised image segmentation trained with image-level labels usually
suffers from inaccurate coverage of object areas during the generation of the
pseudo groundtruth. This is because the object activation maps are trained with
the classification objective and lack the ability to generalize. To improve the
generality of the object activation maps, we propose a region prototypical
network RPNet to explore the cross-image object diversity of the training set.
Similar object parts across images are identified via region feature
comparison. Object confidence is propagated between regions to discover new
object areas while background regions are suppressed. Experiments show that the
proposed method generates more complete and accurate pseudo object masks, while
achieving state-of-the-art performance on PASCAL VOC 2012 and MS COCO. In
addition, we investigate the robustness of the proposed method on reduced
training sets.",['cs.CV']
42,SimCVD: Simple Contrastive Voxel-Wise Representation Distillation for Semi-Supervised Medical Image Segmentation,"Automated segmentation in medical image analysis is a challenging task that
requires a large amount of manually labeled data. However, most existing
learning-based approaches usually suffer from limited manually annotated
medical data, which poses a major practical problem for accurate and robust
medical image segmentation. In addition, most existing semi-supervised
approaches are usually not robust compared with the supervised counterparts,
and also lack explicit modeling of geometric structure and semantic
information, both of which limit the segmentation accuracy. In this work, we
present SimCVD, a simple contrastive distillation framework that significantly
advances state-of-the-art voxel-wise representation learning. We first describe
an unsupervised training strategy, which takes two views of an input volume and
predicts their signed distance maps of object boundaries in a contrastive
objective, with only two independent dropout masks. This simple approach
works surprisingly well, performing on the same level as previous fully
supervised methods with much less labeled data. We hypothesize that dropout can
be viewed as a minimal form of data augmentation and makes the network robust
to representation collapse. Then, we propose to perform structural distillation
by distilling pair-wise similarities. We evaluate SimCVD on two popular
datasets: the Left Atrial Segmentation Challenge (LA) and the NIH pancreas CT
dataset. The results on the LA dataset demonstrate that, in two types of
labeled ratios (i.e., 20% and 10%), SimCVD achieves an average Dice score of
90.85% and 89.03% respectively, a 0.91% and 2.22% improvement compared to
previous best results. Our method can be trained in an end-to-end fashion,
showing the promise of utilizing SimCVD as a general framework for downstream
tasks, such as medical image synthesis and registration.","['cs.CV', 'cs.AI', 'cs.LG']"
43,Multi-Slice Dense-Sparse Learning for Efficient Liver and Tumor Segmentation,"Accurate automatic liver and tumor segmentation plays a vital role in
treatment planning and disease monitoring. Recently, deep convolutional neural
networks (DCNNs) have achieved tremendous success in 2D and 3D medical image
segmentation. However, 2D DCNNs cannot fully leverage the inter-slice
information, while 3D DCNNs are computationally expensive and memory intensive.
To address these issues, we first propose a novel dense-sparse training flow
from a data perspective, in which, densely adjacent slices and sparsely
adjacent slices are extracted as inputs for regularizing DCNNs, thereby
improving the model performance. Moreover, we design a 2.5D light-weight
nnU-Net from a network perspective, in which, depthwise separable convolutions
are adopted to improve the efficiency. Extensive experiments on the LiTS
dataset have demonstrated the superiority of the proposed method.","['cs.CV', 'cs.AI', 'cs.LG', 'eess.IV']"
44,Real-Time Multi-Modal Semantic Fusion on Unmanned Aerial Vehicles,"Unmanned aerial vehicles (UAVs) equipped with multiple complementary sensors
have tremendous potential for fast autonomous or remote-controlled semantic
scene analysis, e.g., for disaster examination. In this work, we propose a UAV
system for real-time semantic inference and fusion of multiple sensor
modalities. Semantic segmentation of LiDAR scans and RGB images, as well as
object detection on RGB and thermal images, run online onboard the UAV computer
using lightweight CNN architectures and embedded inference accelerators. We
follow a late fusion approach where semantic information from multiple
modalities augments 3D point clouds and image segmentation masks while also
generating an allocentric semantic map. Our system provides augmented semantic
images and point clouds at $\approx\,$9$\,$Hz. We evaluate the integrated
system in real-world experiments in an urban environment.","['cs.CV', 'cs.RO']"
45,Hierarchical Random Walker Segmentation for Large Volumetric Biomedical Images,"The random walker method for image segmentation is a popular tool for
semi-automatic image segmentation, especially in the biomedical field. However,
its linear asymptotic run time and memory requirements make application to 3D
datasets of increasing sizes impractical. We propose a hierarchical framework
that, to the best of our knowledge, is the first attempt to overcome these
restrictions for the random walker algorithm and achieves sublinear run time
and constant memory complexity. The goal of this framework is -- rather than
improving the segmentation quality compared to the baseline method -- to make
interactive segmentation on out-of-core datasets possible. The method is
evaluated quantitatively on synthetic data and the CT-ORG dataset where the
expected improvements in algorithm run time while maintaining high segmentation
quality are confirmed. The incremental (i.e., interaction update) run time is
demonstrated to be in seconds on a standard PC even for volumes of hundreds of
Gigabytes in size. In a small case study, the applicability to large real-world
data from current biomedical research is demonstrated. An implementation of the
presented method is publicly available in version 5.2 of the widely used volume
rendering and processing software Voreen (https://www.uni-muenster.de/Voreen/).",['cs.CV']
46,Few-Shot Segmentation with Global and Local Contrastive Learning,"In this work, we address the challenging task of few-shot segmentation.
Previous few-shot segmentation methods mainly employ the information of support
images as guidance for query image segmentation. Although some works propose to
build cross-reference between support and query images, their extraction of
query information still depends on the support images. We here propose to
extract the information from the query itself independently to benefit the
few-shot segmentation task. To this end, we first propose a prior extractor to
learn the query information from the unlabeled images with our proposed
global-local contrastive learning. Then, we extract a set of predetermined
priors via this prior extractor. With the obtained priors, we generate the
prior region maps for query images, which locate the objects, as guidance to
perform cross interaction with support features. In such a way, the extraction
of query information is detached from the support branch, overcoming the
limitation by support, and could obtain more informative query clues to achieve
better interaction. Without bells and whistles, the proposed approach achieves
new state-of-the-art performance for the few-shot segmentation task on
PASCAL-5$^{i}$ and COCO datasets.",['cs.CV']
47,AASeg: Attention Aware Network for Real Time Semantic Segmentation,"In this paper, we present a new network named Attention Aware Network (AASeg)
for real time semantic image segmentation. Our network incorporates spatial and
channel information using Spatial Attention (SA) and Channel Attention (CA)
modules respectively. It also uses dense local multi-scale context information
using a Multi Scale Context (MSC) module. The feature maps are concatenated
individually to produce the final segmentation map. We demonstrate the
effectiveness of our method using a comprehensive analysis, quantitative
experimental results and ablation study using Cityscapes, ADE20K and Camvid
datasets. Our network performs better than most previous architectures with a
74.4% mean IoU on the Cityscapes test dataset while running at 202.7 FPS.","['cs.CV', 'cs.LG', 'eess.IV']"
48,Medical image segmentation with imperfect 3D bounding boxes,"The development of high quality medical image segmentation algorithms depends
on the availability of large datasets with pixel-level labels. The challenges
of collecting such datasets, especially in case of 3D volumes, motivate the
development of approaches that can learn from other types of labels that are cheap to
obtain, e.g. bounding boxes. We focus on 3D medical images with their
corresponding 3D bounding boxes which are considered as series of per-slice
non-tight 2D bounding boxes. While current weakly-supervised approaches that
use 2D bounding boxes as weak labels can be applied to medical image
segmentation, we show that their success is limited in cases when the
assumption about the tightness of the bounding boxes breaks. We propose a new
bounding box correction framework which is trained on a small set of
pixel-level annotations to improve the tightness of a larger set of non-tight
bounding box annotations. The effectiveness of our solution is demonstrated by
evaluating a known weakly-supervised segmentation approach with and without the
proposed bounding box correction algorithm. When the tightness is improved by
our solution, the results of the weakly-supervised segmentation become much
closer to those of the fully-supervised one.",['cs.CV']
49,Contrastive Semi-Supervised Learning for 2D Medical Image Segmentation,"Contrastive Learning (CL) is a recent representation learning approach, which
encourages inter-class separability and intra-class compactness in learned
image representations. Since medical images often contain multiple semantic
classes in an image, using CL to learn representations of local features (as
opposed to global) is important. In this work, we present a novel
semi-supervised 2D medical segmentation solution that applies CL on image
patches, instead of full images. These patches are meaningfully constructed
using the semantic information of different classes obtained via pseudo
labeling. We also propose a novel consistency regularization (CR) scheme, which
works in synergy with CL. It addresses the problem of confirmation bias, and
encourages better clustering in the feature space. We evaluate our method on
four public medical segmentation datasets and a novel histopathology dataset
that we introduce. Our method obtains consistent improvements over
state-of-the-art semi-supervised segmentation approaches for all datasets.",['cs.CV']
50,Source-Free Domain Adaptation for Image Segmentation,"Domain adaptation (DA) has drawn high interest for its capacity to adapt a
model trained on labeled source data to perform well on unlabeled or weakly
labeled target data from a different domain. Most common DA techniques require
concurrent access to the input images of both the source and target domains.
However, in practice, privacy concerns often impede the availability of source
images in the adaptation phase. This is a very frequent DA scenario in medical
imaging, where, for instance, the source and target images could come from
different clinical sites. We introduce a source-free domain adaptation for
image segmentation. Our formulation is based on minimizing a label-free entropy
loss defined over target-domain data, which we further guide with a
domain-invariant prior on the segmentation regions. Many priors can be derived
from anatomical information. Here, a class ratio prior is estimated from
anatomical knowledge and integrated in the form of a Kullback Leibler (KL)
divergence in our overall loss function. Furthermore, we motivate our overall
loss with an interesting link to maximizing the mutual information between the
target images and their label predictions. We show the effectiveness of our
prior aware entropy minimization in a variety of domain-adaptation scenarios,
with different modalities and applications, including spine, prostate, and
cardiac segmentation. Our method yields comparable results to several state of
the art adaptation techniques, despite having access to much less information,
as the source images are entirely absent in our adaptation phase. Our
straightforward adaptation strategy uses only one network, contrary to popular
adversarial techniques, which are not applicable to a source-free DA setting.
Our framework can be readily used in a breadth of segmentation problems, and
our code is publicly available: https://github.com/mathilde-b/SFDA",['cs.CV']
51,Efficient and Generic Interactive Segmentation Framework to Correct Mispredictions during Clinical Evaluation of Medical Images,"Semantic segmentation of medical images is an essential first step in
computer-aided diagnosis systems for many applications. However, given many
disparate imaging modalities and inherent variations in the patient data, it is
difficult to consistently achieve high accuracy using modern deep neural
networks (DNNs). This has led researchers to propose interactive image
segmentation techniques where a medical expert can interactively correct the
output of a DNN to the desired accuracy. However, these techniques often need
separate training data with the associated human interactions, and do not
generalize to various diseases, and types of medical images. In this paper, we
suggest a novel conditional inference technique for DNNs which takes the
intervention by a medical expert as test time constraints and performs
inference conditioned upon these constraints. Our technique is generic and can be
used for medical images from any modality. Unlike other methods, our approach
can correct multiple structures simultaneously and add structures missed at
initial segmentation. We report improvements of 13.3, 12.5, 17.8, 10.2, and
12.4 times in user annotation time over full human annotation for nucleus,
multiple cells, liver and tumor, organ, and brain segmentation, respectively. We
report a time saving of 2.8, 3.0, 1.9, 4.4, and 8.6 fold compared to other
interactive segmentation techniques. Our method can be useful to clinicians for
diagnosis and post-surgical follow-up with minimal intervention from the
medical expert. The source-code and the detailed results are available here
[1].","['cs.CV', '49-06 (Primary), 49-11(Secondary)', 'I.4.6; I.5.1']"
52,Hidden Markov Modeling for Maximum Likelihood Neuron Reconstruction,"Recent advances in brain clearing and imaging have made it possible to image
entire mammalian brains at sub-micron resolution. These images offer the
potential to assemble brain-wide atlases of projection neuron morphology, but
manual neuron reconstruction remains a bottleneck. In this paper we present a
probabilistic method which combines a hidden Markov state process that encodes
neuron geometric properties with a random field appearance model of the
fluorescence process. Our method utilizes dynamic programming to efficiently
compute the global maximizers of what we call the ""most probable"" neuron path.
We applied our algorithm to the output of image segmentation models where false
negatives severed neuronal processes, and showed that it can follow axons in
the presence of noise or nearby neurons. Our method has the potential to be
integrated into a semi or fully automated reconstruction pipeline.
Additionally, it creates a framework for conditioning the probability to fixed
start and endpoints through which users can intervene with hard constraints to,
for example, rule out certain reconstructions, or assign axons to particular
cell bodies.",['cs.CV']
53,Improving Aleatoric Uncertainty Quantification in Multi-Annotated Medical Image Segmentation with Normalizing Flows,"Quantifying uncertainty in medical image segmentation applications is
essential, as it is often connected to vital decision-making. Compelling
attempts have been made in quantifying the uncertainty in image segmentation
architectures, e.g. to learn a density segmentation model conditioned on the
input image. Typical work in this field restricts these learnt densities to be
strictly Gaussian. In this paper, we propose to use a more flexible approach by
introducing Normalizing Flows (NFs), which enables the learnt densities to be
more complex and facilitate more accurate modeling for uncertainty. We prove
this hypothesis by adopting the Probabilistic U-Net and augmenting the
posterior density with an NF, allowing it to be more expressive. Our
qualitative as well as quantitative (GED and IoU) evaluations on the
multi-annotated and single-annotated LIDC-IDRI and Kvasir-SEG segmentation
datasets, respectively, show a clear improvement. This is mostly apparent in
the quantification of aleatoric uncertainty and the increased predictive
performance of up to 14 percent. This result strongly indicates that a more
flexible density model should be seriously considered in architectures that
attempt to capture segmentation ambiguity through density modeling. The benefit
of this improved modeling will increase human confidence in annotation and
segmentation, and enable eager adoption of the technology in practice.","['cs.CV', 'cs.LG']"
54,Shape Modeling with Spline Partitions,"Shape modelling (with methods that output shapes) is a new and important task
in Bayesian nonparametrics and bioinformatics. In this work, we focus on
Bayesian nonparametric methods for capturing shapes by partitioning a space
using curves. In related work, the classical Mondrian process is used to
partition spaces recursively with axis-aligned cuts, and is widely applied in
multi-dimensional and relational data. The Mondrian process outputs
hyper-rectangles. Recently, the random tessellation process was introduced as a
generalization of the Mondrian process, partitioning a domain with non-axis
aligned cuts in an arbitrary dimensional space, and outputting polytopes.
Motivated by these processes, in this work, we propose a novel parallelized
Bayesian nonparametric approach to partition a domain with curves, enabling
complex data-shapes to be acquired. We apply our method to an HIV-1-infected human
macrophage image dataset, and also to simulated datasets, to illustrate our
approach. We compare to support vector machines, random forests and
state-of-the-art computer vision methods such as simple linear iterative
clustering super pixel image segmentation. We develop an R package that is
available at
\url{https://github.com/ShufeiGe/Shape-Modeling-with-Spline-Partitions}.","['stat.ML', 'cs.LG']"
55,"Distribution-Free, Risk-Controlling Prediction Sets","While improving prediction accuracy has been the focus of machine learning in
recent years, this alone does not suffice for reliable decision-making.
Deploying learning systems in consequential settings also requires calibrating
and communicating the uncertainty of predictions. To convey instance-wise
uncertainty for prediction tasks, we show how to generate set-valued
predictions from a black-box predictor that control the expected loss on future
test points at a user-specified level. Our approach provides explicit
finite-sample guarantees for any dataset by using a holdout set to calibrate
the size of the prediction sets. This framework enables simple,
distribution-free, rigorous error control for many tasks, and we demonstrate it
in five large-scale machine learning problems: (1) classification problems
where some mistakes are more costly than others; (2) multi-label
classification, where each observation has multiple associated labels; (3)
classification problems where the labels have a hierarchical structure; (4)
image segmentation, where we wish to predict a set of pixels containing an
object of interest; and (5) protein structure prediction. Lastly, we discuss
extensions to uncertainty quantification for ranking, metric learning and
distributionally robust learning.","['cs.LG', 'cs.AI', 'cs.CV', 'stat.ME', 'stat.ML']"
56,Recurrent Mask Refinement for Few-Shot Medical Image Segmentation,"Although having achieved great success in medical image segmentation, deep
convolutional neural networks usually require a large dataset with manual
annotations for training and are difficult to generalize to unseen classes.
Few-shot learning has the potential to address these challenges by learning new
classes from only a few labeled examples. In this work, we propose a new
framework for few-shot medical image segmentation based on prototypical
networks. Our innovation lies in the design of two key modules: 1) a context
relation encoder (CRE) that uses correlation to capture local relation features
between foreground and background regions; and 2) a recurrent mask refinement
module that repeatedly uses the CRE and a prototypical network to recapture the
change of context relationship and refine the segmentation mask iteratively.
Experiments on two abdomen CT datasets and an abdomen MRI dataset show the
proposed method obtains substantial improvement over the state-of-the-art
methods by an average of 16.32%, 8.45% and 6.24% in terms of DSC, respectively.
Code is publicly available.",['cs.CV']
57,Transductive image segmentation: Self-training and effect of uncertainty estimation,"Semi-supervised learning (SSL) uses unlabeled data during training to learn
better models. Previous studies on SSL for medical image segmentation focused
mostly on improving model generalization to unseen data. In some applications,
however, our primary interest is not generalization but to obtain optimal
predictions on a specific unlabeled database that is fully available during
model development. Examples include population studies for extracting imaging
phenotypes. This work investigates an often overlooked aspect of SSL,
transduction. It focuses on the quality of predictions made on the unlabeled
data of interest when they are included for optimization during training,
rather than improving generalization. We focus on the self-training framework
and explore its potential for transduction. We analyze it through the lens of
Information Gain and reveal that learning benefits from the use of calibrated
or under-confident models. Our extensive experiments on a large MRI database
for multi-class segmentation of traumatic brain lesions show promising results
when comparing transductive with inductive predictions. We believe this study
will inspire further research on transductive learning, a well-suited paradigm
for medical image analysis.",['cs.CV']
58,Visual Boundary Knowledge Translation for Foreground Segmentation,"When confronted with objects of unknown types in an image, humans can
effortlessly and precisely tell their visual boundaries. This recognition
mechanism and underlying generalization capability seem to contrast with
state-of-the-art image segmentation networks that rely on large-scale
category-aware annotated training samples. In this paper, we make an attempt
towards building models that explicitly account for visual boundary knowledge,
in the hope of reducing the training effort on segmenting unseen categories.
Specifically, we investigate a new task termed as Boundary Knowledge
Translation (BKT). Given a set of fully labeled categories, BKT aims to
translate the visual boundary knowledge learned from the labeled categories, to
a set of novel categories, each of which is provided only a few labeled
samples. To this end, we propose a Translation Segmentation Network
(Trans-Net), which comprises a segmentation network and two boundary
discriminators. The segmentation network, combined with a boundary-aware
self-supervised mechanism, is devised to conduct foreground segmentation, while
the two discriminators work together in an adversarial manner to ensure an
accurate segmentation of the novel categories under light supervision.
Exhaustive experiments demonstrate that, with only tens of labeled samples as
guidance, Trans-Net achieves results on par with fully supervised
methods.",['cs.CV']
59,OPFython: A Python-Inspired Optimum-Path Forest Classifier,"Machine learning techniques have been paramount throughout the last years,
being applied in a wide range of tasks, such as classification, object
recognition, person identification, and image segmentation. Nevertheless,
conventional classification algorithms, e.g., Logistic Regression, Decision
Trees, and Bayesian classifiers, might lack complexity and diversity, making them
unsuitable when dealing with real-world data. A recent graph-inspired classifier,
known as the Optimum-Path Forest, has proven to be a state-of-the-art
technique, comparable to Support Vector Machines and even surpassing them in some
tasks. This paper proposes a Python-based Optimum-Path Forest framework,
denoted as OPFython, where all of its functions and classes are based upon the
original C language implementation. Additionally, as OPFython is a Python-based
library, it provides a more friendly environment and a faster prototyping
workspace than the C language.","['cs.LG', 'cs.CV', 'stat.ML', '68T01', 'I.2.0; I.5.0']"
60,"Iterative, Deep, and Unsupervised Synthetic Aperture Sonar Image Segmentation","Deep learning has not been routinely employed for semantic segmentation of
seabed environment for synthetic aperture sonar (SAS) imagery due to the
implicit need of abundant training data such methods necessitate. Abundant
training data, specifically pixel-level labels for all images, is usually not
available for SAS imagery due to the complex logistics (e.g., diver survey,
chase boat, precision position information) needed for obtaining accurate
ground-truth. Many hand-crafted feature based algorithms have been proposed to
segment SAS in an unsupervised fashion. However, there is still room for
improvement as the feature extraction step of these methods is fixed. In this
work, we present a new iterative unsupervised algorithm for learning deep
features for SAS image segmentation. Our proposed algorithm alternates between
clustering superpixels and updating the parameters of a convolutional neural
network (CNN) so that the feature extraction for image segmentation can be
optimized. We demonstrate the efficacy of our method on a realistic benchmark
dataset. Our results show that the performance of our proposed method is
considerably better than current state-of-the-art methods in SAS image
segmentation.",['cs.CV']
61,Open-World Entity Segmentation,"We introduce a new image segmentation task, termed Entity Segmentation (ES)
with the aim to segment all visual entities in an image without considering
semantic category labels. It has many practical applications in image
manipulation/editing where the segmentation mask quality is typically crucial
but category labels are less important. In this setting, all
semantically-meaningful segments are equally treated as categoryless entities
and there is no thing-stuff distinction. Based on our unified entity
representation, we propose a center-based entity segmentation framework with
two novel modules to improve mask quality. Experimentally, both our new task
and framework demonstrate clear advantages over existing work. In
particular, ES enables the following: (1) merging multiple datasets to form a
large training set without the need to resolve label conflicts; (2) any model
trained on one dataset can generalize exceptionally well to other datasets with
unseen domains. Our code is made publicly available at
https://github.com/dvlab-research/Entity.","['cs.CV', 'cs.LG']"
62,Mapping Vulnerable Populations with AI,"Humanitarian actions require accurate information to efficiently delegate
support operations. Such information can be maps of building footprints,
building functions, and population densities. While the access to this
information is comparably easy in industrialized countries thanks to reliable
census data and national geo-data infrastructures, this is not the case for
developing countries, where that data is often incomplete or outdated. Building
maps derived from remote sensing images may partially remedy this challenge in
such countries, but are not always accurate due to different landscape
configurations and lack of validation data. Even when they exist, building
footprint layers usually do not reveal more fine-grained building properties,
such as the number of stories or the building's function (e.g., office,
residential, school, etc.). In this project we aim to automate building
footprint and function mapping using heterogeneous data sources. In a first
step, we intend to delineate buildings from satellite data, using deep learning
models for semantic image segmentation. Building functions shall be retrieved
by parsing social media data such as tweets, as well as ground-based
imagery, to automatically identify different building functions and retrieve
further information such as the number of building stories. Building maps
augmented with those additional attributes make it possible to derive more
accurate population density maps, needed to support the targeted provision of
humanitarian aid.","['cs.CV', 'eess.IV']"
63,"Comprehensive Validation of Automated Whole Body Skeletal Muscle, Adipose Tissue, and Bone Segmentation from 3D CT images for Body Composition Analysis: Towards Extended Body Composition","The latest advances in computer-assisted precision medicine are making it
feasible to move from population-wide models that are useful to discover
aggregate patterns that hold for group-based analysis to patient-specific
models that can drive patient-specific decisions with regard to treatment
choices, and predictions of outcomes of treatment. Body Composition is
recognized as an important driver and risk factor for a wide variety of
diseases, as well as a predictor of individual patient-specific clinical
outcomes to treatment choices or surgical interventions. 3D CT images are
routinely acquired in oncological workflows and deliver accurate rendering
of internal anatomy and therefore can be used opportunistically to assess the
amount of skeletal muscle and adipose tissue compartments. Powerful tools of
artificial intelligence such as deep learning are making it feasible now to
segment the entire 3D image and generate accurate measurements of all internal
anatomy. These will enable the overcoming of the severe bottleneck that existed
previously, namely, the need for manual segmentation, which was prohibitive to
scale to the hundreds of 2D axial slices that made up a 3D volumetric image.
Automated tools such as presented here will now enable harvesting whole-body
measurements from 3D CT or MRI images, leading to a new era of discovery of the
drivers of various diseases based on individual tissue, organ volume, shape,
and functional status. These measurements were hitherto unavailable thereby
limiting the field to a very small and limited subset. These discoveries and
the potential to perform individual image segmentation with high speed and
accuracy are likely to lead to the incorporation of these 3D measures into
individual specific treatment planning models related to nutrition, aging,
chemotoxicity, surgery and survival after the onset of a major disease such as
cancer.","['cs.CV', 'q-bio.TO']"
64,Segmentation in Style: Unsupervised Semantic Image Segmentation with Stylegan and CLIP,"We introduce a method that automatically segments images into
semantically meaningful regions without human supervision. Derived regions are
consistent across different images and coincide with human-defined semantic
classes on some datasets. In cases where semantic regions might be hard for
humans to define and consistently label, our method is still able to find
meaningful and consistent semantic classes. In our work, we use a pretrained
StyleGAN2~\cite{karras2020analyzing} generative model: clustering in the
feature space of the generative model makes it possible to discover semantic classes. Once
classes are discovered, a synthetic dataset with generated images and
corresponding segmentation masks can be created. After that a segmentation
model is trained on the synthetic dataset and is able to generalize to real
images. Additionally, by using CLIP~\cite{radford2021learning} we are able to
use prompts defined in a natural language to discover some desired semantic
classes. We test our method on publicly available datasets and show
state-of-the-art results.",['cs.CV']
65,Crosslink-Net: Double-branch Encoder Segmentation Network via Fusing Vertical and Horizontal Convolutions,"Accurate image segmentation plays a crucial role in medical image analysis,
yet it faces great challenges of various shapes, diverse sizes, and blurry
boundaries. To address these difficulties, square kernel-based encoder-decoder
architecture has been proposed and widely used, but its performance remains
unsatisfactory. To further cope with these challenges, we present a novel
double-branch encoder architecture. Our architecture is inspired by two
observations: 1) Since the discrimination of features learned via square
convolutional kernels needs to be further improved, we propose to utilize
non-square vertical and horizontal convolutional kernels in the double-branch
encoder, so features learned by the two branches can be expected to complement
each other. 2) Considering that spatial attention can help models to better
focus on the target region in a large-sized image, we develop an attention loss
to further emphasize the segmentation on small-sized targets. Together, the
above two schemes give rise to a novel double-branch encoder segmentation
framework for medical image segmentation, namely Crosslink-Net. The experiments
validate the effectiveness of our model on four datasets. The code is released
at https://github.com/Qianyu1226/Crosslink-Net.","['cs.CV', 'cs.AI', '68T07', 'I.4.6']"
66,Reservoir Computing Approach for Gray Images Segmentation,"The paper proposes a novel approach for grayscale image segmentation. It is
based on extracting multiple features from a single feature per image pixel,
namely its intensity value, using an Echo state network. The newly extracted
features -- reservoir equilibrium states -- reveal hidden image characteristics
that improve its segmentation via a clustering algorithm. Moreover, it was
demonstrated that the intrinsic plasticity tuning of the reservoir fits its
equilibrium states to the original image intensity distribution, thus allowing
for better segmentation. The proposed approach is tested on the benchmark
image Lena.","['cs.CV', 'cs.LG', 'eess.IV']"
67,Superpixel-guided Iterative Learning from Noisy Labels for Medical Image Segmentation,"Learning segmentation from noisy labels is an important task for medical
image analysis due to the difficulty in acquiring high-quality annotations. Most
existing methods neglect the pixel correlation and structural prior in
segmentation, often producing noisy predictions around object boundaries. To
address this, we adopt a superpixel representation and develop a robust
iterative learning strategy that combines noise-aware training of segmentation
network and noisy label refinement, both guided by the superpixels. This design
enables us to exploit the structural constraints in segmentation labels and
effectively mitigate the impact of label noise in learning. Experiments on two
benchmarks show that our method outperforms recent state-of-the-art approaches,
and achieves superior robustness in a wide range of label noises. Code is
available at https://github.com/gaozhitong/SP_guided_Noisy_Label_Seg.",['cs.CV']
68,Weighted Intersection over Union (wIoU): A New Evaluation Metric for Image Segmentation,"In this paper, we propose a novel evaluation metric for performance
evaluation of semantic segmentation. In recent years, many studies have tried
to train pixel-level classifiers on large-scale image datasets to perform
accurate semantic segmentation. The goal of semantic segmentation is to assign
a class label to each pixel in the scene. It has various potential applications
in computer vision fields, e.g., object detection, classification, and scene
understanding. To validate the proposed wIoU evaluation metric, we
tested state-of-the-art methods on public benchmark datasets (e.g., KITTI)
based on the proposed wIoU metric and compared with other conventional
evaluation metrics.",['cs.CV']
69,Vessel-CAPTCHA: an efficient learning framework for vessel annotation and segmentation,"Deep learning techniques for 3D brain vessel image segmentation have not been
as successful as in the segmentation of other organs and tissues. This can be
explained by two factors. First, deep learning techniques tend to show poor
performances at the segmentation of relatively small objects compared to the
size of the full image. Second, due to the complexity of vascular trees and the
small size of vessels, it is challenging to obtain the amount of annotated
training data typically needed by deep learning methods. To address these
problems, we propose a novel annotation-efficient deep learning vessel
segmentation framework. The framework avoids pixel-wise annotations, only
requiring weak patch-level labels to discriminate between vessel and non-vessel
2D patches in the training set, in a setup similar to the CAPTCHAs used to
differentiate humans from bots in web applications. The user-provided weak
annotations are used for two tasks: 1) to synthesize pixel-wise pseudo-labels
for vessels and background in each patch, which are used to train a
segmentation network, and 2) to train a classifier network. The classifier
network makes it possible to generate additional weak patch labels, further reducing the
annotation burden, and it acts as a noise filter for poor quality images. We
use this framework for the segmentation of the cerebrovascular tree in
Time-of-Flight angiography (TOF) and Susceptibility-Weighted Images (SWI). The
results show that the framework achieves state-of-the-art accuracy, while
reducing the annotation time by ~77% w.r.t. learning-based segmentation methods
using pixel-wise labels for training.",['cs.CV']
70,LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation,"Medical image segmentation plays an essential role in developing
computer-assisted diagnosis and therapy systems, yet still faces many
challenges. In the past few years, the popular encoder-decoder architectures
based on CNNs (e.g., U-Net) have been successfully applied in the task of
medical image segmentation. However, due to the locality of convolution
operations, they demonstrate limitations in learning global context and
long-range spatial relations. Recently, several researchers have tried to introduce
transformers to both the encoder and decoder components with promising results,
but the efficiency requires further improvement due to the high computational
complexity of transformers. In this paper, we propose LeViT-UNet, which
integrates a LeViT Transformer module into the U-Net architecture, for fast and
accurate medical image segmentation. Specifically, we use LeViT as the encoder
of the LeViT-UNet, which better trades off the accuracy and efficiency of the
Transformer block. Moreover, multi-scale feature maps from transformer blocks
and convolutional blocks of LeViT are passed into the decoder via
skip-connection, which can effectively reuse the spatial information of the
feature maps. Our experiments indicate that the proposed LeViT-UNet achieves
better performance compared to various competing methods on several
challenging medical image segmentation benchmarks including Synapse and ACDC.
Code and models will be publicly available at
https://github.com/apple1986/LeViT_UNet.",['cs.CV']
71,Double Similarity Distillation for Semantic Image Segmentation,"The balance between high accuracy and high speed has always been a
challenging task in semantic image segmentation. Compact segmentation networks
are more widely used in the case of limited resources, while their performances
are constrained. In this paper, motivated by the residual learning and global
aggregation, we propose a simple yet general and effective knowledge
distillation framework called double similarity distillation (DSD) to improve
the classification accuracy of all existing compact networks by capturing the
similarity knowledge in pixel and category dimensions, respectively.
Specifically, we propose a pixel-wise similarity distillation (PSD) module that
utilizes residual attention maps to capture more detailed spatial dependencies
across multiple layers. Compared with existing methods, the PSD module greatly
reduces the amount of calculation and is easy to expand. Furthermore,
considering the differences in characteristics between semantic segmentation
task and other computer vision tasks, we propose a category-wise similarity
distillation (CSD) module, which can help the compact segmentation network
strengthen the global category correlation by constructing the correlation
matrix. Combining these two modules, the DSD framework has no extra parameters and
only a minimal increase in FLOPs. Extensive experiments on four challenging
datasets, including Cityscapes, CamVid, ADE20K, and Pascal VOC 2012, show that
DSD outperforms current state-of-the-art methods, proving its effectiveness and
generality. The code and models will be publicly available.",['cs.CV']
72,A Weighted Difference of Anisotropic and Isotropic Total Variation for Relaxed Mumford-Shah Color and Multiphase Image Segmentation,"In a class of piecewise-constant image segmentation models, we propose to
incorporate a weighted difference of anisotropic and isotropic total variation
(AITV) to regularize the partition boundaries in an image. In particular, we
replace the total variation regularization in the Chan-Vese segmentation model
and a fuzzy region competition model by the proposed AITV. To deal with the
nonconvex nature of AITV, we apply the difference-of-convex algorithm (DCA), in
which the subproblems can be minimized by the primal-dual hybrid gradient
method with linesearch. The convergence of the DCA scheme is analyzed. In
addition, a generalization to color image segmentation is discussed. In the
numerical experiments, we compare the proposed models with the classic convex
approaches and the two-stage segmentation methods (smoothing and then
thresholding) on various images, showing that our models are effective in image
segmentation and robust with respect to impulsive noise.",['cs.CV']
73,What Image Features Boost Housing Market Predictions?,"The attractiveness of a property is one of the most interesting, yet
challenging, categories to model. Image characteristics are used to describe
certain attributes, and to examine the influence of visual factors on the price
or timeframe of the listing. In this paper, we propose a set of techniques for
the extraction of visual features for efficient numerical inclusion in
modern-day predictive algorithms. We discuss techniques such as Shannon's
entropy, calculating the center of gravity, employing image segmentation, and
using Convolutional Neural Networks. After comparing these techniques as
applied to a set of property-related images (indoor, outdoor, and satellite),
we conclude the following: (i) the entropy is the most efficient single-digit
visual measure for housing price prediction; (ii) image segmentation is the
most important visual feature for the prediction of housing lifespan; and (iii)
deep image features can be used to quantify interior characteristics and
contribute to captivation modeling. The set of 40 image features selected here
carries a significant amount of predictive power and outperforms some of the
strongest metadata predictors. Without any need to replace a human expert in a
real-estate appraisal process, we conclude that the techniques presented in
this paper can efficiently describe visible characteristics, thus introducing
perceived attractiveness as a quantitative measure into the predictive modeling
of housing.","['cs.CV', 'cs.LG']"
74,Learning from Partially Overlapping Labels: Image Segmentation under Annotation Shift,"Scarcity of high quality annotated images remains a limiting factor for
training accurate image segmentation models. While more and more annotated
datasets become publicly available, the number of samples in each individual
database is often small. Combining different databases to create larger amounts
of training data is appealing yet challenging due to the heterogeneity as a
result of differences in data acquisition and annotation processes, often
yielding incompatible or even conflicting information. In this paper, we
investigate and propose several strategies for learning from partially
overlapping labels in the context of abdominal organ segmentation. We find that
combining a semi-supervised approach with an adaptive cross entropy loss can
successfully exploit heterogeneously annotated data and substantially improve
segmentation accuracy compared to baseline and alternative approaches.",['cs.CV']
75,TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation,"In recent years, computer-aided diagnosis has become an increasingly popular
topic. Methods based on convolutional neural networks have achieved good
performance in medical image segmentation and classification. Due to the
limitations of the convolution operation, the long-term spatial features are
often not accurately obtained. Hence, we propose a TransClaw U-Net network
structure, which combines the convolution operation with the transformer
operation in the encoding part. The convolution part is applied for extracting
the shallow spatial features to facilitate the recovery of the image resolution
after upsampling. The transformer part is used to encode the patches, and the
self-attention mechanism is used to obtain global information between
sequences. The decoding part retains the bottom upsampling structure for better
detail segmentation performance. The experimental results on Synapse
Multi-organ Segmentation Datasets show that the performance of TransClaw U-Net
is better than other network structures. The ablation experiments also prove
the generalization performance of TransClaw U-Net.","['cs.CV', 'cs.AI', 'cs.LG', 'eess.IV']"
76,Privacy Preserving Domain Adaptation for Semantic Segmentation of Medical Images,"Convolutional neural networks (CNNs) have led to significant improvements in
tasks involving semantic segmentation of images. CNNs are vulnerable in the
area of biomedical image segmentation because of the distributional gap between
source and target domains with different data modalities, which leads to domain
shift. Domain shift makes data annotations in new modalities necessary because
models must be retrained from scratch. Unsupervised domain adaptation (UDA) is
proposed to adapt a model to new modalities using solely unlabeled target
domain data. Common UDA algorithms require access to data points in the source
domain which may not be feasible in medical imaging due to privacy concerns. In
this work, we develop an algorithm for UDA in a privacy-constrained setting,
where the source domain data is inaccessible. Our idea is based on encoding the
information from the source samples into a prototypical distribution that is
used as an intermediate distribution for aligning the target domain
distribution with the source domain distribution. We demonstrate the
effectiveness of our algorithm by comparing it to state-of-the-art medical
image semantic segmentation approaches on two medical image semantic
segmentation datasets.","['cs.CV', 'cs.CR', 'cs.LG', 'eess.IV']"
77,A Spatial Guided Self-supervised Clustering Network for Medical Image Segmentation,"The segmentation of medical images is a fundamental step in automated
clinical decision support systems. Existing medical image segmentation methods
based on supervised deep learning, however, remain problematic because of their
reliance on large amounts of labelled training data. Although medical imaging
data repositories continue to expand, there has not been a commensurate
increase in the amount of annotated data. Hence, we propose a new spatial
guided self-supervised clustering network (SGSCN) for medical image
segmentation, where we introduce multiple loss functions designed to aid in
grouping image pixels that are spatially connected and have similar feature
representations. It iteratively learns feature representations and clustering
assignment of each pixel in an end-to-end fashion from a single image. We also
propose a context-based consistency loss that better delineates the shape and
boundaries of image regions. It enforces all the pixels belonging to a cluster
to be spatially close to the cluster centre. We evaluated our method on 2
public medical image datasets and compared it to existing conventional and
self-supervised clustering methods. Experimental results show that our method
was most accurate for medical image segmentation.",['cs.CV']
78,Anatomy of Domain Shift Impact on U-Net Layers in MRI Segmentation,"Domain Adaptation (DA) methods are widely used in medical image segmentation
tasks to tackle the problem of differently distributed train (source) and test
(target) data. We consider the supervised DA task with a limited number of
annotated samples from the target domain. It corresponds to one of the most
relevant clinical setups: building a sufficiently accurate model on the minimum
possible amount of annotated data. Existing methods mostly fine-tune specific
layers of the pretrained Convolutional Neural Network (CNN). However, there is
no consensus on which layers are better to fine-tune, e.g. the first layers for
images with low-level domain shift or the deeper layers for images with
high-level domain shift. To this end, we propose SpotTUnet - a CNN architecture
that automatically chooses the layers which should be optimally fine-tuned.
More specifically, on the target domain, our method additionally learns the
policy that indicates whether a specific layer should be fine-tuned or reused
from the pretrained network. We show that our method performs at the same level
as the best of the nonflexible fine-tuning methods even under the extreme
scarcity of annotated data. Secondly, we show that SpotTUnet policy provides a
layer-wise visualization of the domain shift impact on the network, which could
be further used to develop robust domain generalization methods. In order to
extensively evaluate SpotTUnet performance, we use a publicly available dataset
of brain MR images (CC359), characterized by explicit domain shift. We release
a reproducible experimental pipeline.","['cs.CV', 'cs.LG', 'eess.IV']"
79,Hierarchical Self-Supervised Learning for Medical Image Segmentation Based on Multi-Domain Data Aggregation,"A large labeled dataset is a key to the success of supervised deep learning,
but for medical image segmentation, it is highly challenging to obtain
sufficient annotated images for model training. In many scenarios, unannotated
images are abundant and easy to acquire. Self-supervised learning (SSL) has
shown great potential in exploiting raw data information and representation
learning. In this paper, we propose Hierarchical Self-Supervised Learning
(HSSL), a new self-supervised framework that boosts medical image segmentation
by making good use of unannotated data. Unlike the current literature on
task-specific self-supervised pretraining followed by supervised fine-tuning,
we utilize SSL to learn task-agnostic knowledge from heterogeneous data for
various medical image segmentation tasks. Specifically, we first aggregate a
dataset from several medical challenges, then pre-train the network in a
self-supervised manner, and finally fine-tune on labeled data. We develop a new
loss function by combining contrastive loss and classification loss and
pretrain an encoder-decoder architecture for segmentation tasks. Our extensive
experiments show that multi-domain joint pre-training benefits downstream
segmentation tasks and outperforms single-domain pre-training significantly.
Compared to learning from scratch, our new method yields better performance on
various tasks (e.g., +0.69% to +18.60% in Dice scores with 5% of annotated
data). With limited amounts of training data, our method can substantially
bridge the performance gap w.r.t. denser annotations (e.g., 10% vs. 100% of
annotated data).",['cs.CV']
80,TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation,"Medical image segmentation - the prerequisite of numerous clinical needs -
has advanced significantly through recent progress in convolutional neural
networks (CNNs). However, CNNs exhibit general limitations in modeling explicit
long-range relations, and existing remedies, which resort to building deep encoders
along with aggressive downsampling operations, lead to redundantly deepened
networks and loss of localized details. Hence, the segmentation task awaits a
better solution to improve the efficiency of modeling global contexts while
maintaining a strong grasp of low-level details. In this paper, we propose a
novel parallel-in-branch architecture, TransFuse, to address this challenge.
TransFuse combines Transformers and CNNs in a parallel style, where both global
dependency and low-level spatial details can be efficiently captured in a much
shallower manner. Besides, a novel fusion technique, the BiFusion module, is
created to efficiently fuse the multi-level features from both branches.
Extensive experiments demonstrate that TransFuse achieves the newest
state-of-the-art results on both 2D and 3D medical image sets including polyp,
skin lesion, hip, and prostate segmentation, with significant parameter
decrease and inference speed improvement.","['cs.CV', 'cs.AI']"
81,Towards Robust General Medical Image Segmentation,"The reliability of Deep Learning systems depends on their accuracy but also
on their robustness against adversarial perturbations to the input data.
Several attacks and defenses have been proposed to improve the performance of
Deep Neural Networks in the presence of adversarial noise in the natural
image domain. However, robustness in computer-aided diagnosis for volumetric
data has only been explored for specific tasks and with limited attacks. We
propose a new framework to assess the robustness of general medical image
segmentation systems. Our contributions are two-fold: (i) we propose a new
benchmark to evaluate robustness in the context of the Medical Segmentation
Decathlon (MSD) by extending the recent AutoAttack natural image classification
framework to the domain of volumetric data segmentation, and (ii) we present a
novel lattice architecture for RObust Generic medical image segmentation (ROG).
Our results show that ROG is capable of generalizing across different tasks of
the MSD and largely surpasses the state-of-the-art under sophisticated
adversarial attacks.",['cs.CV']
82,Medical Matting: A New Perspective on Medical Segmentation with Uncertainty,"In medical image segmentation, it is difficult to mark ambiguous areas
accurately with binary masks, especially when dealing with small lesions.
Therefore, it is a challenge for radiologists to reach a consensus by using
binary masks under the condition of multiple annotations. However, these areas
may contain anatomical structures that are conducive to diagnosis. Uncertainty
is introduced to study these situations. Nevertheless, the uncertainty is
usually measured by the variances between predictions in a multiple trial way.
It is not intuitive, and there is no exact correspondence in the image.
Inspired by image matting, we introduce matting as a soft segmentation method
and a new perspective to deal with and represent uncertain regions in medical
scenes, namely medical matting. More specifically, because there is no
available medical matting dataset, we first labeled two medical datasets with
alpha matte. Secondly, the matting method applied to the natural image is not
suitable for the medical scene, so we propose a new architecture to generate
binary masks and alpha matte in a row. Thirdly, the uncertainty map is
introduced to highlight the ambiguous regions from the binary results and
improve the matting performance. Evaluated on these datasets, the proposed
model outperformed state-of-the-art matting algorithms by a large margin, and
alpha matte is proved to be a more efficient labeling form than a binary mask.",['cs.CV']
83,Flexibly Regularized Mixture Models and Application to Image Segmentation,"Probabilistic finite mixture models are widely used for unsupervised
clustering. These models can often be improved by adapting them to the topology
of the data. For instance, in order to classify spatially adjacent data points
similarly, it is common to introduce a Laplacian constraint on the posterior
probability that each data point belongs to a class. Alternatively, the mixing
probabilities can be treated as free parameters, while assuming Gauss-Markov or
more complex priors to regularize those mixing probabilities. However, these
approaches are constrained by the shape of the prior and often lead to
complicated or intractable inference. Here, we propose a new parametrization of
the Dirichlet distribution to flexibly regularize the mixing probabilities of
over-parametrized mixture distributions. Using the Expectation-Maximization
algorithm, we show that our approach allows us to define any linear update rule
for the mixing probabilities, including spatial smoothing regularization as a
special case. We then show that this flexible design can be extended to share
class information between multiple mixture models. We apply our algorithm to
artificial and natural image segmentation tasks, and we provide quantitative
and qualitative comparison of the performance of Gaussian and Student-t
mixtures on the Berkeley Segmentation Dataset. We also demonstrate how to
propagate class information across the layers of deep convolutional neural
networks in a probabilistically optimal way, suggesting a new interpretation
for feedback signals in biological visual systems. Our flexible approach can be
easily generalized to adapt probabilistic mixture models to arbitrary data
topologies.","['cs.CV', 'cs.LG', 'q-bio.NC']"
84,Hierarchical Semantic Segmentation using Psychometric Learning,"Assigning meaning to parts of image data is the goal of semantic image
segmentation. Machine learning methods, specifically supervised learning, are
commonly used in a variety of tasks formulated as semantic segmentation. One of
the major challenges in the supervised learning approaches is expressing and
collecting the rich knowledge that experts have with respect to the meaning
present in the image data. Towards this, typically a fixed set of labels is
specified and experts are tasked with annotating the pixels, patches or
segments in the images with the given labels. In general, however, the set of
classes does not fully capture the rich semantic information present in the
images. For example, in medical imaging such as histology images, the different
parts of cells could be grouped and sub-grouped based on the expertise of the
pathologist.
  To achieve such a precise semantic representation of the concepts in the
image, we need access to the full depth of knowledge of the annotator. In this
work, we develop a novel approach to collect segmentation annotations from
experts based on psychometric testing. Our method consists of the psychometric
testing procedure, active query selection, query enhancement, and a deep metric
learning model to achieve a patch-level image embedding that allows for
semantic segmentation of images. We show the merits of our method with
evaluation on synthetically generated, aerial, and histology
images.","['cs.CV', 'cs.AI']"
85,Semi-supervised Left Atrium Segmentation with Mutual Consistency Training,"Semi-supervised learning has attracted great attention in the field of
machine learning, especially for medical image segmentation tasks, since it
alleviates the heavy burden of collecting abundant densely annotated data for
training. However, most existing methods underestimate the importance of
challenging regions (e.g. small branches or blurred edges) during training. We
believe that these unlabeled regions may contain more crucial information to
minimize the uncertainty prediction for the model and should be emphasized in
the training process. Therefore, in this paper, we propose a novel Mutual
Consistency Network (MC-Net) for semi-supervised left atrium segmentation from
3D MR images. Particularly, our MC-Net consists of one encoder and two slightly
different decoders, and the prediction discrepancies of two decoders are
transformed as an unsupervised loss by our designed cycled pseudo label scheme
to encourage mutual consistency. Such mutual consistency encourages the two
decoders to have consistent and low-entropy predictions and enables the model
to gradually capture generalized features from these unlabeled challenging
regions. We evaluate our MC-Net on the public Left Atrium (LA) database and it
obtains impressive performance gains by exploiting the unlabeled data
effectively. Our MC-Net outperforms six recent semi-supervised methods for left
atrium segmentation, and sets the new state-of-the-art performance on the LA
database.",['cs.CV']
86,Medical Transformer: Gated Axial-Attention for Medical Image Segmentation,"Over the past decade, Deep Convolutional Neural Networks have been widely
adopted for medical image segmentation and shown to achieve adequate
performance. However, due to the inherent inductive biases present in the
convolutional architectures, they lack understanding of long-range dependencies
in the image. Recently proposed Transformer-based architectures that leverage
the self-attention mechanism encode long-range dependencies and learn
representations that are highly expressive. This motivates us to explore
Transformer-based solutions and study the feasibility of using
Transformer-based network architectures for medical image segmentation tasks.
The majority of existing Transformer-based network architectures proposed for
vision applications require large-scale datasets to train properly. However,
compared to the datasets for vision applications, for medical imaging the
number of data samples is relatively low, making it difficult to efficiently
train transformers for medical applications. To this end, we propose a Gated
Axial-Attention model which extends the existing architectures by introducing
an additional control mechanism in the self-attention module. Furthermore, to
train the model effectively on medical images, we propose a Local-Global
training strategy (LoGo) which further improves the performance. Specifically,
we operate on the whole image and patches to learn global and local features,
respectively. The proposed Medical Transformer (MedT) is evaluated on three
different medical image segmentation datasets and it is shown that it achieves
better performance than the convolutional and other related transformer-based
architectures. Code: https://github.com/jeya-maria-jose/Medical-Transformer",['cs.CV']
87,Label noise in segmentation networks : mitigation must deal with bias,"Imperfect labels limit the quality of predictions learned by deep neural
networks. This is particularly relevant in medical image segmentation, where
reference annotations are difficult to collect and vary significantly even
across expert annotators. Prior work on mitigating label noise focused on
simple models of mostly uniform noise. In this work, we explore biased and
unbiased errors artificially introduced to brain tumour annotations on MRI
data. We found that supervised and semi-supervised segmentation methods are
robust or fairly robust to unbiased errors but sensitive to biased errors. It
is therefore important to identify the sorts of errors expected in medical
image labels and especially mitigate the biased errors.","['cs.CV', 'cs.LG']"
88,Quality-Aware Memory Network for Interactive Volumetric Image Segmentation,"Despite recent progress of automatic medical image segmentation techniques,
fully automatic results usually fail to meet clinical requirements and typically
require further refinement. In this work, we propose a quality-aware memory
network for interactive segmentation of 3D medical images. Given user
guidance on an arbitrary slice, an interaction network is first employed to
obtain an initial 2D segmentation. The quality-aware memory network
subsequently propagates the initial segmentation estimation bidirectionally
over the entire volume. Subsequent refinement based on additional user guidance
on other slices can be incorporated in the same manner. To further facilitate
interactive segmentation, a quality assessment module is introduced to suggest
the next slice to segment based on the current segmentation quality of each
slice. The proposed network has two appealing characteristics: 1) The
memory-augmented network offers the ability to quickly encode past segmentation
information, which will be retrieved for the segmentation of other slices; 2)
The quality assessment module enables the model to directly estimate the
qualities of segmentation predictions, which allows an active learning paradigm
where users preferentially label the lowest-quality slice for multi-round
refinement. The proposed network leads to a robust interactive segmentation
engine, which can generalize well to various types of user annotations (e.g.,
scribbles, boxes). Experimental results on various medical datasets demonstrate
the superiority of our approach in comparison with existing techniques.",['cs.CV']
89,A Deep Learning Object Detection Method for an Efficient Clusters Initialization,"Clustering is an unsupervised machine learning method grouping data samples
into clusters of similar objects. In practice, clustering has been used in
numerous applications such as banking customers profiling, document retrieval,
image segmentation, and e-commerce recommendation engines. However, the
existing clustering techniques present significant limitations, among which is
the dependence of their stability on the initialization parameters (e.g.
number of clusters, centroids). Different solutions were presented in the
literature to overcome this limitation (i.e. internal and external validation
metrics). However, these solutions require high computational complexity and
memory consumption, especially when dealing with big data. In this paper, we
apply the recent object detection Deep Learning (DL) model, named YOLO-v5, to
detect the initial clustering parameters such as the number of clusters with
their sizes and centroids. Mainly, the proposed solution consists of adding a
DL-based initialization phase making the clustering algorithms free of
initialization. Two model solutions are provided in this work, one for isolated
clusters and the other one for overlapping clusters. The features of the
incoming dataset determine which model to use. Moreover, the results show that
the proposed solution can provide near-optimal cluster initialization
parameters with low computational and resource overhead compared to existing
solutions.",['cs.CV']
90,CHASE: Robust Visual Tracking via Cell-Level Differentiable Neural Architecture Search,"A strong visual object tracker nowadays relies on its well-crafted modules,
which typically consist of manually-designed network architectures to deliver
high-quality tracking results. Not surprisingly, the manual design process
becomes a particularly challenging barrier, as it demands sufficient prior
experience, enormous effort, intuition and perhaps some good luck. Meanwhile,
neural architecture search has been gaining ground in practical applications such
as image segmentation, as a promising method in tackling the issue of automated
search of feasible network structures. In this work, we propose a novel
cell-level differentiable architecture search mechanism to automate the network
design of the tracking module, aiming to adapt backbone features to the
objective of a tracking network during offline training. The proposed approach
is simple and efficient, with no need to stack a series of modules to
construct a network. Our approach is easy to incorporate into existing
trackers, which is empirically validated using different differentiable
architecture search-based methods and tracking objectives. Extensive
experimental evaluations demonstrate the superior performance of our approach
over five commonly-used benchmarks. Meanwhile, our automated searching process
takes 41 (18) hours for the second (first) order DARTS method on the
TrackingNet dataset.","['cs.CV', 'cs.AI', 'cs.LG']"
91,Cooperative Training and Latent Space Data Augmentation for Robust Medical Image Segmentation,"Deep learning-based segmentation methods are vulnerable to unforeseen data
distribution shifts during deployment, e.g. change of image appearances or
contrasts caused by different scanners, unexpected imaging artifacts etc. In
this paper, we present a cooperative framework for training image segmentation
models and a latent space augmentation method for generating hard examples.
Both contributions improve model generalization and robustness with limited
data. The cooperative training framework consists of a fast-thinking network
(FTN) and a slow-thinking network (STN). The FTN learns decoupled image
features and shape features for image reconstruction and segmentation tasks.
The STN learns shape priors for segmentation correction and refinement. The two
networks are trained in a cooperative manner. The latent space augmentation
generates challenging examples for training by masking the decoupled latent
space in both channel-wise and spatial-wise manners. We performed extensive
experiments on public cardiac imaging datasets. Using only 10 subjects from a
single site for training, we demonstrated improved cross-site segmentation
performance and increased robustness against various unforeseen imaging
artifacts compared to strong baseline methods. Particularly, cooperative
training with latent space data augmentation yields 15% improvement in terms of
average Dice score when compared to a standard training method.","['cs.CV', 'cs.AI', 'cs.LG', 'q-bio.QM']"
92,UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation,"Transformer architecture has emerged to be successful in a number of natural
language processing tasks. However, its applications to medical vision remain
largely unexplored. In this study, we present UTNet, a simple yet powerful
hybrid Transformer architecture that integrates self-attention into a
convolutional neural network for enhancing medical image segmentation. UTNet
applies self-attention modules in both encoder and decoder for capturing
long-range dependency at different scales with minimal overhead. To this end,
we propose an efficient self-attention mechanism along with relative position
encoding that reduces the complexity of self-attention operation significantly
from $O(n^2)$ to approximately $O(n)$. A new self-attention decoder is also
proposed to recover fine-grained details from the skipped connections in the
encoder. Our approach addresses the dilemma that Transformer requires huge
amounts of data to learn vision inductive bias. Our hybrid layer design allows
the initialization of Transformer into convolutional networks without the need for
pre-training. We have evaluated UTNet on the multi-label, multi-vendor cardiac
magnetic resonance imaging cohort. UTNet demonstrates superior segmentation
performance and robustness against the state-of-the-art approaches, holding the
promise to generalize well on other medical image segmentations.",['cs.CV']
93,Unsupervised Image Segmentation by Mutual Information Maximization and Adversarial Regularization,"Semantic segmentation is one of the basic, yet essential scene understanding
tasks for an autonomous agent. The recent developments in supervised machine
learning and neural networks have enjoyed great success in enhancing the
performance of the state-of-the-art techniques for this task. However, their
superior performance is highly reliant on the availability of a large-scale
annotated dataset. In this paper, we propose a novel fully unsupervised
semantic segmentation method, the so-called Information Maximization and
Adversarial Regularization Segmentation (InMARS). Inspired by human perception
which parses a scene into perceptual groups, rather than analyzing each pixel
individually, our proposed approach first partitions an input image into
meaningful regions (also known as superpixels). Next, it utilizes
Mutual-Information-Maximization followed by an adversarial training strategy to
cluster these regions into semantically meaningful classes. To customize an
adversarial training scheme for the problem, we incorporate adversarial pixel
noise along with spatial perturbations to impose photometrical and geometrical
invariance on the deep neural network. Our experiments demonstrate that our
method achieves state-of-the-art performance on two commonly used
unsupervised semantic segmentation datasets, COCO-Stuff, and Potsdam.",['cs.CV']
94,Inter Extreme Points Geodesics for Weakly Supervised Segmentation,"We introduce $\textit{InExtremIS}$, a weakly supervised 3D approach to train
a deep image segmentation network using particularly weak train-time
annotations: only 6 extreme clicks at the boundary of the objects of interest.
Our fully-automatic method is trained end-to-end and does not require any
test-time annotations. From the extreme points, 3D bounding boxes are extracted
around objects of interest. Then, deep geodesics connecting extreme points are
generated to increase the amount of ""annotated"" voxels within the bounding
boxes. Finally, a weakly supervised regularised loss derived from a Conditional
Random Field formulation is used to encourage prediction consistency over
homogeneous regions. Extensive experiments are performed on a large open
dataset for Vestibular Schwannoma segmentation. $\textit{InExtremIS}$ obtained
competitive performance, approaching full supervision and significantly
outperforming other weakly supervised techniques based on bounding boxes.
Moreover, given a fixed annotation time budget, $\textit{InExtremIS}$
outperforms full supervision. Our code and data are available online.",['cs.CV']
95,Segmenting two-dimensional structures with strided tensor networks,"Tensor networks provide an efficient approximation of operations involving
high dimensional tensors and have been extensively used in modelling quantum
many-body systems. More recently, supervised learning has been attempted with
tensor networks, primarily focused on tasks such as image classification. In
this work, we propose a novel formulation of tensor networks for supervised
image segmentation which allows them to operate on high resolution medical
images. We use the matrix product state (MPS) tensor network on non-overlapping
patches of a given input image to predict the segmentation mask by learning a
pixel-wise linear classification rule in a high dimensional space. The proposed
model is end-to-end trainable using backpropagation. It is implemented as a
Strided Tensor Network to reduce the parameter complexity. The performance of
the proposed method is evaluated on two public medical imaging datasets and
compared to relevant baselines. The evaluation shows that the strided tensor
network yields competitive performance compared to CNN-based models while using
fewer resources. Additionally, based on the experiments we discuss the
feasibility of using fully linear models for segmentation tasks.",['cs.CV']
96,Segmentation with Multiple Acceptable Annotations: A Case Study of Myocardial Segmentation in Contrast Echocardiography,"Most existing deep learning-based frameworks for image segmentation assume
that a unique ground truth is known and can be used for performance evaluation.
This is true for many applications, but not all. Myocardial segmentation of
Myocardial Contrast Echocardiography (MCE), a critical task in automatic
myocardial perfusion analysis, is an example. Due to the low resolution and
serious artifacts in MCE data, annotations from different cardiologists can
vary significantly, and it is hard to tell which one is the best. In this case,
how can we find a good way to evaluate segmentation performance and how do we
train the neural network? In this paper, we address the first problem by
proposing a new extended Dice to effectively evaluate the segmentation
performance when multiple accepted ground truths are available. Then, based on our
proposed metric, we solve the second problem by further incorporating the new
metric into a loss function that enables neural networks to flexibly learn
general features of myocardium. Experimental results on our clinical MCE data set
demonstrate that the neural network trained with the proposed loss function
outperforms those existing ones that try to obtain a unique ground truth from
multiple annotations, both quantitatively and qualitatively. Finally, our
grading study shows that using extended Dice as an evaluation metric can better
identify segmentation results that need manual correction compared with using
Dice.",['cs.CV']
97,Information-Theoretic Segmentation by Inpainting Error Maximization,"We study image segmentation from an information-theoretic perspective,
proposing a novel adversarial method that performs unsupervised segmentation by
partitioning images into maximally independent sets. More specifically, we
group image pixels into foreground and background, with the goal of minimizing
predictability of one set from the other. An easily computed loss drives a
greedy search process to maximize inpainting error over these partitions. Our
method does not involve training deep networks, is computationally cheap,
class-agnostic, and even applicable in isolation to a single unlabeled image.
Experiments demonstrate that it achieves a new state-of-the-art in unsupervised
segmentation quality, while being substantially faster and more general than
competing approaches.",['cs.CV']
98,Are conditional GANs explicitly conditional?,"This paper proposes two important contributions for conditional Generative
Adversarial Networks (cGANs) to improve the wide variety of applications that
exploit this architecture. The first main contribution is an analysis of cGANs
to show that they are not explicitly conditional. In particular, it will be
shown that the discriminator, and subsequently the cGAN, does not automatically
learn the conditionality between inputs. The second contribution is a new
method, called acontrario, that explicitly models conditionality for both parts
of the adversarial architecture via a novel acontrario loss that involves
training the discriminator to learn unconditional (adverse) examples. This
leads to a novel type of data augmentation approach for GANs (acontrario
learning), which allows restricting the search space of the generator to
conditional outputs using adverse examples. Extensive experimentation is
carried out to evaluate the conditionality of the discriminator by proposing a
probability distribution analysis. Comparisons with the cGAN architecture for
different applications show significant improvements in performance on
well-known datasets for tasks including semantic image synthesis, image segmentation and
monocular depth prediction using different metrics including Fr\'echet
Inception Distance (FID), mean Intersection over Union (mIoU), Root Mean Square
Error log (RMSE log) and Number of statistically-Different Bins (NDB).","['cs.CV', 'cs.AI']"
99,K-Net: Towards Unified Image Segmentation,"Semantic, instance, and panoptic segmentations have been addressed using
different and specialized frameworks despite their underlying connections. This
paper presents a unified, simple, and effective framework for these essentially
similar tasks. The framework, named K-Net, segments both instances and semantic
categories consistently by a group of learnable kernels, where each kernel is
responsible for generating a mask for either a potential instance or a stuff
class. To remedy the difficulties of distinguishing various instances, we
propose a kernel update strategy that makes each kernel dynamic and
conditional on its meaningful group in the input image. K-Net can be trained in
an end-to-end manner with bipartite matching, and its training and inference
are naturally NMS-free and box-free. Without bells and whistles, K-Net
surpasses all previous state-of-the-art single-model results of panoptic
segmentation on MS COCO and semantic segmentation on ADE20K with 52.1% PQ and
54.3% mIoU, respectively. Its instance segmentation performance is also on par
with Cascade Mask R-CNN on MS COCO with 60%-90% faster inference speeds. Code
and models will be released at https://github.com/open-mmlab/mmdetection.","['cs.CV', 'cs.AI']"
100,Poisoning the Search Space in Neural Architecture Search,"Deep learning has proven to be a highly effective problem-solving tool for
object detection and image segmentation across various domains such as
healthcare and autonomous driving. At the heart of this performance lies neural
architecture design which relies heavily on domain knowledge and prior
experience on the researchers' part. More recently, this process of finding
optimal architectures, given an initial search space of possible
operations, was automated by Neural Architecture Search (NAS). In this paper,
we evaluate the robustness of one such algorithm known as Efficient NAS (ENAS)
against data-agnostic poisoning attacks on the original search space with
carefully designed ineffective operations. By evaluating algorithm performance
on the CIFAR-10 dataset, we empirically demonstrate how our novel search space
poisoning (SSP) approach and multiple-instance poisoning attacks exploit design
flaws in the ENAS controller to result in inflated prediction error rates for
child networks. Our results provide insights into the challenges to surmount in
using NAS for more adversarially robust architecture search.","['cs.LG', 'cs.CR', 'cs.NE', 'stat.ML']"