video $y_i \in \mathbb{R}^{3 \times T \times W \times H}$ for $1 \leq i \leq N$, where $W$ is the frame width, $H$ the frame height, and $T$ the number of sampled frames in the video.
During learning, we want to use the posterior probabilities $g_k(y_i)$ from the teacher vision networks to train the student network $f_k(x_i)$ to recognize concepts given sound; $k$ enumerates the transferred concepts.
During learning, we optimize

$$\min_\theta \sum_{k=1}^{K} \sum_{i=1}^{N} D_{KL}\big(g_k(y_i) \,\|\, f_k(x_i)\big)$$
where

$$D_{KL}(P \,\|\, Q) = \sum_j P_j \log \frac{P_j}{Q_j}$$

is the KL-divergence. KL-divergence is the natural choice because the outputs of $g_k$ can be interpreted as a distribution over categories.
The KL-divergence is differentiable, so the objective is optimized with backpropagation and stochastic gradient descent.
Both a scene network and an object network are transferred ($K = 2$).
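Below is a minimal PyTorch sketch of this distillation objective (the paper's implementation used Torch7; the list-of-heads structure is an assumption about how the $K$ teacher outputs are arranged):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs):
    """Sum over the K concept heads of D_KL(g_k(y_i) || f_k(x_i)).

    student_logits: list of K tensors, each (batch, C_k), from the sound net f.
    teacher_probs:  list of K tensors, each (batch, C_k), posteriors from the teacher g.
    """
    loss = torch.zeros(())
    for s, t in zip(student_logits, teacher_probs):
        log_q = F.log_softmax(s, dim=1)  # log f_k(x_i)
        # F.kl_div takes log-probabilities as input and probabilities as target;
        # "batchmean" averages the summed KL over the batch.
        loss = loss + F.kl_div(log_q, t, reduction="batchmean")
    return loss
```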
### Sound Classification
- The categories we want to recognize from sound may not appear among the visual models' classes. In that case, the output layer is ignored, and the internal representation of a hidden layer is used as input features to train a linear SVM.
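A sketch of that recipe with scikit-learn; `soundnet_features` is a hypothetical helper (not from the paper) that returns a hidden-layer activation vector for one waveform:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical helper and data: soundnet_features(wav) returns an internal
# SoundNet activation (e.g. pool5) as a 1-D array; the waveforms and labels
# come from the labeled sound dataset.
X_train = np.stack([soundnet_features(w) for w in train_waveforms])
svm = LinearSVC(C=1.0)                 # linear SVM on the fixed features
svm.fit(X_train, train_labels)

X_test = np.stack([soundnet_features(w) for w in test_waveforms])
accuracy = svm.score(X_test, test_labels)
```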
### Implementation
- Torch7
- Adam optimizer
- learning rate: 0.001
- momentum term: 0.9
- batch size: 64
- weights initialized with zero-mean Gaussian noise, std 0.01
- Batch normalization after each convolution
- 100,000 iterations
- Optimization took about one day on a GPU (a training-loop sketch follows)
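Translated into a PyTorch-style sketch (the original used Torch7), reusing `distillation_loss` from above; `soundnet` and `loader` are hypothetical stand-ins for the student network and the unlabeled-video batch loader:

```python
import torch

# Adam with lr 0.001 and first-moment coefficient (the momentum term) 0.9
optimizer = torch.optim.Adam(soundnet.parameters(), lr=0.001, betas=(0.9, 0.999))

for step in range(100_000):            # 100,000 iterations
    x, teacher_probs = next(loader)    # batches of 64 sounds + teacher posteriors
    loss = distillation_loss(soundnet(x), teacher_probs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```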
## Experiments
Two training stages: one with videos and one with sound only.
1st training:
- 2M videos for training
- 140,000 videos for validation
2nd training:
- Use hidden representation of the trained network as a feature extractor for learning on smaller labeled sound-only datasets.
- train an SVM on these features (as in the sketch above)
### Acoustic Scene Classification
- The DCASE, ESC-50, and ESC-10 datasets are described.
### Ablation Analysis
#### Comparison of Loss and Teacher Net
- Performance improves with visual supervision.
- Using both the ImageNet and Places networks as supervision works better than using either one alone.
#### Comparison of Network Depth
- The eight-layer architecture is 8% better than the five-layer network.
- The five-layer network is still better than the prior state of the art.
#### Comparison of Supervision
- Train the network without video, using only the target sound training set.
- The output of the network is class probabilities.
- The five-layer network performs slightly better than a convolutional network trained on the same data.
- The eight-layer network performs worse, possibly because of overfitting.
#### Comparison of Layer and Teacher Network
- Features from the pool5 layer give the best performance.
- Different teacher networks were tried; each was better on a different dataset, so the comparison is inconclusive.
### Multi-Modal Recognition
#### Vision vs Sound Embeddings
- 2-dimensional t-SNE is used to visualize features from the visual networks and SoundNet (see the sketch below).
- The sound features alone contain a considerable amount of semantic information.
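A minimal scikit-learn sketch of that visualization step; `feats` is a hypothetical $(N, D)$ matrix of SoundNet (or visual) features:

```python
from sklearn.manifold import TSNE

# Hypothetical input: feats is an (N, D) feature matrix.
# Project the high-dimensional features down to 2-D for plotting.
emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
# emb is (N, 2); scatter-plot it colored by class label to inspect clusters.
```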
#### Object and Scene Classification
- Trained an SVM over both sound and visual features (a minimal fusion sketch follows).
- Although sound is not as informative as vision, it still contains a considerable amount of discriminative information.
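A sketch of that multi-modal setup, under the assumption that the two feature sets are simply concatenated; `sound_feats`, `visual_feats`, and `labels` are hypothetical, row-aligned arrays:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical per-example feature matrices from SoundNet and the vision
# networks, aligned row-by-row on the same examples.
X = np.concatenate([sound_feats, visual_feats], axis=1)  # fuse both modalities
clf = LinearSVC().fit(X, labels)
```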
### Visualizations
- Visualize what the network learned (a plotting sketch follows).
- The learned filters are diverse: low and high frequencies, wavelet-like patterns, and filters with increasing and decreasing amplitude.
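One way to eyeball such filters, assuming `conv1_w` is a hypothetical `(num_filters, filter_length)` array of first-layer weights learned on raw waveforms:

```python
import matplotlib.pyplot as plt

# Plot the first 16 one-dimensional filters as waveforms.
fig, axes = plt.subplots(4, 4, figsize=(10, 6))
for ax, w in zip(axes.flat, conv1_w[:16]):
    ax.plot(w)          # wavelet-like shapes, low/high frequencies, amplitude ramps
    ax.set_axis_off()
plt.show()
```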
## Conclusion
- Train deep sound networks (SoundNet) by transferring knowledge from established vision networks and large amounts of unlabeled video.
- The transfer yielded semantically rich audio representations for natural sounds; this is a powerful paradigm.