Link to the Paper: https://arxiv.org/abs/1411.4555

Main Points:

  1. A generative model (NIC: GoogLeNet + LSTM) based on a deep recurrent architecture. The model is trained to maximize the likelihood P(S|I) of the target description sentence S given the training image I. S = {S1, S2, ...} is the target sequence of words that adequately describes the image, and each word St comes from a given dictionary.
  2. The authors use a CNN as an image "encoder": they first pre-train it on an image classification task, then use its last hidden layer as the input to the RNN decoder that generates sentences. They call this model the Neural Image Caption, or NIC.
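The likelihood in point 1 can be written out with the chain rule over the words of the sentence. The following is a sketch in the notation of these notes; the symbols \theta (model parameters) and N (sentence length) are introduced here for illustration and do not appear in the text above.

```latex
% Training objective: choose parameters that maximize the log-likelihood
% of the correct description S over all (image, sentence) training pairs.
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta)

% Chain rule: the sentence likelihood factorizes over its words,
% each conditioned on the image and on the words before it.
\log p(S \mid I) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \ldots, S_{t-1})
```

Each conditional term is what the RNN decoder models at step t, which is why a recurrent architecture fits this objective.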

[Figures 1 and 2: Paper Reading - Show and Tell: A Neural Image Caption Generator (CVPR 2015)]

Other Key Points:

  1. A description must capture not only the objects contained in an image, but also how these objects relate to each other, their attributes, and the activities they are involved in.
  2. Image captioning draws its inspiration from advances in machine translation.
  3. NIC can generate a sentence for a given image in several ways. The first is Sampling: sample the first word according to p1, then feed the corresponding embedding back as input and sample from p2, continuing until the special end-of-sentence token is sampled or some maximum length is reached. The second is BeamSearch: iteratively take the k best sentences up to time t as candidates for generating sentences of size t+1, and keep only the best k of the results. This better approximates S = arg max_{S'} p(S'|I).
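The two decoding strategies in point 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the vocabulary, the probability table, and the function names (`next_word_probs`, `sample_caption`, `beam_search`) are all hypothetical, and the toy table stands in for the LSTM's per-step distribution p_t.

```python
import math
import random

# Hypothetical toy vocabulary; "<eos>" is the end-of-sentence token.
VOCAB = ["a", "dog", "runs", "<eos>"]

def next_word_probs(prefix):
    """Toy stand-in for the decoder's p_t: a distribution over VOCAB
    given the words generated so far (assumed values, not from the paper)."""
    table = {
        (): [0.7, 0.2, 0.05, 0.05],
        ("a",): [0.05, 0.8, 0.1, 0.05],
        ("a", "dog"): [0.05, 0.05, 0.6, 0.3],
    }
    # Any prefix not in the table ends the sentence with high probability.
    return table.get(tuple(prefix), [0.05, 0.05, 0.05, 0.85])

def sample_caption(max_len=10, rng=random):
    """Sampling: draw each word from p_t until <eos> or max_len."""
    words = []
    while len(words) < max_len:
        word = rng.choices(VOCAB, weights=next_word_probs(words))[0]
        if word == "<eos>":
            break
        words.append(word)
    return words

def beam_search(k=3, max_len=10):
    """Beam search: keep the k best partial sentences by log-probability."""
    beams = [(0.0, [], False)]  # (log-probability, words, finished?)
    for _ in range(max_len):
        candidates = []
        for logp, words, done in beams:
            if done:
                # Finished sentences stay in the pool unchanged.
                candidates.append((logp, words, True))
                continue
            for word, p in zip(VOCAB, next_word_probs(words)):
                if word == "<eos>":
                    candidates.append((logp + math.log(p), words, True))
                else:
                    candidates.append((logp + math.log(p), words + [word], False))
        # Keep only the k highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(done for _, _, done in beams):
            break
    best_logp, best_words, _ = beams[0]
    return best_words, best_logp
```

With this toy table, `beam_search(k=3)` returns the highest-probability sentence, while `sample_caption()` can return a different caption on each call. That contrast is the trade-off between the two strategies: sampling gives variety, beam search gives a better approximation of the arg max.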