anystyle 1.0.0

Files changed (82)
  1. checksums.yaml +7 -0
  2. data/HISTORY.md +78 -0
  3. data/LICENSE +27 -0
  4. data/README.md +103 -0
  5. data/lib/anystyle.rb +71 -0
  6. data/lib/anystyle/dictionary.rb +132 -0
  7. data/lib/anystyle/dictionary/gdbm.rb +52 -0
  8. data/lib/anystyle/dictionary/lmdb.rb +67 -0
  9. data/lib/anystyle/dictionary/marshal.rb +27 -0
  10. data/lib/anystyle/dictionary/redis.rb +55 -0
  11. data/lib/anystyle/document.rb +264 -0
  12. data/lib/anystyle/errors.rb +14 -0
  13. data/lib/anystyle/feature.rb +27 -0
  14. data/lib/anystyle/feature/affix.rb +43 -0
  15. data/lib/anystyle/feature/brackets.rb +32 -0
  16. data/lib/anystyle/feature/canonical.rb +13 -0
  17. data/lib/anystyle/feature/caps.rb +20 -0
  18. data/lib/anystyle/feature/category.rb +70 -0
  19. data/lib/anystyle/feature/dictionary.rb +16 -0
  20. data/lib/anystyle/feature/indent.rb +16 -0
  21. data/lib/anystyle/feature/keyword.rb +52 -0
  22. data/lib/anystyle/feature/line.rb +39 -0
  23. data/lib/anystyle/feature/locator.rb +18 -0
  24. data/lib/anystyle/feature/number.rb +39 -0
  25. data/lib/anystyle/feature/position.rb +28 -0
  26. data/lib/anystyle/feature/punctuation.rb +22 -0
  27. data/lib/anystyle/feature/quotes.rb +20 -0
  28. data/lib/anystyle/feature/ref.rb +21 -0
  29. data/lib/anystyle/feature/terminal.rb +19 -0
  30. data/lib/anystyle/feature/words.rb +74 -0
  31. data/lib/anystyle/finder.rb +94 -0
  32. data/lib/anystyle/format/bibtex.rb +63 -0
  33. data/lib/anystyle/format/csl.rb +28 -0
  34. data/lib/anystyle/normalizer.rb +65 -0
  35. data/lib/anystyle/normalizer/brackets.rb +13 -0
  36. data/lib/anystyle/normalizer/container.rb +13 -0
  37. data/lib/anystyle/normalizer/date.rb +109 -0
  38. data/lib/anystyle/normalizer/edition.rb +16 -0
  39. data/lib/anystyle/normalizer/journal.rb +14 -0
  40. data/lib/anystyle/normalizer/locale.rb +30 -0
  41. data/lib/anystyle/normalizer/location.rb +24 -0
  42. data/lib/anystyle/normalizer/locator.rb +22 -0
  43. data/lib/anystyle/normalizer/names.rb +88 -0
  44. data/lib/anystyle/normalizer/page.rb +29 -0
  45. data/lib/anystyle/normalizer/publisher.rb +18 -0
  46. data/lib/anystyle/normalizer/pubmed.rb +18 -0
  47. data/lib/anystyle/normalizer/punctuation.rb +23 -0
  48. data/lib/anystyle/normalizer/quotes.rb +14 -0
  49. data/lib/anystyle/normalizer/type.rb +54 -0
  50. data/lib/anystyle/normalizer/volume.rb +26 -0
  51. data/lib/anystyle/parser.rb +199 -0
  52. data/lib/anystyle/support.rb +4 -0
  53. data/lib/anystyle/support/finder.mod +3234 -0
  54. data/lib/anystyle/support/finder.txt +75 -0
  55. data/lib/anystyle/support/parser.mod +15025 -0
  56. data/lib/anystyle/support/parser.txt +75 -0
  57. data/lib/anystyle/utils.rb +70 -0
  58. data/lib/anystyle/version.rb +3 -0
  59. data/res/finder/bb132pr2055.ttx +6803 -0
  60. data/res/finder/bb550sh8053.ttx +18660 -0
  61. data/res/finder/bb599nz4341.ttx +2957 -0
  62. data/res/finder/bb725rt6501.ttx +15276 -0
  63. data/res/finder/bc605xz1554.ttx +18815 -0
  64. data/res/finder/bd040gx5718.ttx +4271 -0
  65. data/res/finder/bd413nt2715.ttx +4956 -0
  66. data/res/finder/bd466fq0394.ttx +6100 -0
  67. data/res/finder/bf668vw2021.ttx +3578 -0
  68. data/res/finder/bg495cx0468.ttx +7267 -0
  69. data/res/finder/bg599vt3743.ttx +6752 -0
  70. data/res/finder/bg608dx2253.ttx +4094 -0
  71. data/res/finder/bh410qk3771.ttx +8785 -0
  72. data/res/finder/bh989ww6442.ttx +17204 -0
  73. data/res/finder/bj581pc8202.ttx +2719 -0
  74. data/res/parser/bad.xml +5199 -0
  75. data/res/parser/core.xml +7924 -0
  76. data/res/parser/gold.xml +2707 -0
  77. data/res/parser/good.xml +34281 -0
  78. data/res/parser/stanford-books.xml +2280 -0
  79. data/res/parser/stanford-diss.xml +726 -0
  80. data/res/parser/stanford-theses.xml +4684 -0
  81. data/res/parser/ugly.xml +33246 -0
  82. metadata +195 -0
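Most of the added lines are training resources under data/res. The .ttx files in data/res/finder are line-labelled documents used to train the finder model, and the 3,578-line hunk below is one of them (data/res/finder/bb132pr2055.ttx's sibling bf668vw2021.ttx, a Stanford dissertation). Each line carries a label such as title, text, meta, or blank, followed by a pipe and the original text; a line with an empty label continues the previous label. As a rough illustration only (this is not the gem's own loader, and parse_ttx is a made-up name), such a file could be read like this:

```ruby
# Minimal sketch: group the "label | text" lines of a .ttx file into labelled pairs.
# Assumes the raw file on disk, i.e. without the leading "+" added by the diff view.
def parse_ttx(path)
  label = nil
  File.foreach(path).map do |line|
    prefix, text = line.chomp.split('|', 2)
    next if text.nil?                                 # skip lines without a separator
    label = prefix.strip unless prefix.strip.empty?   # blank prefix continues the label
    [label, text.lstrip]
  end.compact
end

# parse_ttx('data/res/finder/bf668vw2021.ttx')
#   .group_by(&:first)
#   .transform_values(&:size)   #=> counts per label, e.g. "title", "text", "meta", "blank"
```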
@@ -0,0 +1,3578 @@
1
+ title | A LIGHT-WEIGHT 3-D INDOOR ACQUISITION SYSTEM
2
+ | USING AN RGB-D CAMERA
3
+ blank |
4
+ |
5
+ |
6
+ |
7
+ title | A DISSERTATION
8
+ | SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
9
+ | ENGINEERING
10
+ | AND THE COMMITTEE ON GRADUATE STUDIES
11
+ | OF STANFORD UNIVERSITY
12
+ | IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
13
+ | FOR THE DEGREE OF
14
+ | DOCTOR OF PHILOSOPHY
15
+ blank |
16
+ |
17
+ |
18
+ |
19
+ text | Young Min Kim
20
+ | August 2013
21
+ | © 2013 by Young Min Kim. All Rights Reserved.
22
+ | Re-distributed by Stanford University under license with the author.
23
+ blank |
24
+ |
25
+ |
26
+ text | This work is licensed under a Creative Commons Attribution-
27
+ | Noncommercial 3.0 United States License.
28
+ | http://creativecommons.org/licenses/by-nc/3.0/us/
29
+ blank |
30
+ |
31
+ |
32
+ |
33
+ text | This dissertation is online at: http://purl.stanford.edu/bf668vw2021
34
+ blank |
35
+ text | Includes supplemental files:
36
+ | 1. Video for Chapter 4 (video_final_medium3.wmv)
37
+ | 2. Video for Chapter 2 (Reconstruct.mpg)
38
+ blank |
39
+ |
40
+ |
41
+ |
42
+ meta | ii
43
+ text | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
44
+ | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
45
+ blank |
46
+ text | Leonidas Guibas, Primary Adviser
47
+ blank |
48
+ |
49
+ |
50
+ text | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
51
+ | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
52
+ blank |
53
+ text | Bernd Girod
54
+ blank |
55
+ |
56
+ |
57
+ text | I certify that I have read this dissertation and that, in my opinion, it is fully adequate
58
+ | in scope and quality as a dissertation for the degree of Doctor of Philosophy.
59
+ blank |
60
+ text | Sebastian Thrun
61
+ blank |
62
+ |
63
+ |
64
+ |
65
+ text | Approved for the Stanford University Committee on Graduate Studies.
66
+ | Patricia J. Gumport, Vice Provost for Graduate Education
67
+ blank |
68
+ |
69
+ |
70
+ |
71
+ text | This signature page was generated electronically upon submission of this dissertation in
72
+ | electronic format. An original signed hard copy of the signature page is on file in
73
+ | University Archives.
74
+ blank |
75
+ |
76
+ |
77
+ |
78
+ meta | iii
79
+ title | Abstract
80
+ blank |
81
+ text | Large-scale acquisition of exterior urban environments is by now a well-established
82
+ | technology, supporting many applications in map searching, navigation, and com-
83
+ | merce. The same is, however, not the case for indoor environments, where access is
84
+ | often restricted and the spaces can be cluttered. Recent advances in real-time 3D
85
+ | acquisition devices (e.g., Microsoft Kinect) enable everyday users to scan complex
86
+ | indoor environments at a video rate. Raw scans, however, are often noisy, incom-
87
+ | plete, and significantly corrupted, making semantic scene understanding difficult, if
88
+ | not impossible. In this dissertation, we present ways of utilizing prior information
89
+ | to semantically understand the environments from the noisy scans of real-time 3-D
90
+ | sensors. The presented pipelines are lightweight and have the potential to allow
91
+ | users to provide feedback at interactive rates.
92
+ | We first present a hand-held system for real-time, interactive acquisition of res-
93
+ | idential floor plans. The system integrates a commodity range camera, a micro-
94
+ | projector, and a button interface for user input and allows the user to freely move
95
+ | through a building to capture its important architectural elements. The system uses
96
+ | the Manhattan world assumption, which posits that wall layouts are rectilinear. This
97
+ | assumption allows generation of floor plans in real time, enabling the operator to
98
+ | interactively guide the reconstruction process and to resolve structural ambiguities
99
+ | and errors during the acquisition. The interactive component aids users with no ar-
100
+ | chitectural training in acquiring wall layouts for their residences. We show a number
101
+ | of residential floor plans reconstructed with the system.
102
+ | We then discuss how we exploit the fact that public environments typically contain
103
+ | a high density of repeated objects (e.g., tables, chairs, monitors, etc.) in regular or
104
+ blank |
105
+ |
106
+ meta | iv
107
+ text | non-regular arrangements with significant pose variations and articulations. We use
108
+ | the special structure of indoor environments to accelerate their 3D acquisition and
109
+ | recognition. Our approach consists of two phases: (i) a learning phase wherein we
110
+ | acquire 3D models of frequently occurring objects and capture their variability modes
111
+ | from only a few scans, and (ii) a recognition phase wherein from a single scan of a
112
+ | new area, we identify previously seen objects but in different poses and locations at
113
+ | an average recognition time of 200ms/model. We evaluate the robustness and limits
114
+ | of the proposed recognition system using a range of synthetic and real-world scans
115
+ | under challenging settings.
116
+ | Last, we present a guided real-time scanning setup, wherein the incoming 3D
117
+ | data stream is continuously analyzed, and the data quality is automatically assessed.
118
+ | While the user is scanning an object, the proposed system discovers and highlights
119
+ | the missing parts, thus guiding the operator (or the autonomous robot) to “where
120
+ | to scan next”. We assess the data quality and completeness of the 3D scan data
121
+ | by comparing to a large collection of commonly occurring indoor man-made objects
122
+ | using an efficient, robust, and effective scan descriptor. We have tested the system
123
+ | on a large number of simulated and real setups, and found the guided interface to be
124
+ | effective even in cluttered and complex indoor environments. Overall, the research
125
+ | presented in the dissertation discusses how low-quality 3-D scans can be effectively
126
+ | used to understand indoor environments and allow necessary user-interaction in real-
127
+ | time. The presented pipelines are designed to be quick and effective by utilizing
128
+ | different geometric priors depending on the target applications.
129
+ blank |
130
+ |
131
+ |
132
+ |
133
+ meta | v
134
+ title | Acknowledgements
135
+ blank |
136
+ text | All the work presented in this thesis would not have been possible without help from
137
+ | many people.
138
+ | First of all, I would like to express my sincerest gratitude to my advisor, Leonidas
139
+ | Guibas. He is not only an intelligent and inspiring scholar in amazingly diverse
140
+ | topics, but also a very caring advisor with deep insights into various aspects of life.
141
+ | He guided me through one of the toughest times of my life, and I am lucky to be one
142
+ | of his students.
143
+ | During my life at Stanford, I had the privilege of working with the smartest people
144
+ | in the world learning not only about research, but also about the different mind-sets
145
+ | that lead to successful careers. I would like to thank Bernd Girod, Christian Theobalt,
146
+ | Sebastian Thrun, Vladlen Koltun, Niloy Mitra, Saumitra Das, Stephen Gould, and
147
+ | Adrian Butscher for being mentors during different stages of my graduate career. I
148
+ | also appreciate help of wonderful collaborators on exciting projects: Jana Kosecka,
149
+ | Branislav Miscusik, James Diebel, Mike Sokolsky, Jen Dolson, Dongming Yan, and
150
+ | Qixing Huang.
151
+ | The work presented here was generously supported by the following funding
152
+ | sources: Samsung Scholarship, MPC-VCC, Qualcomm corporation.
153
+ | I adore my officemates for being cheerful and encouraging, and most of all, being
154
+ | there: Derek Chan, Rahul Biswas, Stephanie Lefevre, Qixing Huang, Jonathan Jiang,
155
+ | Art Tevs, Michael Kerber, Justin Solomon, Jonathan Huang, Fan Wang, Daniel Chen,
156
+ | Kyle Heath, Vangelis Kalogerakis, and Sharath Kumar Raghvendra. I often spent
157
+ | more time with them than with any other people.
158
+ | I have to thank all the friends I met at Stanford. In particular, I would like to
159
+ blank |
160
+ |
161
+ meta | vi
162
+ text | thank Stephanie Kwan, Karen Zhu, Landry Huet, and Yiting Yeh for fun hangouts
163
+ | and random conversations in my early years. I was also fortunate enough to meet a
164
+ | wonderful chamber music group led by Dr. Herbert Myers in which I could play early
165
+ | music with Michael Peterson and Lisa Silverman. I am also grateful for being able to
166
+ | participate in a wonderful WISE (Women in Science and Engineering) group. WISE
167
+ | girls have always been smart, tender and supportive. Many Korean friends at Stanford
168
+ | were like family for me here. I will not attempt to name them all, but I would like to
169
+ | especially thank Jeongha Park, Soogine Chong, Sun-Hae Hong, Jenny Lee, Ga-Young
170
+ | Suh, Joyce Lee, Hyeji Kim, Sun Goo Lee, Wookyung Kim, Han Ho Song and Su-In
171
+ | Lee. While I was enjoying my life at Stanford, I was always connected to my friends
172
+ | in Korea. I would like to express my thanks for their trust and everlasting friendship.
173
+ | Last, I cannot thank my family enough. I would like to dedicate my thesis to my
174
+ | parents, Kwang Woo Kim and Mi Ja Lee. Their constant love and trust have helped
175
+ | me overcome hardships ever since I was born. I also enjoyed having my brother, Joo
176
+ | Hwan Kim, in the Bay Area. His passion and thoughtful advice always helped me
177
+ | and cheered me up. I thank my husband, Sung-Boem Park, for being by my side no
178
+ | matter what happened. He is my best friend, and he made me face and overcome
179
+ | challenges. I also need to thank my soon-to-be born son (due in August), for allowing
180
+ | me to accelerate the last stages of my Ph. D.
181
+ | Thank you all for making me who I am today.
182
+ blank |
183
+ |
184
+ |
185
+ |
186
+ meta | vii
187
+ title | Contents
188
+ blank |
189
+ text | Abstract iv
190
+ blank |
191
+ text | Acknowledgements vi
192
+ blank |
193
+ text | 1 Introduction 1
194
+ | 1.1 Background on RGB-D Cameras . . . . . . . . . . . . . . . . . . . . 3
195
+ | 1.1.1 Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
196
+ | 1.1.2 Noise Characteristics . . . . . . . . . . . . . . . . . . . . . . . 5
197
+ | 1.2 3-D Indoor Acquisition System . . . . . . . . . . . . . . . . . . . . . 6
198
+ | 1.3 Outline of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . 7
199
+ | 1.3.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
200
+ blank |
201
+ text | 2 Interactive Acquisition of Residential Floor Plans1 11
202
+ | 2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
203
+ | 2.2 System Overview and Usage . . . . . . . . . . . . . . . . . . . . . . . 14
204
+ | 2.3 Data Acquisition Process . . . . . . . . . . . . . . . . . . . . . . . . . 16
205
+ | 2.3.1 Pair-Wise Registration . . . . . . . . . . . . . . . . . . . . . . 19
206
+ | 2.3.2 Plane Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 22
207
+ | 2.3.3 Global Adjustment . . . . . . . . . . . . . . . . . . . . . . . . 23
208
+ | 2.3.4 Map Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
209
+ | 2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
210
+ | 2.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 29
211
+ blank |
212
+ |
213
+ |
214
+ |
215
+ meta | viii
216
+ text | 3 Environments with Variability and Repetition 33
217
+ | 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
218
+ | 3.1.1 Scanning Technology . . . . . . . . . . . . . . . . . . . . . . . 35
219
+ | 3.1.2 Geometric Priors for Objects . . . . . . . . . . . . . . . . . . . 35
220
+ | 3.1.3 Scene Understanding . . . . . . . . . . . . . . . . . . . . . . . 36
221
+ | 3.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
222
+ | 3.2.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
223
+ | 3.2.2 Hierarchical Structure . . . . . . . . . . . . . . . . . . . . . . 40
224
+ | 3.3 Learning Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
225
+ | 3.3.1 Initializing the Skeleton of the Model . . . . . . . . . . . . . . 43
226
+ | 3.3.2 Incrementally Completing a Coherent Model . . . . . . . . . . 45
227
+ | 3.4 Recognition Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
228
+ | 3.4.1 Initial Assignment for Parts . . . . . . . . . . . . . . . . . . . 47
229
+ | 3.4.2 Refined Assignment with Geometry . . . . . . . . . . . . . . . 49
230
+ | 3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
231
+ | 3.5.1 Synthetic Scenes . . . . . . . . . . . . . . . . . . . . . . . . . 51
232
+ | 3.5.2 Real-World Scenes . . . . . . . . . . . . . . . . . . . . . . . . 54
233
+ | 3.5.3 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
234
+ | 3.5.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
235
+ | 3.5.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
236
+ | 3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
237
+ blank |
238
+ text | 4 Guided Real-Time Scanning 64
239
+ | 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
240
+ | 4.1.1 Interactive Acquisition . . . . . . . . . . . . . . . . . . . . . . 67
241
+ | 4.1.2 Scan Completion . . . . . . . . . . . . . . . . . . . . . . . . . 67
242
+ | 4.1.3 Part-Based Modeling . . . . . . . . . . . . . . . . . . . . . . . 67
243
+ | 4.1.4 Template-Based Completion . . . . . . . . . . . . . . . . . . . 68
244
+ | 4.1.5 Shape Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . 68
245
+ | 4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
246
+ | 4.2.1 Scan Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . 70
247
+ blank |
248
+ |
249
+ meta | ix
250
+ text | 4.2.2 Shape Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 70
251
+ | 4.2.3 Scan Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 71
252
+ | 4.3 Partial Shape Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 71
253
+ | 4.3.1 View-Dependent Simulated Scans . . . . . . . . . . . . . . . . 72
254
+ | 4.3.2 A2h Scan Descriptor . . . . . . . . . . . . . . . . . . . . . . . 73
255
+ | 4.3.3 Descriptor-Based Shape Matching . . . . . . . . . . . . . . . . 74
256
+ | 4.3.4 Scan Registration . . . . . . . . . . . . . . . . . . . . . . . . . 75
257
+ | 4.4 Interface Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
258
+ | 4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
259
+ | 4.5.1 Model Database . . . . . . . . . . . . . . . . . . . . . . . . . . 76
260
+ | 4.5.2 Retrieval Results with Simulated Data . . . . . . . . . . . . . 77
261
+ | 4.5.3 Retrieval Results with Real Data . . . . . . . . . . . . . . . . 78
262
+ | 4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
263
+ blank |
264
+ text | 5 Conclusions 89
265
+ blank |
266
+ text | Bibliography 91
267
+ blank |
268
+ |
269
+ |
270
+ |
271
+ meta | x
272
+ title | List of Tables
273
+ blank |
274
+ text | 2.1 Accuracy comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . 29
275
+ blank |
276
+ text | 3.1 Parameters used in our algorithm . . . . . . . . . . . . . . . . . . . . 41
277
+ | 3.2 Models obtained from the learning phase . . . . . . . . . . . . . . . . 55
278
+ | 3.3 Statistics for the recognition phase . . . . . . . . . . . . . . . . . . . 56
279
+ | 3.4 Statistics between objects learned for each scene category . . . . . . . 59
280
+ blank |
281
+ text | 4.1 Database and scan statistics . . . . . . . . . . . . . . . . . . . . . . . 76
282
+ blank |
283
+ |
284
+ |
285
+ |
286
+ meta | xi
287
+ title | List of Figures
288
+ blank |
289
+ text | 1.1 Triangulation principle . . . . . . . . . . . . . . . . . . . . . . . . . . 4
290
+ | 1.2 Kinect sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
291
+ blank |
292
+ text | 2.1 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
293
+ | 2.2 System pipeline and usage . . . . . . . . . . . . . . . . . . . . . . . . 15
294
+ | 2.3 Notation and representation . . . . . . . . . . . . . . . . . . . . . . . 17
295
+ | 2.4 Illustration for pair-wise registration . . . . . . . . . . . . . . . . . . 19
296
+ | 2.5 Optical flow and image plane correspondence . . . . . . . . . . . . . . 20
297
+ | 2.6 Silhouette points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
298
+ | 2.7 Optimizing the map . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
299
+ | 2.8 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
300
+ | 2.9 Analysis on computational time . . . . . . . . . . . . . . . . . . . . . 27
301
+ | 2.10 Visual comparisons of the generated floor plans . . . . . . . . . . . . 31
302
+ | 2.11 A possible example of extensions . . . . . . . . . . . . . . . . . . . . 32
303
+ blank |
304
+ text | 3.1 System overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
305
+ | 3.2 Acquisition pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
306
+ | 3.3 Hierarchical data structure. . . . . . . . . . . . . . . . . . . . . . . . 39
307
+ | 3.4 Overview of the learning phase . . . . . . . . . . . . . . . . . . . . . 42
308
+ | 3.5 Attachment of the model . . . . . . . . . . . . . . . . . . . . . . . . . 46
309
+ | 3.6 Overview of the recognition phase . . . . . . . . . . . . . . . . . . . . 47
310
+ | 3.7 Refining the segmentation . . . . . . . . . . . . . . . . . . . . . . . . 50
311
+ | 3.8 Recognition results on synthetic scans of virtual scenes . . . . . . . . 52
312
+ | 3.9 Chair models used in synthetic scenes . . . . . . . . . . . . . . . . . . 53
313
+ blank |
314
+ |
315
+ meta | xii
316
+ text | 3.10 Precision-recall curve . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
317
+ | 3.11 Various models learned/used in our test . . . . . . . . . . . . . . . . 55
318
+ | 3.12 Recognition results for various office and auditorium scenes . . . . . . 61
319
+ | 3.13 A close-up office scene . . . . . . . . . . . . . . . . . . . . . . . . . . 62
320
+ | 3.14 Comparison with an indoor labeling system . . . . . . . . . . . . . . 63
321
+ blank |
322
+ text | 4.1 A real-time guided scanning system . . . . . . . . . . . . . . . . . . . 65
323
+ | 4.2 Pipeline of the real-time guided scanning framework . . . . . . . . . . 69
324
+ | 4.3 Representative shape retrieval results . . . . . . . . . . . . . . . . . . 80
325
+ | 4.4 The proposed guided real-time scanning setup . . . . . . . . . . . . . 81
326
+ | 4.5 Retrieval results with simulated data using a chair data set . . . . . . 82
327
+ | 4.6 Retrieval results with simulated data using a couch data set . . . . . 83
328
+ | 4.7 Retrieval results with simulated data using a lamp data set . . . . . . 84
329
+ | 4.8 Retrieval results with simulated data using a table data set . . . . . . 85
330
+ | 4.9 Comparison between retrieval with view-dependent and merged scans 86
331
+ | 4.10 Effect of density-aware sampling . . . . . . . . . . . . . . . . . . . . . 87
332
+ | 4.11 Effect of noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
333
+ | 4.12 Real-time retrieval results on various datasets . . . . . . . . . . . . . 88
334
+ blank |
335
+ |
336
+ |
337
+ |
338
+ meta | xiii
339
+ | Chapter 1
340
+ blank |
341
+ title | Introduction
342
+ blank |
343
+ text | Acquiring a 3-D model of a real-world object, also known as 3-D reconstruction
344
+ | technology, has long been a challenge for various applications, including robotics
345
+ | navigation, 3-D modeling of virtual worlds, augmented reality, computer graphics,
346
+ | and manufacturing. In the graphics community, a 3-D model is typically acquired in a
347
+ | carefully calibrated set-up with highly accurate laser scans, followed by a complicated
348
+ | off-line process from scan registration to surface reconstruction. Because this is a very
349
+ | long process that requires special equipment, only a limited number of objects can be
350
+ | modeled, and the method cannot be scaled to larger environments.
351
+ | One of the most common applications of a large-scale 3-D reconstruction comes
352
+ | from modeling of urban environments. To build a model, a vehicle equipped with
353
+ | different sensors drives along roads and collects a large amount of data from lasers,
354
+ | GPS signals, wheel counters, cameras, etc. The data is then processed and stored in a
355
+ | compact form which includes important roads, buildings, and parking lots. The mapped
356
+ | environments are used frequently in cell-phone applications, mapping technology or
357
+ | navigation tools.
358
+ | However, we cannot simply extend the same technology used in the 3-D reconstruc-
359
+ | tion of urban environments to indoor environments. First, unlike urban environments,
360
+ | where permanent roads exist, there are no clearly defined pathways that people must
361
+ | follow in an indoor environment. Occupants walk in various patterns around an in-
362
+ | door area, and often the space is cluttered, which could result in safety issues if, say,
363
+ blank |
364
+ |
365
+ meta | 1
366
+ | CHAPTER 1. INTRODUCTION 2
367
+ blank |
368
+ |
369
+ |
370
+ text | a robot with sensors drives within the area. Second, an indoor environment is not
371
+ | static. As residents and workers of the building engage in daily activities in interior
372
+ | environments, many objects are moved around or disappear, and new objects can be
373
+ | introduced. Third, interior shapes are much more complex compared to the outdoor
374
+ | surfaces of buildings, and it cannot simply be assumed that the objects present in a
375
+ | space are composed of flat surfaces as is generally the case in outdoor urban settings.
376
+ | Last, the modality of sensors used for outdoor mapping is not suitable for interior
377
+ | mapping and needs to be changed. A GPS signal does not work in indoor environ-
378
+ | ments, and the lighting conditions can vary significantly from one space to another
379
+ | compared to relatively constant sunlight outdoors.
380
+ | Yet, 3-D reconstruction of indoor environments also has a variety of potential
381
+ | applications. After a 3-D model of an indoor environment is acquired, the model
382
+ | could be used for interior design, indoor navigation, surveillance, or understanding
383
+ | the interior layouts and existence of objects in a space. Depending on the applications
384
+ | for which the reconstructed model would be used, the distance range and level of detail
385
+ | needed can vary as well.
386
+ | Recently, real-time 3-D sensors, such as RGB-D sensors (light-weight com-
387
+ | modity devices), have been specifically designed to function in indoor environments and
388
+ | used to provide real-time 3-D data. Although the data captured from these sensors
389
+ | suffers from a limited field of view and complex noise characteristics, and therefore
390
+ | might not be suitable for accurate 3-D reconstruction, it can be used by everyday
391
+ | users to easily capture and utilize 3-D information of indoor environments. The work
392
+ | presented in this dissertation uses the data captured from RGB-D cameras with the
393
+ | goal of providing a useful 3-D acquisition while overcoming the limitations of the
394
+ | captured data. To do this, we have assumed different geometric priors depending on
395
+ | the targeted applications.
396
+ | In the remainder of this chapter, we first describe the characteristics of RGB-
397
+ | D camera sensors (Section 1.1). The subsequent section (Section 1.2) presents our
398
+ | approach to acquire 3-D indoor environments. The chapter concludes with an outline
399
+ | of the remainder of the dissertation (Section 1.3).
400
+ meta | CHAPTER 1. INTRODUCTION 3
401
+ blank |
402
+ |
403
+ |
404
+ title | 1.1 Background on RGB-D Cameras
405
+ text | Building a 3-D model of actual objects enables the real world to be connected to a
406
+ | virtual world. After obtaining a digital model from a real-world object, the model can
407
+ | be used in various applications. A benefit of 3D modeling is that the digital object
408
+ | can be saved and altered freely without an actual space being damaged or destroyed.
409
+ | Until recently, it was not possible for non-expert users to capture real-world envi-
410
+ | ronments in 3D because of the complexity and cost of the required equipment. RGB-D
411
+ | cameras, which provide real-time depth and color information, only became available
412
+ | a few years ago. The pioneering commodity product is the X-Box Kinect [Mic10],
413
+ | launched in November 2010. Originally developed as a gaming device, the sensor pro-
414
+ | vides real-time depth streams enabling interaction between a user and a system.
415
+ | The Kinect is affordable and easy to operate for non-expert users, and the pro-
416
+ | duced data can be accessed through open-source drivers. Although the main purpose
417
+ | of the Kinect by far was motion-sensing, thus providing a real-time interface for gam-
418
+ | ing or control, the device has served many purposes and has been used as a tool to
419
+ | develop personalized applications with the help of the drivers. Some developers also
420
+ | use the device to extend computer vision-related tasks (such as object recognition
421
+ | or structure from motion) but with depth measurements augmented as an additional
422
+ | modality of input. In addition, the device can also be viewed as a 3-D sensor that
423
+ | produces 3-D pointcloud data. In our work, this is how we view the device, and the
424
+ | goal of the research presented here, as noted above, was to acquire 3-D indoor objects
425
+ | or environments using the RGB-D cameras of the Kinect sensor.
426
+ blank |
427
+ |
428
+ title | 1.1.1 Technology
429
+ text | The underlying core technology of the depth-capturing capacity of Kinect comes
430
+ | from its structured-light 3D scanner. This scanner measures the three-dimensional
431
+ | shape of an object using projected light patterns and a camera system. A typical
432
+ | scanner measuring assembly consists of one stripe projector and at least one camera.
433
+ | Projecting a narrow band of light onto a three-dimensionally shaped surface produces
434
+ | a line of illumination that appears distorted from perspectives other than that of the
435
+ meta | CHAPTER 1. INTRODUCTION 4
436
+ blank |
437
+ |
438
+ |
439
+ |
440
+ text | Figure 1.1: Triangulation principle shown by one of multiple stripes (image from
441
+ | http://en.wikipedia.org/wiki/File:1-stripesx7.svg)
442
+ blank |
443
+ text | projector, and this line can be used for an exact geometric reconstruction of the
444
+ | surface shape. A sample setup with the projected line pattern is shown in Figure 1.1.
445
+ | The displacement of the stripes can be converted into 3D coordinates, which allow
446
+ | any details on an object’s surface to be retrieved.
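To make the triangulation idea concrete: for a simplified, rectified projector-camera pair the geometry reduces to the familiar relation z = f·b/d, where f is the focal length in pixels, b is the projector-camera baseline, and d is the observed displacement (disparity) of the pattern in pixels. A toy sketch (the numbers are purely illustrative, not Kinect calibration values):

```ruby
# Toy triangulation: depth from the observed pattern displacement, assuming a
# simplified rectified projector-camera pair. Illustrative numbers only.
def depth_from_disparity(f_px, baseline_m, disparity_px)
  return Float::INFINITY if disparity_px.zero?
  f_px * baseline_m / disparity_px
end

# depth_from_disparity(580.0, 0.075, 21.75)  #=> ~2.0 m
```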
447
+ | An invisible structured-light scanner scans a 3-D shape of an object by projecting
448
+ | patterns with light in an invisible spectrum. The Kinect projects patterns
449
+ | composed of points in infrared (IR) light to generate video data in 3D. As shown in
450
+ | Figure 1.2, the Kinect is a horizontal bar with an IR light emitter and IR sensor. The
451
+ | IR emitter emits infrared light beams, and the IR sensor reads the IR beams reflected
452
+ | back to the sensor. The reflected beams are converted into depth information that
453
+ | measures the distance between an object and the sensor. This makes capturing a
454
+ | depth image possible. The color sensor captures normal video (visible light) that is
455
+ | synchronized with the depth data. The horizontal bar of the Kinect also contains
456
+ | microphone arrays and is connected to a small base by a tilt motor. While the color
457
+ | video and microphone provide additional means for a natural user interface, in this
458
+ meta | CHAPTER 1. INTRODUCTION 5
459
+ blank |
460
+ |
461
+ |
462
+ |
463
+ text | Figure 1.2: Kinect sensor (left) and illustration of the integrated hardware (right).
464
+ | (images from http://i.msdn.microsoft.com/dynimg/IC568992.png and http://
465
+ | i.msdn.microsoft.com/dynimg/IC584396.png)
466
+ blank |
467
+ text | dissertation, we are focused on the depth-sensing capability of the device.
468
+ | The Kinect has a limited working range, mainly designed for the volume that a
469
+ | person will require while playing a game. Kinect’s official documentation1 suggests
470
+ | a working range from 0.8 m to 4 m from the sensor. The sensor has an angular field
471
+ | of view of 57° horizontally and 43° vertically. When an object is out of range for
472
+ | a particular pixel, the system will return no values. The RGB video streams are
473
+ | produced in a 1280×960 resolution. However, the default RGB video stream uses 8-
474
+ | bit VGA resolution (640×480 pixels). The monochrome depth sensing video stream
475
+ | is also in VGA resolution with 11-bit depth, which provides 2,048 levels of sensitivity.
476
+ | The depth and color streams are produced at a frame rate of 30 Hz.
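A quick consequence of the stated numbers, useful when planning a capture path: with a 57°×43° field of view, the footprint visible at distance z is roughly 2·z·tan(57°/2) wide and 2·z·tan(43°/2) high. This is a back-of-the-envelope sketch, not a published specification:

```ruby
# Visible footprint (width, height in metres) at distance z_m for the stated FOV.
def fov_footprint(z_m, h_deg = 57.0, v_deg = 43.0)
  to_rad = Math::PI / 180.0
  [2 * z_m * Math.tan(h_deg / 2 * to_rad),
   2 * z_m * Math.tan(v_deg / 2 * to_rad)]
end

# fov_footprint(0.8)  #=> ~[0.87, 0.63] m at the near end of the working range
# fov_footprint(4.0)  #=> ~[4.34, 3.15] m at the far end
```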
477
+ | The depth data is originally produced as a 2-D grid of raw depth values. The
478
+ | values in each pixel can then be converted into (x, y, z) coordinates with calibration
479
+ | data. Depending on the application, the developer can regard the 2-D grid of values
480
+ | as a depth image, or the scattered points in 3-D ((x, y, z) coordinates) as unstructured
481
+ | pointcloud data.
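The depth-image-to-pointcloud conversion mentioned here is the standard pinhole back-projection. A minimal sketch, assuming a 640×480 depth image and placeholder intrinsics (FX, FY, CX, CY are illustrative values, not the calibration of any particular unit):

```ruby
FX = FY = 575.8           # assumed focal lengths in pixels
CX, CY = 319.5, 239.5     # assumed principal point for a 640x480 image

# Convert the depth value (in metres) at pixel (u, v) into an (x, y, z) point.
def backproject(u, v, depth_m)
  return nil if depth_m.nil? || depth_m <= 0.0   # out-of-range pixels carry no value
  [(u - CX) * depth_m / FX, (v - CY) * depth_m / FY, depth_m]
end

# backproject(320, 240, 2.0)  #=> a point roughly 2 m straight ahead of the sensor
```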
482
+ blank |
483
+ |
484
+ title | 1.1.2 Noise Characteristics
485
+ text | While RGB-D cameras can provide real-time depth information, the obtained mea-
486
+ | surements exhibit convoluted noise characteristics. The measurements are extracted
487
+ meta | 1
488
+ text | http://msdn.microsoft.com/en-us/library/jj131033.aspx
489
+ meta | CHAPTER 1. INTRODUCTION 6
490
+ blank |
491
+ |
492
+ |
493
+ text | from identification of corresponding points of infrared projections in image pixels,
494
+ | and there are multiple possible sources of errors: (i) calibration error both of the
495
+ | extrinsic calibration parameters, which are given as the displacement between the
496
+ | projector and cameras, and the intrinsic calibration parameters, which depend on
497
+ | the focal points and size of pixels on the sensor grid, vary for each product; (ii)
498
+ | distance-dependent quantization error – because the accuracy of measurements de-
499
+ | pends on the resolution of a pixel compared to the details of projected pattern on
500
+ | the measured object, measurements are more noisy for farther points with more se-
501
+ | vere quantization artifacts; (iii) error from ambiguous or poor projection, in which
502
+ | the cameras cannot clearly observe the projected patterns – as the measurements are
503
+ | made by identifying the projected location of the infrared pattern, the distortion of
504
+ | the projected patterns on depth boundaries or on reflective material can result in
505
+ | wrong measurements. Sometimes the system cannot locate the corresponding points
506
+ | due to occlusion by parallax or the distance range, and the data is reported as missing.
507
+ | In short, the depth data exhibits highly non-linear noise characteristics, and it is very
508
+ | hard to model all of the noise analytically.
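The distance-dependent quantization in (ii) can be sketched from the triangulation relation itself: differentiating z = f·b/d with respect to d gives |Δz| ≈ z²/(f·b) per disparity step, so depth resolution degrades roughly quadratically with distance. The constants below are illustrative, and real devices interpolate disparity to sub-pixel precision, which shrinks the step:

```ruby
# Approximate depth change caused by a one-disparity-unit step at distance z_m.
def depth_step(z_m, f_px = 580.0, baseline_m = 0.075)
  z_m**2 / (f_px * baseline_m)
end

# depth_step(1.0)  #=> ~0.023 m per step near the sensor
# depth_step(4.0)  #=> ~0.37 m per step at the far end of the working range
```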
509
+ blank |
510
+ |
511
+ title | 1.2 3-D Indoor Acquisition System
512
+ text | Given the complex noise characteristics of RGB-D cameras, we assumed that the de-
513
+ | vice produces noisy pointcloud data. Instead of reverse-engineering and correcting the
514
+ | noise from each source, we overcame the limitation on data by imposing assumptions
515
+ | on the 3-D shape of the objects being scanned.
516
+ | There are three possible ways to reconstruct 3-D models from noisy data. The first
517
+ | is to overcome the limitation of the data by accumulating multiple frames from slightly dif-
518
+ | ferent viewpoints [IKH+ 11]. By averaging the noise measurements and merging them
519
+ | into a single volumetric structure, a very high-quality mesh model can be recovered.
520
+ | The second is to use a machine learning-based method. In this approach, multiple
521
+ | instances of measurements and actual object labels are first collected. Classifiers are
522
+ | then trained to produce the object labels given the measurements and later used to
523
+ | understand the given measurements. The third way is to assume geometric priors on
524
+ meta | CHAPTER 1. INTRODUCTION 7
525
+ blank |
526
+ |
527
+ |
528
+ text | the data being captured. Assuming that the underlying scene is not completely ran-
529
+ | dom, the shape to be reconstructed has a limited degree of freedom, and can thus be
530
+ | reconstructed by inferring the most probable shape within the scope of the assumed
531
+ | structure.
532
+ | This third way is the method used in our work. By focusing on acquiring the pre-
533
+ | defined modes or degrees of freedom given the geometric priors, the acquired model
534
+ | naturally captures high-level information about the structure. In addition, the acquisition
535
+ | pipeline becomes lightweight and the entire process can stay real-time. Because the in-
536
+ | put data stream is also real-time, there is the possibility of incorporating user interaction
537
+ | during the capturing process.
538
+ blank |
539
+ |
540
+ title | 1.3 Outline of the Dissertation
541
+ text | The chapters to follow, outlined below, discuss in detail the specific approaches we
542
+ | took to mitigate the problems inherent in indoor reconstruction from noisy sensor
543
+ | data.
544
+ | Chapter 2 discusses a pipeline used to acquire floor plans in residential areas. The
545
+ | proposed system is quick and convenient compared to the common pipeline used to
546
+ | acquire floor plans from manual sketching and measurements, which are frequently
547
+ | required for remodeling or selling a property. We posit that the world is composed of
548
+ | relatively large, flat surfaces that meet at right angles. We focus on continuous collec-
549
+ | tion of points that occupy large, flat areas and align with the axes, ignoring other
550
+ | points. Even with very noisy data, the process can be performed at an interactive
551
+ | rate since the space of possible plane arrangements is sparse given the measurements.
552
+ | We take advantage of real-time data and allow users to provide intuitive feedback
553
+ | to assist the acquisition pipeline. The research described in the chapter was first
554
+ | published as Y.M. Kim, J. Dolson, M. Sokolsky, V. Koltun, S.Thrun, Interactive
555
+ | Acquisition of Residential Floor Plans, IEEE International Conference on Robotics
556
+ | and Automation (ICRA), 2012 © 2012 IEEE, and the contents were also replicated
557
+ | with small modifications.
558
+ meta | CHAPTER 1. INTRODUCTION 8
559
+ blank |
560
+ |
561
+ |
562
+ text | Chapter 3 discusses how we targeted public spaces with many repeating ob-
563
+ | jects in different poses or variation modes. Even though indoor environments can
564
+ | frequently change, we can identify patterns and possible movements by reasoning
565
+ | at the object level. Especially in public buildings (offices, cafeterias, auditoriums, and
566
+ | seminar rooms), chairs, tables, monitors, etc, are repeatedly used in similar pat-
567
+ | terns. We first build abstract models of the objects of interest with simple geometric
568
+ | primitives and deformation modes. We then use the built models to quickly de-
569
+ | tect the objects of interest within an indoor scene in which the objects repeatedly ap-
570
+ | pear. While the models are simple approximations of the actual complex geometry, we
571
+ | demonstrate that the models are sufficient to detect the object within noisy, par-
572
+ | tial indoor scene data. The learned variability modes not only factor out nuisance
573
+ | modes of variability (e.g., motions of chairs, etc.) from meaningful changes (e.g.,
574
+ | security, where the new scene objects should be flagged), but also provide the func-
575
+ | tional modes of the object (the status of open drawers, closed laptop, etc.), which
576
+ | potentially provide high-level understanding of the scene. The study discussed here
577
+ | first appeared as a publication, Young Min Kim, Niloy J. Mitra, Dong-Ming Yan,
578
+ | and Leonidas Guibas. 2012. Acquiring 3D indoor environments with variability and
579
+ | repetition. ACM Trans. Graph. 31, 6, Article 138 (November 2012), 11 pages.
580
+ | DOI=10.1145/2366145.2366157 http://doi.acm.org/10.1145/2366145.2366157, from
581
+ | which the major written parts of the chapter were adapted.
582
+ | Chapter 4 discusses a reconstruction approach that utilizes 3-D models down-
583
+ | loaded from the web to assist in understanding the objects being scanned. The data
584
+ | stream from an RGB-D camera is noisy and exhibits a lot of missing data, making it
585
+ | very hard to accurately build a full model of an object being scanned. We take an
586
+ | approach that uses a large database of 3-D models to match against partial, noisy scans
587
+ | of the input data stream. To this end, we propose a simple, efficient, yet discrimina-
588
+ | tive descriptor that can be evaluated in real-time and used to process complex indoor
589
+ | scenes. The matching models are quickly found in the database with the help of our
590
+ | proposed shape descriptor. This also allows real-time assessment of the quality of the
591
+ | data captured, and the system provides the user with real-time feedback on where to
592
+ | scan. Eventually the user can retrieve the closest model as quickly as possible during
593
+ meta | CHAPTER 1. INTRODUCTION 9
594
+ blank |
595
+ |
596
+ |
597
+ text | the scanning session. The research and contents of the chapter will be published as
598
+ | Y.M. Kim, N. Mitra, Q. Huang, L. Guibas, Guided Real-Time Scanning of Indoor
599
+ | Environments, Pacific Graphics 2013.
600
+ | Chapter 5 concludes the dissertation with a summary of our work and a discussion
601
+ | of future directions this research could take.
602
+ blank |
603
+ |
604
+ title | 1.3.1 Contributions
605
+ text | The major contribution of the dissertation is to present methods to quickly acquire
606
+ | 3-D information from noisy, occluded pointcloud data by assuming geometric pri-
607
+ | ors. The pre-defined modes not only provide high-level understanding of the current
608
+ | mode, but also allow the data size to stay compact, which, in turn, saves memory
609
+ | and processing time. The proposed geometric priors have been previously used for
610
+ | different settings, but our approach incorporates the priors tuned for the practical
611
+ | tasks at hand with real scans from RGB-D data acquired from actual environments.
612
+ | The example geometric priors that are covered are as follows:
613
+ blank |
614
+ text | • Based on the Manhattan-world assumption, important architectural elements (walls,
615
+ | floors, and ceilings) can be retrieved in real time.
616
+ blank |
617
+ text | • By building an abstract model composed of simple geometric primitives and joint
618
+ | information between primitives, objects under severe occlusion and different
619
+ | configurations can be located. The bottom-up approach can quickly populate
620
+ | large indoor environments with variability and repetition (around 200 ms per
621
+ | object).
622
+ blank |
623
+ text | • An online public database of 3-D models recovers the structure of objects from
624
+ | partial, noisy scans in a matter of seconds. We developed a relation-based
625
+ | lightweight descriptor for fast and accurate model retrieval and positioning.
626
+ blank |
627
+ text | We also take advantage of the representation and demonstrate a quick and effi-
627
+ | cient pipeline, including user interaction when possible. More specifically, we demon-
628
+ | strate the following novel system prototypes:
630
+ meta | CHAPTER 1. INTRODUCTION 10
631
+ blank |
632
+ |
633
+ |
634
+ text | • A new hand-held system with which a user can capture a space and automatically
635
+ | generate a floor plan. The user does not have to measure distances or manually
636
+ | sketch the layout.
637
+ blank |
638
+ text | • A projector attached to the RGB-D camera to communicate the current status of
639
+ | the acquisition on the physical surface to the user, thus allowing the user to provide
640
+ | intuitive feedback.
641
+ blank |
642
+ text | • A real-time guided scanning setup for online quality assessment of streaming
643
+ | RGB-D data, obtained with the help of a 3-D database of models.
644
+ blank |
645
+ text | While the specific geometric priors and prototypes listed above come from under-
646
+ | standing of the characteristics of the task at hand, the underlying assumptions and
647
+ | approach provide a direction that allows everyday users to acquire useful 3-D information
648
+ | in the years to come as real-time 3-D scans become available.
649
+ meta | Chapter 2
650
+ blank |
651
+ title | Interactive Acquisition of
652
+ | Residential Floor Plans1
653
+ blank |
654
+ text | Acquiring an accurate floor plan of a residence is a challenging task, yet one that
655
+ | is required for many situations, such as remodeling or sale of a property. Original
656
+ | blueprints can be difficult to find, especially for older residences. In practice, contrac-
657
+ | tors and interior designers use point-to-point laser measurement devices to acquire
658
+ | a set of distance measurements. Based on these measurements, an expert creates a
659
+ | floor plan that respects the measurements and represents the layout of the residence.
660
+ | Both taking measurements and representing the layout are cumbersome manual tasks
661
+ | that require experience and time.
662
+ | In this chapter, we present a hand-held system for indoor architectural reconstruc-
663
+ | tion. This system eliminates the manual post-processing necessary for reconstructing
664
+ | the layout of walls in a residence. Instead, an operator with no architectural exper-
665
+ | tise can interactively guide the reconstruction process by moving freely through an
666
+ meta | 1
667
+ text | The contents of the chapter were originally published as Y.M. Kim, J. Dolson, M. Sokolsky, V.
668
+ | Koltun, S.Thrun, Interactive Acquisition of Residential Floor Plans, IEEE International Conference
669
+ | on Robotics and Automation (ICRA), 2012 © 2012 IEEE.
670
+ | In reference to IEEE copyrighted material which is used with permission in this thesis, the
671
+ | IEEE does not endorse any of Stanford University’s products or services. Internal or personal
672
+ | use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material
673
+ | for advertising or promotional purposes or for creating new collective works for resale or redis-
674
+ | tribution, please go to http://www.ieee.org/publications_standards/publications/rights/
675
+ | rights_link.html to learn how to obtain a License from RightsLink.
676
+ blank |
677
+ |
678
+ meta | 11
679
+ | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 12
680
+ blank |
681
+ |
682
716
+ |
717
+ |
718
+ |
719
+ |
720
+ text | Figure 2.1: Our hand-held system is composed of a projector, a Microsoft Kinect
721
+ | sensor, and an input button (left). The system uses augmented reality feedback
722
+ | (middle left) to project the status of the current model onto the environment and to
723
+ | enable real-time acquisition of residential wall layouts (middle right). The floor plan
724
+ | (middle right) and visualization (right) were generated using data captured by our
725
+ | system.
726
+ blank |
727
+ text | interior with the hand-held system until all walls have been observed by the sensor
728
+ | in the system.
729
+ | Our system is composed of a laptop connected to an RGB-D camera, a lightweight
730
+ | optical projector, and an input button interface (Figure 2.1, left). The RGB-D cam-
731
+ | era is a real-time depth sensor that acts as the main input modality. As noted in
732
+ | Chapter 1, we use the Microsoft Kinect, a lightweight commodity device that out-
733
+ | puts VGA-resolution range and color images at video rates. The data is processed
734
+ | in real time to create the floor plan by focusing on large flat surfaces and ignoring
735
+ | clutter. The generated floor plan can be used directly for remodeling or real-estate
736
+ | applications or to produce a 3D model of the interior for applications in virtual envi-
737
+ | ronments. In Section 2.4, we present and discuss a number of residential wall layouts
738
+ | reconstructed with our system, captured from actual apartments. Even though the
739
+ | results presented here focus on residential spaces, the system can also
740
+ | be used in other types of interior environments.
741
+ | The attached projector is initially calibrated to have an overlapping field of view
742
+ | with the same image center as the depth sensor. It projects the reconstruction status
743
+ | onto the surface being scanned. Under normal lighting, the projector does not provide
744
+ | a sophisticated rendering. Rather, the projection allows the user to visualize the
745
+ | reconstruction process. The user can then detect reconstruction errors that arise due
746
+ | to deficiencies in the data capture path and can complete missing data in response.
747
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 13
748
+ blank |
749
+ |
750
+ |
751
+ text | The user can also note which walls have been included in the model and easily resolve
752
+ | ambiguities with a simple input device. The proposed system has advantages over
753
+ | other previous applications by allowing a new type of user interaction in real time that
754
+ | focuses only on architectural elements relevant to the task at hand. This difference
755
+ | is discussed in detail in the following section.
756
+ blank |
757
+ |
758
+ title | 2.1 Related Work
759
+ text | A number of approaches have been proposed for indoor reconstruction in computer
760
+ | graphics, computer vision, and robotics. Real-time indoor reconstruction using either
761
+ | a depth sensor [HKH+ 12] or an optical camera [ND10] has been recently explored.
762
+ | The results of these studies suggest that the key to real-time performance is the
763
+ | fast registration of successive frames. Similar to [HKH+ 12], we fuse both color and
764
+ | depth information to register frames. Furthermore, our approach extends real-time
765
+ | acquisition and reconstruction by allowing the operator to visualize the current re-
766
+ | construction status without consulting a computer screen. Because the feedback loop
767
+ | in our system is immediate, the operator can resolve failures and ambiguities while
768
+ | the acquisition session is in progress.
769
+ | Previous approaches have also been limited to a dense 3-D reconstruction (reg-
770
+ | istration of point cloud data) with no higher-level information, which is memory
771
+ | intensive. A few exceptions include [GCCMC08], in which high-level fea-
772
+ | tures (lines and planes) are detected to reduce complexity and noise. The high-level
773
+ | structures, however, do not necessarily correspond to actual architectural elements,
774
+ | such as walls, floors, or ceilings. In contrast, our system identifies and focuses on
775
+ | significant architectural elements using the Manhattan-world assumption, which is
776
+ | based on the observation that many indoor scenes are largely rectilinear [CY99]. This
777
+ | assumption is widely made for indoor scene reconstruction from images to overcome
778
+ | the inherent limitations of image data [FCSS09][VAB10]. While the traditional stereo
779
+ | method only reconstructs 3-D locations of image feature points, the Manhattan-world
780
+ | assumption successfully fills an area between the sparse feature points during post-
781
+ | processing. Our system, based on the Manhattan-world assumption, differentiates
782
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 14
783
+ blank |
784
+ |
785
+ |
786
+ text | between architectural features and miscellaneous objects in the space, producing a
787
+ | clean architectural floor plan and simplifying the representation of the environment.
788
+ | Even with the Manhattan-world assumption, however, the system still cannot fully
789
+ | resolve ambiguities introduced by large furniture items and irregular features in the
790
+ | space without user input. The interactive capability offered by our system allows the
791
+ | user to easily disambiguate the situation and integrate new input into a global map
792
+ | of the space in real time.
793
+ | Not only does our system simplify the representation of the feature of a space, but
794
+ | by doing so it reduces the computational burden of processing a map. Employing the
795
+ | Manhattan-world assumption simplifies the map construction to a one-dimensional,
796
+ | closed-form problem. Registration of successive point clouds results in an accumula-
797
+ | tion of errors, especially for a large environment, and requires a global optimization
798
+ | step in order to build a consistent map. This is similar to reconstruction tasks en-
799
+ | countered in robotic mapping. In other approaches, the problem is usually solved by
800
+ | bundle adjustment, a costly off-line process [TMHF00][Thr02].
801
+ | The augmented reality component of our system is inspired by the SixthSense
802
+ | project [MM09]. Instead of simply augmenting a user’s view of the world, however,
803
+ | our projected output serves to guide an interactive reconstruction process. Directing
804
+ | the user in this way is similar to re-photography [BAD10], where a user is guided
805
+ | to capture a photograph from the same viewpoint as in a previous photograph. By
806
+ | using a micro-projector as the output modality, our system allows the operator to
807
+ | focus on interacting with the environment.
808
+ blank |
809
+ |
810
+ title | 2.2 System Overview and Usage
811
+ text | The data acquisition process is initiated by the user pointing the sensor to a corner,
812
+ | where three mutually orthogonal planes meet. This corner defines the Manhattan-
813
+ | world coordinate system. The attached projector indicates successful initialization by
814
+ | overlaying blue-colored planes with white edges onto the scene (Figure 2.2 (a)). After
815
+ | the initialization, the user scans each room individually as he or she loops around in
816
+ | it holding the device. If the movement is too fast or if there are not enough features,
817
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 15
818
+ blank |
819
+ |
820
+ |
821
+ |
822
+ text | [Figure 2.2 diagram labels: Fetch a new frame → Initialization → Pair-wise registration → Plane extraction → Global adjustment → Map update; user interaction: visual feedback on failure (adjust data path), select planes (left click), start a new room (right click); panels (a), (b), (c)]
845
+ blank |
846
+ |
847
+ text | Figure 2.2: System overview and usage. When an acquisition session is initiated by
848
+ | observing a corner, the user is notified by a blue projection (a). After the initial-
849
+ | ization, the system updates the camera pose by registering consecutive frames. If a
850
+ | registration failure occurs, the user is notified by a red projection and is required to
851
+ | adjust the data capture path (b). Otherwise, the updated camera configuration is
852
+ | used to detect planes that satisfy the Manhattan-world assumption in the environ-
853
+ | ment and to integrate them into the global map. The user interacts with the system
854
+ | by selecting planes in the space (c). When the acquisition session is completed, the
855
+ | acquired map is used to construct a floor plan consisting of user-selected planes.
856
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 16
857
+ blank |
858
+ |
859
+ |
860
+ text | a red projection on the surface guides the user to recover the position of the device
861
+ | (Figure 2.2 (b)) and re-acquire that area.
862
+ | The system extracts flat surfaces that align with the Manhattan coordinate system
863
+ | and creates complete rectilinear polygons, even when connectivity between planes is
864
+ | occluded. At times, the user might not want some of the extracted planes (parts
865
+ | of furniture or open doors) to be included in the model even if these planes satisfy
866
+ | the Manhattan-world assumption. In these cases, when the user clicks the input
867
+ | button (left click), the extracted wall toggles between inclusion in (indicated in blue)
+ | and exclusion from (indicated in grey) the model (Figure 2.2 (c)). As the user finishes
869
+ | scanning a room, he or she can move to another room and scan it. A new rectilinear
870
+ | polygon is initiated by a right click. Another rectilinear polygon is similarly created
871
+ | by including the selected planes, and the room is correctly positioned into the global
872
+ | coordinate system. The model is updated in real time and stored in either a CAD
873
+ | format or a 3-D mesh format that can be loaded into most 3-D modeling software.
874
+ blank |
875
+ |
876
+ title | 2.3 Data Acquisition Process
877
+ text | Some notation used throughout this section is introduced in Figure 2.3. At each
878
+ | time step t, the sensor produces a new frame of data, Ft = {Xt , It }, composed
879
+ | of a range image Xt (a 2-D array of depth measurements) and a color image It ,
880
+ | Figure 2.3(a). T t represents the transformation from the frame Ft , measured from
881
+ | the current sensor position, to the global coordinate system, which is where the map
882
+ | Mt = {Ltr , Rtr } is defined, Figure 2.3(b). Throughout the data capture session, the
883
+ | system maintains the global map Mt , and the two most recent frames, Ft−1 and Ft
884
+ | to update the transformation information. Instead of storing information from all
885
+ | frames, the system keeps the total computational and memory requirements minimal
886
+ | by incrementally updating the global map only with components that need to be
887
+ | added to the final model. Additionally, the frame with the last observed corner Fc is
888
+ | stored to recover the sensor position when lost.
889
+ | After the transformation is found, the relationship between the planes in global
890
+ | map Mt and the measurement in the current frame Xt is represented as Pt , a 2-D
891
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 17
892
+ blank |
893
+ |
894
+ |
895
+ |
896
+ text | [Figure 2.3 panels: (a) frame Ft with range image Xt and color image It; (b) observed
+ | planes Ltr in the global map after applying T t (Ft ); (c) per-pixel plane labels Pt;
+ | (d) rectilinear polygon Rtr built from the selected planes]
917
+ blank |
918
+ text | Figure 2.3: Notation and representation. Each frame of the sensor Ft is composed of
919
+ | a 2-D array of depth measurements Xt and color image It (a). The global map Mt
920
+ | is composed of a sequence of observed planes Ltr (b) and loops of rectilinear polygons
+ | built from the planes Rtr (d). After the registration T t of the current frame is found
+ | with respect to the global coordinate system, planes are extracted as Pt (c), and the
+ | system automatically updates the room structure Rtr based on the observation (d).
924
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 18
925
+ blank |
926
+ |
927
+ |
928
+ text | array of plane labels for each pixel, Figure 2.3(c). The map Mt is composed of lists of
929
+ | observed axis-parallel planes Ltr and loops of current room structure Rtr , defined with
930
+ | subsets of the planes from Ltr . Each plane has its axis label (x, y, or z) and the offset
931
+ | value (e.g., x = x0 ), as well as its left or right plane if the connectivity is observed. A
932
+ | plane can be selected (shown as solid line in Figure 2.3(b)) or ignored (dotted line in
933
+ | Figure 2.3(b)) based on user input. The selected planes are extracted from Ltr as the
934
+ | loop of the room Rtr , which can be converted into the floor plan as a 2-D rectilinear
935
+ | polygon. To have a fully connected rectilinear polygon per room, Rtr is constrained
936
+ | to have alternating axis labels (x and y). For the z direction (vertical direction), the
937
+ | system retains only the ceiling and the floor. The system also keeps the sequence of
938
+ | observations (S x , S y , and S z ) of offset values for each axis direction, and stores the
939
+ | measured distance and the uncertainty of the measurement between planes.
940
+ | The overall reconstruction process is summarized in Figure 2.2. As mentioned in
941
+ | Sec. 2.2, this process is initiated by extracting three mutually orthogonal planes when
942
+ | a user points the system to one of the corners of a room. To detect planes in the range
943
+ | data, our system fits plane equations to groups of range points and their corresponding
944
+ | normals using the RANSAC algorithm [FB81]: the system first randomly samples a
945
+ | few points, then fits a plane equation to them. The system then tests the detected
946
+ | plane by counting the number of points that can be explained by the plane equation.
947
+ | After convergence, the detected plane is classified as valid only if the detected points
948
+ | constitute a large, connected portion of the depth information within the frame. If
949
+ | there are three planes detected, and they are orthogonal to each other, our system
950
+ | assigns the x, y and z axes to be the normal directions of these three planes, which
951
+ | form the right-handed coordinate system for our Manhattan world. Now the map Mt
952
+ | has two planes (the floor or ceiling is excluded), and the transformation T t between
953
+ | Mt and Ft is also found.
954
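+ blank |
+ text | As an illustration of the plane-detection step just described, here is a minimal RANSAC
+ | sketch in Python/NumPy; it assumes points is an N x 3 array of range points, the iteration
+ | count and inlier threshold are illustrative, and the connected-component and orthogonality
+ | checks described above are omitted. Three such planes with mutually orthogonal normals
+ | would then define the Manhattan axes.
+ blank |
+ | import numpy as np
+ |
+ | def ransac_plane(points, iters=200, inlier_thresh=0.02, rng=None):
+ |     # Minimal RANSAC plane fit: returns ((normal, offset), inlier_count).
+ |     # inlier_thresh is an illustrative distance threshold in meters.
+ |     rng = np.random.default_rng() if rng is None else rng
+ |     best_count, best_plane = 0, None
+ |     for _ in range(iters):
+ |         sample = points[rng.choice(len(points), 3, replace=False)]
+ |         n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
+ |         norm = np.linalg.norm(n)
+ |         if norm < 1e-9:
+ |             continue  # degenerate sample (collinear points)
+ |         n /= norm
+ |         d = -n.dot(sample[0])
+ |         count = int(np.sum(np.abs(points @ n + d) < inlier_thresh))
+ |         if count > best_count:
+ |             best_count, best_plane = count, (n, d)
+ |     return best_plane, best_count
+ blank |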
+ | A new measurement Ft is registered with the previous frame Ft−1 by aligning
955
+ | depth and color features (Sec. 2.3.1). This registration is used to update T t−1 to a
956
+ | new transformation T t . The system extracts planes that satisfy the Manhattan-world
957
+ | assumption from T t (Ft ) (Sec. 2.3.2). If the extracted planes already exist in Ltr , the
958
+ | current measurement is compared with the global map and the registration is refined
959
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 19
960
+ blank |
961
+ |
962
+ |
963
+ |
964
+ text | (a) (b) (c) (d)
965
+ blank |
966
+ text | Figure 2.4: (a) Flat wall features (depicted by the triangle and circle) are observed
967
+ | from two different locations. Diagram (b) shows both observations with respect to
968
+ | the camera coordinate system. Without features, using projection-based ICP can
969
+ | lead to registration errors in the image-plane direction (c), while the use of features
970
+ | will provide better registration (d).
971
+ blank |
972
+ text | (Sec. 2.3.3). If there is a new plane extracted, or if there is user input to specify the
973
+ | map structure, the map is updated accordingly (Sec. 2.3.4).
974
+ blank |
975
+ |
976
+ title | 2.3.1 Pair-Wise Registration
977
+ text | To propagate information from previous frames and to detect new planes in the scene,
978
+ | each incoming frame must be registered with respect to the global coordinate system.
979
+ | To start this process, the system finds the relative registration between the two most
980
+ | recent frames, Ft−1 and Ft . By using both the depth point clouds (Xt−1 , Xt ) and
981
+ | optical images (It−1 , It ), the system can efficiently register frames in real time (about
982
+ | 15 fps).
983
+ | Given two sets of point clouds, X^{t-1} = { x_i^{t-1} }_{i=1}^{N} and X^t = { x_i^t }_{i=1}^{N}, and the
+ | transformation for the previous point cloud T^{t-1}, the correct rigid transformation T^t
+ | will minimize the error between correspondences in the two sets:
+ blank |
+ text | \min_{y_i^t, T^t} \sum_i \| w_i ( T^{t-1}(x_i^{t-1}) - T^t(y_i^t) ) \|^2          (2.1)
+ blank |
+ text | y_i^t \in X^t is the corresponding point for x_i^{t-1} \in X^{t-1}. Once the correspondence is
995
+ | known, minimizing Eq. (2.1) becomes a closed-form solution [BM92]. In conventional
996
+ | approaches, correspondence is found by searching for the closest point, which is com-
997
+ | putationally expensive. Real-time registration methods reduce the cost by projecting
998
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 20
999
+ blank |
1000
+ |
1001
+ |
1002
+ |
1003
+ text | (a) it−1 ∈ It−1 (b) j t ∈ It (c) Ht (It−1 ) (d) |I t − Ht (It−1 )|
1004
+ blank |
1005
+ text | Figure 2.5: From optical flow between two consecutive frames, sparse image features
1006
+ | are matched between (a) it−1 ∈ It−1 and (b) j t ∈ It . The matched features are then
1007
+ | used to calculate homography Ht such that the previous image It−1 can be warped to
1008
+ | the space of the current image It and create dense projective correspondences (c). The
1009
+ | difference image (d) shows that most of the dense correspondences are within a few
+ | pixels of error in the image plane, with a slight offset around silhouette areas.
1011
+ blank |
1012
+ text | the 3-D points onto a 2-D image plane and assigning correspondences to points that
1013
+ | project onto the same pixel locations [RL01]. However, projection will only reduce the
1014
+ | distance in the ray direction; the offset parallel to the image plane cannot be adjusted.
1015
+ | This phenomenon can result in the algorithm not compensating for the translation
1016
+ | parallel to the plane and therefore shrinking the size of the room (Figure 2.4).
1017
+ | Our pair-wise registration is similar to [RL01], but it compensates for the dis-
1018
+ | placement parallel to the image plane using image features and silhouette points.
1019
+ | Intuitively, the system uses homography to compensate for errors parallel to the
1020
+ | plane if the structure can be approximated by a plane, and silhouette points are
1021
+ | used to compensate for remaining errors when the features are not planar.
1022
+ | Our system first computes the optical flow between color images It and It−1 and
1023
+ | finds a sparse set of features matched between them, Figure 2.5(a)(b). The sparse set
1024
+ | of features then can be used to create dense projective correspondence between the
1025
+ | two frames, Figure 2.5(c)(d). More specifically, homography is a transform between
1026
+ | 2-D homogeneous coordinates defined by a matrix H ∈ R3×3 :
1027
+ blank |
1028
+ text | \min_{H} \sum_{i^{t-1}, j^t} \| H i^{t-1} - j^t \|^2 ,   where
+ | i^{t-1} = ( u_i , v_i , 1 )^T \in I^{t-1}  and  j^t = ( w u_j , w v_j , w )^T \in I^t          (2.2)
1038
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 21
1039
+ blank |
1040
+ |
1041
+ |
1042
+ |
1043
+ text | Figure 2.6: Silhouette points. There are two different types of depth discontinuity:
1044
+ | the boundaries of a shadow made on the background by a foreground object (empty
1045
+ | circles), and the boundaries of a foreground object (filled circles). The meaningful
1046
+ | depth features are the foreground points, which are the silhouette points used for our
1047
+ | registration pipeline.
1048
+ blank |
1049
+ text | Compared to naive projective correspondence used in [RL01], a homography de-
1050
+ | fines a map between two planar surfaces in 3-D space. The homography represents
1051
+ | the displacement parallel to the image plane, and is used to compute dense corre-
1052
+ | spondences between the two frames. While a homography does not represent a full
1053
+ | transformation in 3-D, the planar approximation works well in practice for our sce-
1054
+ | nario, where the scene is mostly composed of flat planes and the relative movement is
1055
+ | small. From the second iteration, the correspondence is found by projecting individual
1056
+ | points onto the image plane, as shown in [RL01].
1057
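+ blank |
+ text | A sketch of how the homography of Eq. (2.2) could be estimated from the matched sparse
+ | features with a direct linear transform (DLT); the function names are illustrative, and
+ | normalization and outlier rejection are omitted.
+ blank |
+ | import numpy as np
+ |
+ | def fit_homography(src, dst):
+ |     # Estimate H (3x3) mapping src -> dst from matched 2-D points via DLT.
+ |     # src, dst: (N, 2) arrays of pixel coordinates, N >= 4.
+ |     rows = []
+ |     for (u, v), (x, y) in zip(src, dst):
+ |         rows.append([-u, -v, -1, 0, 0, 0, u * x, v * x, x])
+ |         rows.append([0, 0, 0, -u, -v, -1, u * y, v * y, y])
+ |     A = np.asarray(rows)
+ |     # h is the right singular vector of A with the smallest singular value
+ |     _, _, vt = np.linalg.svd(A)
+ |     H = vt[-1].reshape(3, 3)
+ |     return H / H[2, 2]
+ |
+ | def warp_points(H, pts):
+ |     # Apply H to (N, 2) points and dehomogenize (dense projective correspondence).
+ |     p = np.c_[pts, np.ones(len(pts))] @ H.T
+ |     return p[:, :2] / p[:, 2:3]
+ blank |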
+ | Given the correspondence, the registration between the frames for the current iter-
1058
+ | ation can be given as a closed-form solution (Equation 2.1). Additionally, the system
1059
+ | modifies the correspondence for silhouette points (points of depth discontinuity in
1060
+ | the foreground, shown in Figure 2.6). For silhouette points in Xt−1 , the system finds
1061
+ | the closest silhouette points in Xt within a small search window from the original
1062
+ | corresponding location. If the matching silhouette point exists, the correspondence is
1063
+ | weighted more. (We used wi = 100 for silhouette points and wi = 1 for non-silhouette
1064
+ | points.) The process iterates until it converges.
1065
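+ blank |
+ text | For reference, a sketch of the weighted closed-form alignment used once correspondences
+ | are fixed (Eq. (2.1), cf. [BM92]); src and dst are matched (N, 3) arrays and w holds the
+ | per-point weights (e.g., 100 for silhouette matches, 1 otherwise). This is a generic
+ | weighted Kabsch-style solve, not necessarily the authors' exact implementation.
+ blank |
+ | import numpy as np
+ |
+ | def weighted_rigid_transform(src, dst, w):
+ |     # Closed-form (R, t) minimizing sum_i w_i || R src_i + t - dst_i ||^2.
+ |     w = w / w.sum()
+ |     mu_s = (w[:, None] * src).sum(0)
+ |     mu_d = (w[:, None] * dst).sum(0)
+ |     S = (src - mu_s).T @ (w[:, None] * (dst - mu_d))   # weighted cross-covariance
+ |     U, _, Vt = np.linalg.svd(S)
+ |     D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
+ |     R = Vt.T @ D @ U.T
+ |     t = mu_d - R @ mu_s
+ |     return R, t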
+ blank |
1066
+ title | Registration Failure
1067
+ blank |
1068
+ text | The real-time registration is a crucial part of our algorithm for accurate reconstruc-
1069
+ | tion. Even with the hybrid approach in which both color and depth features are used,
1070
+ | the registration can fail, and it is important to detect the failure immediately and
1071
+ | to recover the position of the sensor. The registration failure is detected either (1)
1072
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 22
1073
+ blank |
1074
+ |
1075
+ |
1076
+ text | if the pair-wise registration does not converge or (2) if there were not enough color
1077
+ | and depth features. The first case can be easily detected as the algorithm runs. The
1078
+ | second case is detected if the optical flow does not yield a homography (i.e., there is a
+ | lack of color features) and there are not enough matched silhouette points (i.e., there
+ | is a lack of depth features).
1081
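+ blank |
+ text | The failure test itself reduces to a small predicate; a sketch, with the minimum
+ | silhouette-match count as an illustrative parameter:
+ blank |
+ | def registration_failed(converged, homography_found, n_silhouette_matches,
+ |                         min_silhouette_matches=50):
+ |     # Failure cases: (1) pair-wise registration did not converge, or
+ |     # (2) neither color nor depth features were sufficient.
+ |     lacks_color = not homography_found
+ |     lacks_depth = n_silhouette_matches < min_silhouette_matches
+ |     return (not converged) or (lacks_color and lacks_depth)
+ blank |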
+ | In cases of registration failure, the projected image turns red, indicating that the
1082
+ | user should return the system’s viewpoint to the most recently observed corner. This
1083
+ | movement usually takes only a small amount of back-tracking because the failure
1084
+ | is detected within milliseconds of leaving the previous successfully registered area.
1085
+ | Similar to the initialization step, the system extracts planes from Xt using RANSAC
1086
+ | and matches the planes with the desired corner. Figure 2.2 (b) depicts the process of
1087
+ | overcoming a registration failure. The user then deliberately moves the sensor along
1088
+ | the path with richer features or steps farther from a wall to cover a wider view.
1089
+ blank |
1090
+ |
1091
+ title | 2.3.2 Plane Extraction
1092
+ text | Based on the transformation T t , the system extracts axis-aligned planes and asso-
1093
+ | ciated edges. The planes and detected features will provide higher-level information
1094
+ | that relates the raw point cloud Xt to the global map Mt . Because the system only
1095
+ | considers the planes that satisfy the Manhattan-world coordinate system, we were
1096
+ | able to simplify the plane detection procedure.
1097
+ | The planes from the previous frame that remain visible can be easily found by
1098
+ | using the correspondence. From the pair-wise registration (Sec. 2.3.1), our system
1099
+ | has the point-wise correspondence between the previous frame and the current frame.
1100
+ | The plane label Pt−1 from the previous frame is updated simply by being copied over
1101
+ | to the corresponding location. Then, the system refines Pt by alternating between
1102
+ | fitting points and fitting parameters.
1103
+ | A new plane can be found by projecting the remaining points onto the x, y, and z axes.
+ | For each axis direction, a histogram is built with a bin size of 20 cm. The system then
1105
+ | tests the plane equation for populated bins. Compared to the RANSAC procedure
1106
+ | for initialization, the Manhattan-world assumption reduces the number of degrees of
1107
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 23
1108
+ blank |
1109
+ |
1110
+ |
1111
+ text | freedom from three to one, making plane extraction more efficient.
1112
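+ blank |
+ text | A sketch of this histogram search for new axis-aligned planes, assuming the points have
+ | already been transformed into the Manhattan coordinate system; the 20 cm bin size follows
+ | the text, while the population threshold is illustrative.
+ blank |
+ | import numpy as np
+ |
+ | def axis_plane_candidates(points, axis, bin_size=0.20, min_count=500):
+ |     # Histogram the coordinates of points (N, 3) along one axis (0, 1, or 2) and
+ |     # return candidate plane offsets from well-populated bins; each candidate is
+ |     # then verified against the plane equation as described above.
+ |     coords = points[:, axis]
+ |     edges = np.arange(coords.min(), coords.max() + bin_size, bin_size)
+ |     counts, edges = np.histogram(coords, bins=edges)
+ |     centers = 0.5 * (edges[:-1] + edges[1:])
+ |     return centers[counts >= min_count]
+ blank |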
+ | For extracted planes, the boundary edges are also extracted; the system detects
1113
+ | groups of boundary points that can be explained by an axis-parallel line segment.
1114
+ | The system also retains the information about relative positions for extracted planes
1115
+ | (left/right). As long as the sensor is not flipped upside-down, this information pro-
1116
+ | vides an important cue to build a room with the correct topology, even when the
1117
+ | connectivity between neighboring planes has not been observed.
1118
+ blank |
1119
+ title | Data Association
1120
+ blank |
1121
+ text | After the planes are extracted, the data association process finds the link between the
+ | global map Mt and the extracted planes, represented as Pt , a 2-D array of plane labels
+ | for each pixel. The system automatically finds plane labels that existed in the previous
+ | frame and extracts the planes by copying over the plane labels using correspondences.
1125
+ | The plane labels for the newly detected plane can be found by comparing T t (Ft )
1126
+ | and Mt . In addition to the plane equation, the relative position of the newly observed
1127
+ | plane with respect to other observed planes is used to label the plane. If the plane
1128
+ | has not been previously observed, a new plane will be added into Ltr based on the
1129
+ | left-right information.
1130
+ | After the data association step, the system updates the sequence of observation
1131
+ | S. The planes that have been assigned as previously observed are used for global
1132
+ | adjustment (Sec. 2.3.3). If a new plane is observed, the room Rtr will be updated
1133
+ | accordingly (Sec. 2.3.4).
1134
+ blank |
1135
+ |
1136
+ title | 2.3.3 Global Adjustment
1137
+ text | Due to noise in the point cloud, frame-to-frame registration is not perfect, and er-
1138
+ | ror accumulates over time. This is a common problem in pose estimation. Large-
1139
+ | scale localization approaches use bundle adjustment to compensate for error accumula-
1140
+ | tion [TMHF00, Thr02]. Enforcing this global constraint involves detecting landmark
1141
+ | objects, or stationary objects observed at different times during a sequence of mea-
1142
+ | surements. Usually this global adjustment becomes an optimization problem in many
1143
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 24
1144
+ blank |
1145
+ |
1146
+ |
1147
+ |
1148
+ text | Figure 2.7: As errors accumulate in T t and in measurements, the map Mt becomes
1149
+ | inconsistent. By comparing previous and recent measurements, the system can correct
1150
+ | for inconsistency and update the value of c such that c = a.
1151
+ blank |
1152
+ text | dimensions. The problem is formulated by constraining the landmarks to predefined
1153
+ | global locations, and by solving an energy function that encodes noise in a pose es-
1154
+ | timation of both sensor and landmark locations. The Manhattan-world assumption
1155
+ | allows us to reduce the error accumulation efficiently in real time by refining our
1156
+ | registration estimate and by optimizing the global map.
1157
+ blank |
1158
+ title | Refining the Registration
1159
+ blank |
1160
+ text | After data association, the system performs a second round of registration with re-
1161
+ | spect to the global map Mt to reduce the error accumulation in T t by incremental,
1162
+ | pair-wise registration. The extracted planes Pt , if already observed by the system,
1163
+ | have been assigned to the planes in Mt that have associated plane equations. For
1164
+ | example, suppose a point T t (xu,v ) = (x, y, z) has a plane label Pt (u, v) = pk (assigned
1165
+ | to plane k). If plane k has normal parallel to the x axis, the plane equation in the
1166
+ | global map Mt can be written as x = x0 (x0 ∈ R). Consequently, the registration
1167
+ | should be refined to minimize kx − x0 k2 . In other words, the refined registration can
1168
+ | be found by defining the corresponding point for xu,v as (x0 , y, z). The corresponding
1169
+ | points are likewise assigned for every point with a plane assignment in Pt . Given the
1170
+ | correspondence, the system can refine the registration between the current frame Ft
1171
+ | and the global map Mt . This second round of registration reduces the error in the
1172
+ | axis direction. In our example, the refinement is active while the plane x = x0 is
1173
+ | visible and reduces the uncertainty in the x direction with respect to the global map.
1174
+ | The error in the x direction is not accumulated during the interval.
1175
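+ blank |
+ text | A sketch of how the plane-constrained correspondences for this second round could be
+ | formed: each point with a plane label is paired with a copy of itself whose coordinate
+ | along the plane's axis is snapped to the plane's offset. The container layout (dicts
+ | keyed by plane id) is an assumption.
+ blank |
+ | import numpy as np
+ |
+ | def plane_constrained_targets(points, labels, plane_axis, plane_offset):
+ |     # points: (N, 3) in the global frame; labels: (N,) plane ids or -1 for unlabeled.
+ |     # plane_axis[k] in {0, 1, 2} and plane_offset[k] give the equation of plane k,
+ |     # e.g. x = x0. The returned targets feed the closed-form alignment above.
+ |     targets = points.copy()
+ |     for k in np.unique(labels):
+ |         if k < 0:
+ |             continue  # unlabeled points contribute no constraint
+ |         mask = labels == k
+ |         targets[mask, plane_axis[k]] = plane_offset[k]
+ |     return targets
+ blank |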
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 25
1176
+ blank |
1177
+ |
1178
+ |
1179
+ title | Optimizing the Map
1180
+ blank |
1181
+ text | As error accumulates, the reconstructed map Mt may also require global adjust-
1182
+ | ment in each axis direction. The Manhattan-world assumption simplifies this global
1183
+ | optimization into two separate, one-dimensional problems (we are excluding the z
1184
+ | direction for now, but the idea can be extended to a 3-D case).
1185
+ | Figure 2.7 shows a simple example in the x-axis direction. Let us assume that
1186
+ | the figure represents an overhead view of a rectangular room. There should be two
1187
+ | walls whose normals are parallel to the x-axis. The sensor detects the first wall
1188
+ | (x = a), sweeps around the room, observes another wall (x = b), and returns to
1189
+ | the previously observed wall. Because of error accumulation, parts of the same wall
1190
+ | have two different offset values (x = a and x = c), but by observing the left-right
1191
+ | relationship between walls, the system infers that the two walls are indeed the same
1192
+ | wall.
1193
+ | To optimize the offset values, the system tracks the sequence of observations
1194
+ | S x = {a, b, c} and the variances at the point of observation for each wall, as well as the
1195
+ | constraints represented by the pair of the same offset values C x = {(c11 , c12 ) = (a, c)}.
1196
+ | We introduce two random variables, ∆1 and ∆2 , to constrain the global map op-
1197
+ | timization. ∆1 is a random variable with mean m1 = b − a and variance σ12 that
1198
+ | represents the error between the moment when the sensor observes the x = a wall
1199
+ | and the moment it observes the x = b wall. Likewise, a random variable ∆2 represents
1200
+ | the error with mean m2 = c − b and variance σ22 .
1201
+ | Whenever a new constraint is added, or when the system observes a plane that
1202
+ | was previously observed, the global adjustment routine is triggered. This is usually
1203
+ | when the user finishes scanning a room by looping around it and returning to the
1204
+ | first wall measured. By confining the axis direction, the global adjustment becomes
1205
+ | a one-dimensional quadratic program:
1206
+ blank |
+ text | \min_{S^x} \sum_i \frac{ \| \Delta_i - m_i \|^2 }{ \sigma_i^2 }   s.t.   c_{j1} = c_{j2} ,  \forall (c_{j1}, c_{j2}) \in C^x          (2.3)
1212
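+ blank |
+ text | A sketch of this one-dimensional adjustment for the single-loop case of Figure 2.7,
+ | where the constraint c = a is enforced by distributing the loop-closure residual over
+ | the increments in proportion to their variances; array names are illustrative.
+ blank |
+ | import numpy as np
+ |
+ | def adjust_loop(offsets, variances):
+ |     # offsets: observed sequence S^x, e.g. [a, b, c], where the last offset should
+ |     # equal the first (c = a); variances: uncertainty of each increment between
+ |     # consecutive observations. Returns corrected offsets with the loop closed.
+ |     offsets = np.asarray(offsets, dtype=float)
+ |     var = np.asarray(variances, dtype=float)
+ |     m = np.diff(offsets)                     # measured increments Delta_i
+ |     residual = m.sum()                       # c - a; zero when the loop closes
+ |     delta = m - var * residual / var.sum()   # constrained least-squares increments
+ |     return offsets[0] + np.concatenate(([0.0], np.cumsum(delta)))
+ blank |
+ text | For example, with S^x = {0.0, 4.0, 0.3} and equal variances, the 0.3 residual is split
+ | evenly and the corrected offsets become {0.0, 3.85, 0.0}.
+ blank |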
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 26
1213
+ blank |
1214
+ |
1215
+ |
1216
+ |
1217
+ text | Figure 2.8: Selection. In sequence (a), the user is observing two new planes in the
1218
+ | scene (colored white) and one currently included plane (colored blue). The user selects
1219
+ | one of the new planes by pointing at it and clicking. Then, the second new plane is
1220
+ | added. All planes are blue in the final frame, confirming that all planes have been
1221
+ | successfully selected. Sequence (b) shows a configuration where the user has decided
1222
+ | not to include the large cabinet. Sequence (c) shows successful selection of the ceiling
1223
+ | and the wall despite clutter.
1224
+ blank |
1225
+ title | 2.3.4 Map Update
1226
+ text | Our algorithm ignores most irrelevant features by using the Manhattan-world as-
1227
+ | sumption. However, the system cannot distinguish architectural components from
1228
+ | other axis-aligned objects using the Manhattan-world assumption. For example, fur-
1229
+ | niture, open doors, parts of other rooms that might be visible, or reflections from
1230
+ | mirrors may be detected as axis-aligned planes. The system solves the challenging
1231
+ | cases by allowing the user to manually specify the planes that he or she would like to
1232
+ | include in the final model. This manual specification consists of simply clicking the
1233
+ | input button during scanning when pointing at a plane, as shown in Figure 2.8. If
1234
+ | the user enters a new room, a right click of the button indicates that the user wishes
1235
+ | to include this new room and to optimize it individually. The system creates a new
1236
+ | loop of planes, and any newly observed planes are added to the loop.
1237
+ | Whenever a new plane is added to Ltr or there is user input to specify the room
1238
+ | structure, the map update routine extracts a 2-D rectilinear polygon Rtr from Ltr with
1239
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 27
1240
+ blank |
1241
+ |
1242
+ |
1243
+ text | [Figure 2.9 pie chart, unit: ms; per-step shares for data i/o, image pre-processing,
+ | optical flow, pair-wise registration, plane extraction, data association, refine
+ | registration, and optimize map]
1252
+ blank |
1253
+ text | Figure 2.9: The average computational time for each step of the system.
1254
+ blank |
1255
+ text | the help of user input. A valid rectilinear polygon structure should have alternating
1256
+ | axis directions for any pair of adjacent walls (an x = xi wall should be connected to
1257
+ | a y = yj wall). The system starts by adding all selected planes into Rtr as well as
1258
+ | whichever unselected planes in Ltr are necessary to have alternating axis direction.
1259
+ | When planes are added, the planes with observed boundary edges are preferred. If
1260
+ | the two observed walls have the same axis direction, the unobserved wall is added
1261
+ | between them on the boundary of the planes to form a complete loop.
1262
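+ blank |
+ text | A simplified sketch of this alternating-axis rule: walk the ordered selected walls and
+ | insert a perpendicular connector whenever two consecutive walls share an axis. In the
+ | real system the connector's offset comes from the observed plane boundaries; here it
+ | is left unresolved.
+ blank |
+ | def complete_rectilinear_loop(walls):
+ |     # walls: ordered list of (axis, offset) with axis in {'x', 'y'} around one room.
+ |     loop = []
+ |     n = len(walls)
+ |     for i, wall in enumerate(walls):
+ |         loop.append(wall)
+ |         nxt = walls[(i + 1) % n]
+ |         if wall[0] == nxt[0]:              # same axis: a connector wall is needed
+ |             other = 'y' if wall[0] == 'x' else 'x'
+ |             loop.append((other, None))     # offset resolved from plane boundaries
+ |     return loop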
+ blank |
1263
+ |
1264
+ title | 2.4 Evaluation
1265
+ text | The goal of the system is to build a floor plan of any interior environment.
+ | In our testing of the system, we mapped the apartments of six different volunteers,
+ | ranging from approximately 500 to 2000 ft2 and located in Palo Alto. The residents were living
1268
+ | in the scanned places and thus the apartments exhibited different amounts and types
1269
+ | of objects.
1270
+ | For each data set, we compare the floor plan generated by our system with one
1271
+ | manually-generated using measurements from a commercially available measuring
1272
+ | device.1 The current practice in architecture and real estate is to use a point-to-
1273
+ | point laser device to measure distances between pairs of parallel planes. Making
1274
+ | such measurements requires a clear, level line of sight between two planes, which
1275
+ meta | 1
1276
+ text | measuring range 0.05 to 40m; average measurement accuracy +/- 1.5mm; measurement duration
1277
+ | < 0.5s to 4s per measurement.
1278
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 28
1279
+ blank |
1280
+ |
1281
+ |
1282
+ text | may be time-consuming to find due to the presence of furniture, windows, and other
1283
+ | obstructions. Moreover, after making all the distance measurements, a user is required
1284
+ | to manually draw a floor plan that respects the measurements. In our tests, roughly
1285
+ | 10-20 minutes were needed to build a floor plan of each apartment in the conventional
1286
+ | way as described.
1287
+ | Using our system, the data acquisition process took approximately 2-5 minutes per
1288
+ | apartment to initiate, run, and generate the full floor plan. Table 2.1 summarizes the
1289
+ | timing data for each data set. The average frame rate is 7.5 frames per second running
1290
+ | on an Intel 2.50GHz Dual Core laptop. Figure 2.9 depicts the average computational
1291
+ | time for each step of the algorithm. The pair-wise registration routine (Sec. 2.3.1)
+ | contributes more than half of the computational time, followed by the pre-processing
1293
+ | step of fetching a new frame and calculating optical flow (25%).
1294
+ | In Figure 2.10, we visually compare the floor plans reconstructed in a conventional
1295
+ | way with those built by our system. The floor plans in blue were reconstructed using
1296
+ | point-to-point laser measurements, and the floor plans in red were reconstructed by
1297
+ | our system. For each apartment, the topology of the reconstructed walls agrees with
1298
+ | the manually-constructed floor plan. In all cases the detection and labeling of planar
1299
+ | surfaces by our algorithm enabled the user to add or remove these surfaces from
1300
+ | the model in real time, allowing the final model to be constructed using only the
1301
+ | important architectural elements from the scene.
1302
+ | The overlaid floor plans in Figure 2.10(c) show that the relative placement of
1303
+ | the rooms may be misaligned. This is because our global adjustment routine optimizes
1304
+ | rooms individually, thus errors can accumulate in transitions between rooms. The
1305
+ | algorithm could be extended to enforce global constraints on the relative placement
1306
+ | of rooms, such as maintaining a certain wall thickness and/or aligning the outer-most
1307
+ | walls, but such global constraints may induce other errors.
1308
+ | Table 2.1 contains a quantitative comparison of the errors. The reported depth
1309
+ | resolution of the sensor is 0.01m at 2m, and for each model we have an average of
1310
+ | 0.075 m error per wall. The relative error stays in the range of 2-5%, which shows
+ | that small registration errors continue to accumulate as more frames are processed.
1313
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 29
1314
+ blank |
1315
+ |
1316
+ text | data set   no. of frames   run time   fps    average error (m)   average error (%)
+ |    1           1465         2m 56s     8.32        0.115                4.14
+ |    2           1009         1m 57s     8.66        0.064                1.90
+ |    3           2830         5m 19s     8.88        0.053                2.40
+ |    4           1129         2m 39s     7.08        0.088                2.34
+ |    5           1533         3m 52s     6.59        0.178                3.52
+ |    6           2811         7m  4s     6.65        0.096                3.10
+ |  ave.          1795         3m 57s     7.54        0.075                2.86
1326
+ blank |
1327
+ text | Table 2.1: Accuracy comparison between floor plans reconstructed by our system, and
1328
+ | manually constructed floor plans generated from point-to-point laser measurements.
1329
+ blank |
1330
+ text | Fundamentally, the limitations of our method reflect the limitations of the Kinect
+ | sensor, the processing power of the laptop, and the assumptions made in our
+ | approach. Because the accuracy of real-time depth data is worse than that from
1333
+ | visual features, our approach exhibits larger errors compared to visual SLAM (e.g.,
1334
+ | [ND10]). Some of the uncertainty can be reduced by adapting approaches from the
1335
+ | well-explored visual SLAM literature. Still, the system is limited when meaningful
1336
+ | features can not be detected. The Kinect sensor’s reported measurement range is
1337
+ | between 1.2 and 3.5m from an object; outside that range, data is noisy or unavailable.
1338
+ | As a consequence, data in narrow hallways or large atriums was difficult to collect.
1339
+ | Another source of potential error is a user outpacing the operating rate of approx-
1340
+ | imately 7.5 fps. This frame rate already allows for a reasonable data capture pace,
1341
+ | but with more processing power, the pace of the system could always be guaranteed
1342
+ | to exceed normal human motion.
1343
+ blank |
1344
+ |
1345
+ title | 2.5 Conclusions and Future Work
1346
+ text | We have presented an interactive system that allows a user to capture accurate ar-
1347
+ | chitectural information and to automatically generate a floor plan. Leveraging the
1348
+ | Manhattan-world assumption, we have created a representation that is tractable in
1349
+ | real time while ignoring clutter. In the presented system, the current status of the
1350
+ | reconstruction is projected on the scanned environment to enable the user to provide
1351
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 30
1352
+ blank |
1353
+ |
1354
+ |
1355
+ text | high-level feedback to the system. This feedback helps overcome ambiguous situa-
1356
+ | tions and allows the user to interactively specify the important planes that should be
1357
+ | included in the model.
1358
+ | If there are not enough features scanned for the system to determine that the
1359
+ | operator has moved, the system will assume that motion has not occurred, leading to
1360
+ | general underestimation of wall lengths when no depth or image features are available.
1361
+ | The challenges can be overcome by including an IMU or other devices to assist in the
1362
+ | pose tracking of the system.
1363
+ | We have limited our Manhattan-world features to axis-aligned planes in vertical
1364
+ | directions. However, in future work, we could generalize the system to handle rec-
1365
+ | tilinear polyhedra which are not convex in the vertical direction. Furthermore, the
1366
+ | world could be expanded to include walls that are not aligned with the axes of the
1367
+ | global coordinate system.
1368
+ | More broadly, our interactive system can be extended to other applications in
1369
+ | indoor environments. For example, a user could visualize modifications to the space
1370
+ | shown in Figure 2.11, where we show a user clicking and dragging a cursor across a
1371
+ | plane to “add” a window. This example illustrates the range of possible uses of our
1372
+ | system.
1373
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 31
1374
+ blank |
1375
+ |
1376
+ |
1377
+ |
1378
+ text | house 1
1379
+ blank |
1380
+ |
1381
+ |
1382
+ |
1383
+ text | house 2
1384
+ blank |
1385
+ |
1386
+ |
1387
+ |
1388
+ text | house 3
1389
+ blank |
1390
+ |
1391
+ |
1392
+ |
1393
+ text | house 4
1394
+ blank |
1395
+ |
1396
+ |
1397
+ |
1398
+ text | house 5
1399
+ blank |
1400
+ |
1401
+ |
1402
+ |
1403
+ text | house 6
1404
+ | (a) (b) (c)
1405
+ blank |
1406
+ text | Figure 2.10: (a) Manually constructed floor plans generated from point-to-point laser
1407
+ | measurements, (b) floor plans acquired with our system, and (c) overlay. For house
1408
+ | 4, some parts (pillars in large open space, stairs, and an elevator) are ignored by the
1409
+ | user. The system still uses the measurements from those parts and other objects to
1410
+ | correctly understand the relative positions of the rooms.
1411
+ meta | CHAPTER 2. INTERACTIVE ACQUISITION OF RESIDENTIAL FLOOR PLANS1 32
1412
+ blank |
1413
+ |
1414
+ |
1415
+ |
1416
+ text | Figure 2.11: The system, having detected the planes in the scene, also allows the user
1417
+ | to interact directly with the physical world. Here the user adds a window to the room
1418
+ | by dragging a cursor across the wall (left). This motion updates the internal model
1419
+ | of the world (right).
1420
+ meta | Chapter 3
1421
+ blank |
1422
+ title | Acquiring 3D Indoor Environments
1423
+ | with Variability and Repetition2
1424
+ blank |
1425
+ text | Unlike mapping of urban environments, interior mapping focuses on interior
+ | objects, which can be geometrically complex, located in cluttered settings, and subject to
+ | significant variations. In addition, the indoor 3-D data captured from RGB-D cameras
1428
+ | suffer from limited resolution and data quality. The process is further complicated
1429
+ | when the model deforms between successive acquisitions. The work described in this
1430
+ | chapter focused on acquiring and understanding objects in interiors of public buildings
1431
+ | (e.g., schools, hospitals, hotels, restaurants, airports, train stations) or office buildings
1432
+ | from RGB-D camera scans of such interiors.
1433
+ | We exploited three observations to make the problem of indoor 3D acquisition
1434
+ | tractable: (i) most such building interiors are composed of basic elements such as
1435
+ | walls, doors, windows, furniture (e.g., chairs, tables, lamps, computers, cabinets),
1436
+ | which come from a small number of prototypes and repeat many times. (ii) such
1437
+ | building components usually consist of rigid parts of simple geometry, i.e., they have
1438
+ | surfaces that are well approximated by planar, cylindrical, conical, spherical proxies.
1439
+ | Further, although variability and articulation are dominant (e.g., a chair is moved
1440
+ meta | 2
1441
+ text | The contents of this chapter were originally published as Young Min Kim, Niloy J. Mitra,
1442
+ | Dong-Ming Yan, and Leonidas Guibas. 2012. Acquiring 3D indoor environments with vari-
1443
+ | ability and repetition. ACM Trans. Graph. 31, 6, Article 138 (November 2012), 11 pages.
1444
+ | DOI=10.1145/2366145.2366157 http://doi.acm.org/10.1145/2366145.2366157.
1445
+ blank |
1446
+ |
1447
+ meta | 33
1448
+ | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 34
1449
+ blank |
1450
+ |
1451
+ |
1452
+ |
1453
+ text | office scene
1454
+ blank |
1455
+ |
1456
+ |
1457
+ |
1458
+ text | input single-view scan recognized objects retrieved and posed models
1459
+ blank |
1460
+ |
1461
+ text | Figure 3.1: (Left) Given a single view scan of a 3D environment obtained using a
1462
+ | fast range scanner, the system performs scene understanding by recognizing repeated
1463
+ | objects, while factoring out their modes of variability (middle). The repeating ob-
1464
+ | jects have been learned beforehand as low-complexity models, along with their joint
1465
+ | deformations. The system extracts the objects despite a poor-quality input scan with
1466
+ | large missing parts and many outliers. The extracted parameters can then be used
1467
+ | to pose 3D models to create a plausible scene reconstruction (right).
1468
+ blank |
1469
+ text | or rotated, a lamp arm is bent and adjusted), such variability is limited and low-
1470
+ | dimensional (e.g., translational motion, hinge joint, telescopic joint). (iii) mutual
1471
+ | relationships among the basic objects satisfy strong priors (e.g., a chair stands on the
1472
+ | floor, a monitor rests on the table).
1473
+ | We present a simple yet practical system to acquire models of indoor objects such
1474
+ | as furniture, together with their variability modes, and discover object repetitions
1475
+ | and exploit them to speed up large-scale indoor acquisition towards high-level scene
1476
+ | understanding. Our algorithm works in two phases. First, in the learning phase, the
1477
+ | system starts from a few scans of individual objects to construct primitive-based 3D
1478
+ | models while explicitly recovering respective joint attributes and modes of variation.
1479
+ | Second, in the fast recognition phase (about 200ms/model), the system starts from a
1480
+ | single-view scan to segment and classify it into plausible objects, recognize them, and
1481
+ | extract the pose parameters for the low-complexity models generated in the learning
1482
+ | phase. Intuitively, our system uses priors for primitive types and their connections,
1483
+ | thus greatly reducing the number of unknowns to enable model fitting even from
1484
+ | very sparse and low-resolution datasets, while hierarchically associating subsets of
1485
+ | scans to parts of objects. We also demonstrate that simple inter- and intra-object
1486
+ | relations simplify segmentation and classification tasks necessary for high-level scene
1487
+ | understanding (see [MPWC12] and references therein).
1488
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 35
1489
+ blank |
1490
+ |
1491
+ |
1492
+ text | We tested our method on a range of challenging synthetic and real-world scenes.
1493
+ | We present, for the first time, basic scene reconstruction for massive indoor scenes
1494
+ | (e.g., office spaces, building auditoriums on a university campus) from unreliable
1495
+ | sparse data by exploiting the low-complexity variability of common scene objects. We
1496
+ | show how we can now detect meaningful changes in an environment. For example,
1497
+ | our system was able to discover a new object placed in an office space by rescanning the
1498
+ | scene, despite articulations and motions of the previously extant objects (e.g., desk,
1499
+ | chairs, monitors, lamps). Thus, the system factors out nuisance modes of variability
1500
+ | (e.g., motions of the chairs, etc.) from variability modes that have importance in an
1501
+ | application (e.g., security, where the new scene objects should be flagged).
1502
+ blank |
1503
+ |
1504
+ title | 3.1 Related Work
1505
+ blank |
1506
+ title | 3.1.1 Scanning Technology
1507
+ text | Rusinkiewicz et al. [RHHL02] demonstrated the possibility of real-time lightweight 3D
1508
+ | scanning. More generally, surface reconstruction from unorganized pointcloud data
1509
+ | has been extensively studied in computer graphics, computational geometry, and
1510
+ | computer vision (see [Dey07]). Further, powered by recent developments in real-time
1511
+ | range scanning, everyday users can now easily acquire 3D data at high frame-rates.
1512
+ | Researchers have proposed algorithms to accumulate multiple poor-quality individual
1513
+ | frames to obtain better quality pointclouds [MFO+ 07, HKH+ 12, IKH+ 11]. Our main
1514
+ | goal differed, however, because our system focused on recognizing important elements
1515
+ | and semantically understanding large 3D indoor environments.
1516
+ blank |
1517
+ |
1518
+ title | 3.1.2 Geometric Priors for Objects
1519
+ text | Our system utilizes geometry on the level of individual objects, which are possible
1520
+ | abstractions used by humans to understand the environment [MZL+ 09]. Similar to Xu
1521
+ | et al. [XLZ+ 10], we understand an object as a collection of primitive parts and segment
1522
+ | the object based on the prior. Such a prior can successfully fill regions of missing
1523
+ | parts [PMG+ 05], infer plausible part motions of mechanical assemblies [MYY+ 10],
1524
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 36
1525
+ blank |
1526
+ |
1527
+ |
1528
+ text | extract shape by deforming a template model to match silhouette images [XZZ+ 11],
1529
+ | locate an object from photographs [XS12], or semantically edit images based on simple
1530
+ | scene proxies [ZCC+ 12].
1531
+ | The system focuses on locating 3D deformable objects in unsegmented, noisy,
1532
+ | single-view data in a cluttered environment. Researchers have used non-rigid align-
1533
+ | ment to better align (warped) multiple scans [LAGP09]. Alternately, temporal infor-
1534
+ | mation across multiple frames can be used to track and recover a deformable model
1535
+ | with joints between rigid parts [CZ11]. Instead, our system learns an instance-specific
1536
+ | geometric prior as a collection of simple primitives along with deformation modes from
1537
+ | a very small number of scans. Note that the priors are extracted in the learning stage,
1538
+ | rather than being hard coded in the framework. We demonstrate that such models
1539
+ | are sufficiently representative to extract the essence of real-world indoor scenes (see
1540
+ | also concurrent efforts by Nan et al. [NXS12] and Shao et al. [SXZ+ 12]).
1541
+ blank |
1542
+ |
1543
+ title | 3.1.3 Scene Understanding
1544
+ text | In the context of image understanding, Lee et al. [LGHK10] constructed a box-
1545
+ | based reconstruction of indoor scenes using volumetric considerations, while Gupta
1546
+ | et al. [GEH10] applied geometric constraints and physical considerations to obtain a
1547
+ | block-based 3D scene model. In the context of range scans, there have been only a few
1548
+ | efforts: Triebel et al. [TSS10] presented an unsupervised algorithm to detect repeating
1549
+ | parts by clustering on pre-segmented input data, while Koppula et al. [KAJS11] used
1550
+ | a graphical model to learn features and contextual relations across objects. Earlier,
1551
+ | Schnabel et al. [SWWK08] detected features in large point clouds using constrained
1552
+ | graphs that describe configurations of basic shapes (e.g., planes, cylinders, etc.) and
1553
+ | then performed a graph matching, which cannot be directly used in large, cluttered
1554
+ | environments captured at low resolutions.
1555
+ | Various learning-based approaches have recently been proposed to analyze and
1556
+ | segment 3D geometry, especially towards consistent segmentation and part-label asso-
1557
+ | ciation [HKG11, SvKK+ 11]. While similar MRF or CRF optimization can be applied
1558
+ | in our settings, we found that a fully geometric algorithm can produce comparable
1559
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 37
1560
+ blank |
1561
+ |
1562
+ |
1563
+ text | high-quality recognition results without extensive training. In our setting, learning
1564
+ | amounts to recovering the appropriate deformation model for the scanned model
1565
+ | in terms of arrangement of primitives and their connection types. While most
1566
+ | machine-learning approaches are restricted to local features and limited viewpoints,
1567
+ | our geometric approach successfully handles the variability of objects and utilizes
1568
+ | extracted high-level information.
1569
+ blank |
1570
+ text | [Figure 3.2 diagram: in the learning phase, scans I11 , I12 , I13 , ... yield model M1 and
+ | scans I21 , I22 , I23 , ... yield M2 ; in the recognition phase, a scene scan S yields
+ | objects o1 , o2 , ...]
1582
+ blank |
1583
+ text | Figure 3.2: Our algorithm consists of two main phases: (i) a relatively slow learn-
1584
+ | ing phase to acquire object models as collections of interconnected primitives and their
+ | joint properties, and (ii) a fast object recognition phase that takes an average of
1586
+ | 200 ms/model.
1587
+ blank |
1588
+ |
1589
+ |
1590
+ |
1591
+ title | 3.2 Overview
1592
+ text | Our framework works in two main phases: a learning phase and a recognition phase
1593
+ | (see Figure 3.2).
1594
+ | In the learning phase, our system scans each object of interest a few times (typi-
1595
+ | cally 5-10 scans across different poses). The goal is to consistently segment the scans
1596
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 38
1597
+ blank |
1598
+ |
1599
+ |
1600
+ text | into parts as well as identify the junction between part-pairs to recover the respective
1601
+ | junction attributes. Such a goal, however, is challenging given the input quality. We
1602
+ | address the problem using two scene characteristics: (i) many man-made objects are
1603
+ | well approximated by a collection of simple primitives (e.g., planes, boxes, cylinders)
1604
+ | and (ii) the types of junctions between such primitives are limited (e.g., hinge, trans-
1605
+ | lational) and of low-complexity. First, our system recovers a set of stable primitives
1606
+ | for each individual scan. Then, for each object, the system collectively processes
1607
+ | the scans to extract a primitive-based proxy representation along with the necessary
1608
+ | inter-part junction attributes to build a collection of models {M1 , M2 , . . . }.
1609
+ | In the recognition phase, the system starts with a single scan S of the scene.
1610
+ | First, the system extracts the dominant planes in the scene – typically they capture
1611
+ | the ground, walls, desks, etc. The system identifies the ground plane by using the
1612
+ | (approximate) up-vector from the acquisition device and noting that the points lie
1613
+ | above the ground. Planes parallel to the ground are tagged as tabletops if they are at
1614
+ | heights as observed in the training phase (typically 1′ -3′ ) by exploiting the fact that
1615
+ | working surfaces have similar heights across rooms. The system removes the points
1616
+ | associated with the ground plane and the candidate tabletop, and performs connected
1617
+ | component analysis on the remaining points (on a kn -nearest neighbor graph) to
1618
+ | extract pointsets {o1 , o2 , . . . }.
1619
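+ blank |
+ text | A sketch of this connected-component step on a kn-nearest-neighbor graph using SciPy;
+ | kn = 50 follows Table 3.1, and the distance cutoff that keeps far-apart neighbors from
+ | being linked is an illustrative addition.
+ blank |
+ | import numpy as np
+ | from scipy.spatial import cKDTree
+ | from scipy.sparse import csr_matrix
+ | from scipy.sparse.csgraph import connected_components
+ |
+ | def extract_pointsets(points, kn=50, max_dist=0.2):
+ |     # Group the remaining points into candidate objects o1, o2, ... via connected
+ |     # components on a k-nearest-neighbor graph.
+ |     tree = cKDTree(points)
+ |     dist, idx = tree.query(points, k=kn + 1)       # first neighbor is the point itself
+ |     rows = np.repeat(np.arange(len(points)), kn)
+ |     cols = idx[:, 1:].ravel()
+ |     mask = dist[:, 1:].ravel() < max_dist
+ |     graph = csr_matrix((np.ones(mask.sum()), (rows[mask], cols[mask])),
+ |                        shape=(len(points), len(points)))
+ |     n_comp, labels = connected_components(graph, directed=False)
+ |     return [points[labels == c] for c in range(n_comp)]
+ blank |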
+ | The system tests if each pointset oi can be satisfactorily explained by any of the
1620
+ | object models Mj . Note, however, that this step is difficult since the data is unreliable
1621
+ | and the objects can have large geometric variations due to changes in the position
1622
+ | and pose of objects. The system performs hierarchical matching which uses the
1623
+ | learned geometry, while trying to match individual parts first, and exploits simple
1624
+ | scene priors like (i) placement relations (e.g., monitors are placed on desks, chairs
1625
+ | rest on the ground) and (ii) allowable repetition modes (e.g., monitors usually repeat
1626
+ | horizontally, chairs are repeated on the ground). We assume such priors are available
1627
+ | as domain knowledge (e.g., Fisher et al. [FSH11]).
1628
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 39
1629
+ blank |
1630
+ |
1631
+ |
1632
+ |
1633
+ text | [Figure 3.3 hierarchy: points I  ->  super-points X = {x1 , x2 , ...}  ->  parts
+ | P = {p1 , p2 , ...}  ->  objects O = {o1 , o2 , ...}]
1635
+ blank |
1636
+ |
1637
+ text | Figure 3.3: Unstructured input point cloud is processed into hierarchical data struc-
1638
+ | ture composed of super-points, parts, and objects.
1639
+ blank |
1640
+ title | 3.2.1 Models
1641
+ text | Our system represents the objects of interest as models that approximate the object
1642
+ | shapes while encoding deformation and relationship information (see also [OLGM11]).
1643
+ | Each model can be thought of as a graph structure, the nodes of which denote the
1644
+ | primitives and the edges of which encode the nodes’ connectivity and relationship
1645
+ | to the environment. Currently, the primitive types are limited to box, cylinder, and
1646
+ | radial structure. A box is used to represent a large flat structure; a cylinder is used to
1647
+ | represent a long and narrow structure; and a radial structure is used to capture parts
1648
+ | with discrete rotational symmetry (e.g., the base of a swivel chair). As an additional
1649
+ | regularization, the system groups parallel cylinders of similar lengths (e.g., legs of
1650
+ | a desk or arms of a chair), which in turn provides valuable cues for possible mirror
1651
+ | symmetries.
1652
+ | The connectivity between a pair of primitives is represented as their transfor-
1653
+ | mation relative to each other and their possible deformations. Our current imple-
1654
+ | mentation restricts deformations to be 1-DOF translation, 1-DOF rotation, and an
1655
+ | attachment. The system tests for translational joints for the cylinders and rotational
1656
+ | joints for cylinders or boxes (e.g., a hinge joint). An attachment represents the ex-
1657
+ | istence of a whole primitive node and is especially useful when, depending on the
1658
+ | configuration, the segmentation of the primitive is ambiguous. For example, the ge-
1659
+ | ometry of doors or drawers of cabinets is not easily segmented when they are closed,
1660
+ | and thus they are handled as an attachment when opened.
1661
+ | Additionally, the system detects contact information for the model, i.e., whether
1662
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 40
1663
+ blank |
1664
+ |
1665
+ |
1666
+ text | the object rests on the ground or on a desk. Note that the system assumes that the
1667
+ | vertical direction is known for the scene. Both the direction of the model and the
1668
+ | direction of the ground define a canonical object transformation.
1669
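+ blank |
+ text | One way the model graph described above could be laid out as a data structure; the
+ | following Python dataclasses are an illustrative layout (all names are assumptions, not
+ | the authors' implementation), with primitives as nodes and joints as edges carrying a
+ | relative transform and a 1-DOF deformation or attachment type.
+ blank |
+ | from dataclasses import dataclass, field
+ | from enum import Enum
+ | from typing import List, Optional
+ |
+ | class PrimitiveType(Enum):
+ |     BOX = "box"
+ |     CYLINDER = "cylinder"
+ |     RADIAL = "radial"            # discrete rotational symmetry (e.g., swivel-chair base)
+ |
+ | class JointType(Enum):
+ |     TRANSLATION_1DOF = "translation"
+ |     ROTATION_1DOF = "rotation"   # hinge
+ |     ATTACHMENT = "attachment"    # whole primitive present/absent (e.g., open drawer)
+ |
+ | @dataclass
+ | class Primitive:                 # node of the model graph
+ |     kind: PrimitiveType
+ |     params: dict                 # e.g., box extents, cylinder radius and length
+ |
+ | @dataclass
+ | class Joint:                     # edge: relative transform plus allowed deformation
+ |     a: int                       # indices of the connected primitives
+ |     b: int
+ |     kind: JointType
+ |     rest_transform: list         # 4x4 transform of b relative to a (nested lists)
+ |     limits: Optional[tuple] = None
+ |
+ | @dataclass
+ | class Model:
+ |     primitives: List[Primitive] = field(default_factory=list)
+ |     joints: List[Joint] = field(default_factory=list)
+ |     rests_on: str = "ground"     # contact prior: "ground" or "desk"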
+ blank |
1670
+ |
1671
+ title | 3.2.2 Hierarchical Structure
1672
+ text | For both the learning and recognition phases, the raw input is an unstructured point
+ | cloud. The input is hierarchically organized by considering neighboring points and
+ | assigning contextual information to each hierarchy level. The scene hierarchy has three
1675
+ | levels of segmentation (see Figure 3.3):
1676
+ blank |
1677
+ text | • super-points X = {x1 , x2 , ...};
1678
+ | • parts P = {p1 , p2 , ...} (association Xp = {x : P (x) = p}); and
1679
+ | • objects O = {o1 , o2 , ...} (association Po = {p : O(p) = o}).
1680
+ blank |
1681
+ text | Instead of working directly on individual points, our system uses super-points
1682
+ | x ∈ X as the atomic entities (analogous to super-pixels in images). The system
1683
+ | creates super-points by uniformly sampling points from the raw measurements and
1684
+ | associating local neighborhoods with the samples based on the normal consistency
1685
+ | of points. Such super-points, or a group of points within a small neighborhood, are
1686
+ | less noisy, while at the same time they are sufficiently small to capture the input
1687
+ | distribution of points.
1688
+ | Next, our system aggregates neighboring super-points into primitive parts p ∈ P .
1689
+ | Such parts are expected to relate to individual primitives of models. Each part p
1690
+ | comprises a set of superpoints Xp . The system initially finds such parts by merging
1691
+ | neighboring super-points until the region can no longer be approximated by a plane
1692
+ | (in a least squares sense) with average error less than a threshold θdist . Note that the
1693
+ | initial association of super-points with parts can change later.
1694
+ | Objects form the final hierarchy level during the recognition phase for scenes con-
1695
+ | taining multiple objects. Objects, having been segmented, are mapped to individ-
1696
+ | ual instances of models, while the association between objects and parts (O(p) ∈
1697
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 41
1698
+ blank |
1699
+ |
1700
+ |
1701
+ text | {1, 2, · · · , No } and Po ) are discovered during the recognition process. Note that dur-
1702
+ | ing the learning phase the system deals with only one object at a time and hence
1703
+ | such segmentation is trivial.
1704
+ | The system creates such a hierarchy in the pre-processing stage using the following
1705
+ | parameters in all our tests: number of nearest neighbor kn used for normal estimation,
1706
+ | sampling rate fs for super-points, and distance threshold θdist , which reflects the
1707
+ | approximate noise level. Table 3.1 shows the actual values.
1708
+ blank |
1709
+ text | param.     value    usage
+ |  kn         50       number of nearest neighbor
+ |  fs         1/100    sampling rate
+ |  θdist      0.1m     distance threshold for segmentation
+ |  Ñp         10-20    Equation 3.1
+ |  θheight    0.5      Equation 3.5
+ |  θnormal    20◦      Equation 3.6
+ |  θsize      2θdist   Equation 3.7
+ |  λ          0.8      coverage ratio to declare a match
1718
+ blank |
1719
+ text | Table 3.1: Parameters used in our algorithm.
1720
+ blank |
1721
+ |
1722
+ |
1723
+ |
1724
+ title | 3.3 Learning Phase
1725
+ text | The input to the learning phase is a set of point clouds {I 1 , . . . , I n } obtained from
1726
+ | the same object in different configurations. Our goal is to build a model M consisting
1727
+ | of primitives that are linked by joints. Essentially, the system has to simultaneously
1728
+ | segment the scans into an unknown number of parts, establish correspondence across
1729
+ | different measurements, and extract relative deformations. We simplify the problem
1730
+ | by assuming that each part can be represented by primitives and that each joint
1731
+ | can be encoded with a simple degree of freedom (see also [CZ11]). This assumption
1732
+ | allows us to approximate many man-made objects, while at the same time it leads to
1733
+ | a lightweight model. Note that, unlike Schnabel et al. [SWWK08], who use patches
1734
+ | of partial primitives, our system uses full primitives to represent parts in the learning
1735
+ | phase.
1736
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 42
1737
+ blank |
1738
+ |
1739
+ |
1740
+ text | [Figure 3.4 diagram: two pipelines — "Initialize the skeleton (Sec. 3.3.1)": mark stable parts/part-groups → match marked parts → jointly fit primitives to matched parts → update parts; "Incrementally complete the coherent model (Sec. 3.3.2)": match parts by relative position → jointly fit primitives to matched parts → update parts.]
+ blank |
1756
+ text | Figure 3.4: The learning phase starts by initializing the skeleton model, which is
1757
+ | defined from coherent matches of stable parts. After initialization, new primitives are
1758
+ | added by finding groups of parts at similar relative locations, and then the primitives
1759
+ | are jointly fitted.
1760
+ blank |
1761
+ text | The learning phase starts by detecting large and stable parts to establish a global
1762
+ | reference frame across different measurements I i (Section 3.3.1). The initial corre-
1763
+ | spondences serve as a skeleton of the model, while other parts are incrementally added
1764
+ | to the model until all of the points are covered within threshold θdist (Section 3.3.2).
1765
+ | While primitive fitting is unstable over isolated noisy scans, our system jointly refines
1766
+ | the primitives to construct a coherent model M (see Figure 3.4).
1767
+ | The final model also contains attributes necessary for robust matching. For ex-
1768
+ | ample, the distribution of height from the ground plane provides a prior for tables;
1769
+ | objects can have a preferred repetition direction, e.g., monitors or auditorium chairs
1770
+ | are typically repeated sidewise; or objects can have preferred orientations. These
1771
+ | learned attributes and relationships act as reliable regularizers in the recognition
1772
+ | phase, when data is typically sparse, incomplete, and noisy.
1773
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 43
1774
+ blank |
1775
+ |
1776
+ |
1777
+ title | 3.3.1 Initializing the Skeleton of the Model
1778
+ text | The initial structure is derived from large, stable parts across different measurements,
1779
+ | whose consistent correspondences define the reference frame that aligns the measure-
1780
+ | ments. In the pre-processing stage, individual scans I i are divided into super-points
1781
+ | X i and parts P i , as described in Section 3.2.2. The system then marks the stable
1782
+ | parts as candidate boxes or candidate cylinders.
1783
+ | A candidate face of a box is marked by finding parts with a sufficient number of
1784
+ | super-points:
1785
+ | |Xp | > |P|/Ñp , (3.1)
1786
+ blank |
1787
+ text | where Ñp is a user-defined parameter of the approximate number of primitives in the
1788
+ | model. In our tests, a threshold of 10-20 is used. Parallel planes with comparable
1789
+ | heights are grouped together based on their orientation to constitute the opposite
1790
+ | faces of a box primitive.
1791
+ | The system classifies a part as a candidate cylinder if the ratio of the top two
1792
+ | principal components is greater than 2. Subsequently, parallel cylinders with similar
1793
+ | heights (e.g., legs of chairs) are grouped.
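+ blank |
+ | As a small sketch of this test (ratio of the top two principal components greater than 2),
+ | assuming the part is given as an (n, 3) NumPy array of points; whether the ratio is taken
+ | on the covariance eigenvalues or on their square roots is not specified in the text, so
+ | eigenvalues are used here:
+ |
+ | import numpy as np
+ |
+ | def is_cylinder_candidate(points, ratio=2.0):
+ |     # Elongated parts (e.g. chair legs) have one dominant principal direction.
+ |     centered = points - points.mean(axis=0)
+ |     eigvals = np.linalg.eigvalsh(centered.T @ centered / len(points))
+ |     return eigvals[-1] / max(eigvals[-2], 1e-12) > ratio
+ blank |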
1794
+ | After candidate boxes and cylinders are marked, the system matches the marked
1795
+ | (sometimes grouped) parts for pairs of measurements P i . The system only uses the
1796
+ | consistent matches to define a reference frame between measurements and jointly fit
1797
+ | primitives to the matched parts (see Section 3.3.2).
1798
+ blank |
1799
+ title | Matching
1800
+ blank |
1801
+ text | After extracting the stable parts P i for each measurement, our goal is to match the
1802
+ | parts across different measurements to build a connectivity structure. The system
1803
+ | picks a seed measurement j ∈ {1, 2, ..., n} at random and compares every other mea-
1804
+ | surement against the seed measurement.
1805
+ | Our system then uses spectral correspondences [LH05] to match parts in seed
1806
+ | {p, q} ∈ P j and other {p′ , q ′ } ∈ P i . The system builds an affinity matrix A, where
1807
+ | each entry represents the matching score between part pairs. Recall that candidate
1808
+ | parts p have associated types (box or cylinder), say t(p). Intuitively, the system
1809
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 44
1810
+ blank |
1811
+ |
1812
+ |
1813
+ text | assigns a higher matching score for the parts with the same type t(p) at similar
1814
+ | relative positions. If a candidate assignment a = (p, p′ ) assigns p ∈ P j to p′ ∈ P i , the
1815
+ | corresponding entries are defined as follows:
1816
+ |     A(a, a) = exp(−(hp − hp′)² / (2 θdist²))  if t(p) = t(p′), and 0 otherwise,     (3.2)
1822
+ blank |
1823
+ text | where our system uses the height from the ground hp as a feature. The affinity value
1824
+ | for a pair-wise assignment between a = (p, p′ ) and b = (q, q ′ ) (p, q ∈ P j and p′ , q ′ ∈ P i )
1825
+ | is defined as:
1826
+ |     A(a, b) = exp(−(d(p, q) − d(p′, q′))² / (2 θdist²))  if t(p) = t(p′) and t(q) = t(q′),
+ |     and 0 otherwise,     (3.3)
1834
+ blank |
1835
+ |
1836
+ |
1837
+ text | where d(p, q) represents the distance between two parts p, q ∈ P . The system ex-
1838
+ | tracts the most dominant eigenvector of A to establish a correspondence among the
1839
+ | candidate parts.
1840
+ | After comparing the seed measurement P j against all the other measurements P i ,
1841
+ | the system retains only those matches that are consistent across different measure-
1842
+ | ments. The relative positions of the matched parts define the reference frame of the
1843
+ | object as well as the relative transformation between measurements.
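+ blank |
+ | A compact sketch of this spectral step (Equations 3.2 and 3.3 followed by the dominant
+ | eigenvector, in the spirit of [LH05]). The part containers (type t, height h, centroid c)
+ | and the greedy one-to-one discretization at the end are assumptions, not necessarily the
+ | thesis' exact choices.
+ |
+ | import numpy as np
+ | from itertools import product
+ |
+ | def spectral_match(parts_j, parts_i, theta_dist):
+ |     # Candidate assignments pair parts of the same primitive type only.
+ |     cands = [(p, q) for p, q in product(range(len(parts_j)), range(len(parts_i)))
+ |              if parts_j[p].t == parts_i[q].t]
+ |     A = np.zeros((len(cands), len(cands)))
+ |     for x, (p, p_) in enumerate(cands):
+ |         A[x, x] = np.exp(-(parts_j[p].h - parts_i[p_].h) ** 2 / (2 * theta_dist ** 2))
+ |         for y, (q, q_) in enumerate(cands):
+ |             if x == y or p == q or p_ == q_:
+ |                 continue
+ |             d = np.linalg.norm(parts_j[p].c - parts_j[q].c)
+ |             d_ = np.linalg.norm(parts_i[p_].c - parts_i[q_].c)
+ |             A[x, y] = np.exp(-(d - d_) ** 2 / (2 * theta_dist ** 2))
+ |     v = np.ones(len(cands))
+ |     for _ in range(100):                      # power iteration: dominant eigenvector
+ |         v = A @ v
+ |         v /= np.linalg.norm(v) + 1e-12
+ |     matches, used_p, used_q = [], set(), set()
+ |     for x in np.argsort(-v):                  # greedy one-to-one discretization
+ |         p, p_ = cands[x]
+ |         if p not in used_p and p_ not in used_q:
+ |             matches.append((p, p_))
+ |             used_p.add(p)
+ |             used_q.add(p_)
+ |     return matches
+ blank |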
1844
+ blank |
1845
+ title | Joint Primitive Fitting
1846
+ blank |
1847
+ text | Our system jointly fits primitives to the grouped parts, while adding necessary defor-
1848
+ | mation. First, the primitive type is fixed by testing for the three types of primitives
1849
+ | (box, cylinder, and rotational structure) and picking the primitive with the smallest
1850
+ | fitting error. Once the primitive type is fixed, the corresponding primitives from other
1851
+ | measurements are averaged and added to the model as a jointly fitted primitive.
1852
+ | Our system uses the coordinate frame to position the fitted primitives. More
1853
+ | specifically, the three orthogonal directions of a box are defined by the frame of
1854
+ | reference defined by the ground direction and the relative positions of the matched
1855
+ | parts. If the normal of the largest observed face does not align with the default frame
1856
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 45
1857
+ blank |
1858
+ |
1859
+ |
1860
+ text | of reference, the box is rotated around an axis to align the large plane. The cylinder
1861
+ | is aligned using its axis, while the rotational primitive is tested when the part is at
1862
+ | the bottom of an object.
1863
+ | Note that unlike a cylinder or a rotational structure, a box can introduce new
1864
+ | faces that are invisible because of the placement rules of objects. For example, the
1865
+ | bottom of a chair seat or the back of a monitor are often missing in the input scans.
1866
+ | Hence, the system retains the information about which of the six faces are visible to
1867
+ | simplify the subsequent recognition phase.
1868
+ | Our system now encodes the inter-primitive connectivity as an edge of the graph
1869
+ | structure. The joints between primitives are added by comparing the relationship
1870
+ | between the parent and child primitives. The first matched primitive acts as a root
1871
+ | to the model graph. Subsequent primitives are the children of the closest primitive
1872
+ | among those already existing in the model. A translational joint is added if the size
1873
+ | of the primitive node varies over different measurements by more than θdist ; or, a
1874
+ | rotational joint is added when the relative angle between the parent and child node
1875
+ | differs by more than 20◦ .
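+ blank |
+ | The joint rule in the last sentence reduces to a small decision; sizes and angles are
+ | assumed per-measurement observations of the child primitive relative to its parent:
+ |
+ | import numpy as np
+ |
+ | def joint_type(sizes, angles_deg, theta_dist, theta_angle=20.0):
+ |     if np.ptp(sizes) > theta_dist:            # size varies across measurements
+ |         return "translational"
+ |     if np.ptp(angles_deg) > theta_angle:      # relative angle varies across measurements
+ |         return "rotational"
+ |     return "rigid"
+ blank |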
1876
+ blank |
1877
+ |
1878
+ title | 3.3.2 Incrementally Completing a Coherent Model
1879
+ text | Having built an initial model structure, the system incrementally adds primitives by
1880
+ | processing super-points that could not be explained by the primitives. The remaining
1881
+ | super-points are processed to create parts, and the parts are matched based on their
1882
+ | relative positions. Starting from the bottom-most matches, the system jointly fits
1883
+ | primitives to the matched parts, as described above. The system iterates the process
1884
+ | until all super-points in measurements are explained by the model.
1885
+ | If some parts exist in only a subset of the measurements, then the
+ | system adds them as an attachment to the primitive. For example, in Figure 3.5, after each
1887
+ | side of the rectangular shape of a drawer has been matched, the open drawer is added
1888
+ | as an attachment to the base shape.
1889
+ | The system also maintains the contact point of a model to the ground (or the
1890
+ | bottom-most primitive), the height distribution of each part as a histogram, visible face
1891
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 46
1892
+ blank |
1893
+ |
1894
+ |
1895
+ |
1896
+ text | open drawers
1897
+ blank |
1898
+ |
1899
+ |
1900
+ |
1901
+ text | unmatched parts
1902
+ blank |
1903
+ text | Figure 3.5: The open drawers remain unmatched (grey) after incremental matching
+ | and joint primitive fitting. These parts will be added as an attachment to the model.
1905
+ blank |
1906
+ text | information, and the canonical frame of reference defined during the matching process.
1907
+ | This information, along with the extracted models, is used during the recognition
1908
+ | phase.
1909
+ blank |
1910
+ |
1911
+ title | 3.4 Recognition Phase
1912
+ text | Having learned a set of models (along with their deformation modes) M := {M1 , . . . , Mk }
1913
+ | for a particular environment, the system can quickly collect and understand the envi-
1914
+ | ronment in the recognition phase. This phase is much faster than the learning phase
1915
+ | since there are only a small number of simple primitives and certain deformation
1916
+ | modes from which to search. As an input, the scene S containing the learned models
1917
+ | is collected using the framework from Engelhard et al. [EEH+ 11] which takes a few
1918
+ | seconds. In a pre-processing stage, the system marks the most dominant plane as the
1919
+ | ground plane g. Then, the second most dominant plane that is parallel to the ground
1920
+ | plane is marked as the desk plane d. The system processes the remaining points to
1921
+ | form a hierarchical structure with super-points, parts, and objects (see Section 3.2.2).
1922
+ | The recognition phase starts from a part-based assignment, which quickly com-
1923
+ | pares parts in the measurement and primitive nodes in each model. The algorithm
1924
+ | infers deformation and transformation of the model from the matched parts, while
1925
+ | filtering valid matches by comparing the actual measurement against the underlying
1926
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 47
1927
+ blank |
1928
+ |
1929
+ |
1930
+ text | [Figure 3.6 diagram: top panel "Initial assignments for parts (Sec. 3.4.1)", matching scene parts {p1, p2, ...} ∈ oi of the scene S to model nodes {m1, m2, m3, l1, α3} ∈ M (e.g., p1 = m3) with rotational and translational joints and a ground contact g; bottom panel "Refined assignment with geometry (Sec. 3.4.2)", iterating between solving for deformation given matches and finding correspondence and segmentation.]
+ blank |
1953
+ text | Figure 3.6: Overview of the recognition phase. The algorithm first finds matched parts
1954
+ | before proceeding to recover the entire model and its corresponding segmentation.
1955
+ blank |
1956
+ text | geometry. If a sufficient portion of measurements can be explained by the model,
1957
+ | the system accepts the match as valid, and the segmentation of both object-level and
1958
+ | part-level is refined to match the model.
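+ blank |
+ | The text does not state how the dominant planes are extracted; a standard RANSAC
+ | sketch, given here only as one plausible realization, is:
+ |
+ | import numpy as np
+ |
+ | def dominant_plane(points, inlier_thresh, iters=500, rng=None):
+ |     # Return (normal, offset, inlier mask) of the plane supported by most points.
+ |     rng = np.random.default_rng(0) if rng is None else rng
+ |     best = (None, None, np.zeros(len(points), dtype=bool))
+ |     for _ in range(iters):
+ |         p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
+ |         n = np.cross(p1 - p0, p2 - p0)
+ |         if np.linalg.norm(n) < 1e-9:
+ |             continue                          # degenerate sample
+ |         n = n / np.linalg.norm(n)
+ |         inliers = np.abs((points - p0) @ n) < inlier_thresh
+ |         if inliers.sum() > best[2].sum():
+ |             best = (n, -n @ p0, inliers)
+ |     return best
+ |
+ | The ground plane g would be the plane with the most inliers; the desk plane d would then
+ | be found by repeating the search on the remaining points, restricted to candidates whose
+ | normal is nearly parallel to that of g.
+ blank |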
1959
+ blank |
1960
+ |
1961
+ title | 3.4.1 Initial Assignment for Parts
1962
+ text | Our system first makes coarse assignments between segmented parts and model nodes
1963
+ | to quickly reduce the search space (see Figure 3.6, top). If a part and a primitive node
1964
+ | form a potential match, the system also induces the relative transformation between
1965
+ | them. The output of the algorithm is a list of triplets composed of part, node from
1966
+ | the model, and transformation groups {(p, m, T )}.
1967
+ | Our system uses geometric features to decide whether individual parts can be
1968
+ | matched with model nodes. Note that the system does not use color information in
1969
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 48
1970
+ blank |
1971
+ |
1972
+ |
1973
+ text | our setting. As features for individual parts Ap , our system considers the following:
1974
+ | (i) height distribution from ground plane as a histogram vector hp ; (ii) three principal
1975
+ | components of the region x1p , x2p , x3p (x3p = np ); and (iii) sizes along the directions
1976
+ | lp1 > lp2 > lp3 .
1977
+ | Similarly, the system infers the counterpart of features for individual visible faces
1978
+ | of model parts Am . Thus, even if only one face of a part is visible in the measurement,
1979
+ | our system is still able to detect the matched part of the model. The height histogram
1980
+ | hm is calculated from the relative area per height interval and the dimensions and
1981
+ | principal components are inferred from the shape of the faces.
1982
+ | All the parts are compared against all the faces of primitive nodes in the model:
1983
+ blank |
1984
+ text |     E(Ap, Am) = ψ^height(hp, hm) · ψ^normal(np, nm; g) · ψ^size({lp¹, lp²}, {lm¹, lm²}).     (3.4)
1988
+ blank |
1989
+ text | Each potential function ψ returns either 1 (matched) or 0 (not matched) de-
+ | pending on whether a feature satisfies its criterion within an allowable threshold. Parts are
+ | matched only if all the feature criteria are satisfied. The height potential calculates
1992
+ | the histogram intersection
1993
+ |     ψ^height(hp, hm) = Σi min(hp(i), hm(i)) > θheight.     (3.5)
1996
+ blank |
1997
+ |
1998
+ text | The normal potential calculates the relative angle with the ground plane normal (ng )
1999
+ | as
2000
+ |     ψ^normal(np, nm; g) = |acos(np · ng) − acos(nm · ng)| < θnormal.     (3.6)
2001
+ blank |
2002
+ text | The size potential compares the size of the part
2003
+ blank |
2004
+ text |     ψ^size({lp¹, lp²}, {lm¹, lm²}) = |lp¹ − lm¹| < θsize and |lp² − lm²| < θsize.     (3.7)
2011
+ blank |
2012
+ text | Our system sets the threshold generously to allow false positives and retain multiple
2013
+ | (or no) matched parts per object (see Table 3.1). In effect, the system first guesses
2014
+ | potential object-model associations and later prunes out the incorrect associations
2015
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 49
2016
+ blank |
2017
+ |
2018
+ |
2019
+ text | in the refinement step using the full geometry (see Section 3.4.2). If Equation 3.4
2020
+ | returns 1, then the system can obtain a good estimate of the relative transformation
2021
+ | T between the model and the part by using the position, normal, and the ground
2022
+ | plane direction to create a triplet (p, m, T ).
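+ blank |
+ | Equations 3.4-3.7 translate almost directly into code. The feature containers below
+ | (height histogram h, unit normal n, extents l = (l1, l2)) are assumed, and the thresholds
+ | are those of Table 3.1; this is a sketch, not the thesis implementation.
+ |
+ | import numpy as np
+ |
+ | def psi_height(hp, hm, theta_height):              # Eq. 3.5: histogram intersection
+ |     return np.minimum(hp, hm).sum() > theta_height
+ |
+ | def psi_normal(n_p, n_m, n_g, theta_normal_deg):   # Eq. 3.6: angle to ground normal
+ |     a_p = np.degrees(np.arccos(np.clip(n_p @ n_g, -1.0, 1.0)))
+ |     a_m = np.degrees(np.arccos(np.clip(n_m @ n_g, -1.0, 1.0)))
+ |     return abs(a_p - a_m) < theta_normal_deg
+ |
+ | def psi_size(lp, lm, theta_size):                  # Eq. 3.7: compare the two largest extents
+ |     return abs(lp[0] - lm[0]) < theta_size and abs(lp[1] - lm[1]) < theta_size
+ |
+ | def part_face_match(part, face, n_g, th):          # Eq. 3.4: product of binary potentials
+ |     return int(psi_height(part.h, face.h, th["height"])
+ |                and psi_normal(part.n, face.n, n_g, th["normal"])
+ |                and psi_size(part.l, face.l, th["size"]))
+ blank |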
2023
+ blank |
2024
+ |
2025
+ title | 3.4.2 Refined Assignment with Geometry
2026
+ text | Starting from the list of part, node, and transformation triplets {(p, m, T )}, the sys-
2027
+ | tem verifies the assignments with a full model by comparing a segmented object
2028
+ | o = O(p) against models Mi . The goal is to produce accurate part assignments for
2029
+ | observable parts, transformation, and the deformation parameters. Intuitively, the
2030
+ | system finds a local minimum from the suggested starting point (p, m, T ) with the
2031
+ | help of the models extracted in the learning phase. The system then optimizes by
2032
+ | alternately refining the model pose and updating the segmentation (see Figure 3.6,
2033
+ | bottom).
2034
+ | Given the assignment between p and m, the system first refines the registration and
2035
+ | deformation parameters and places the model M to best explain the measurements.
2036
+ | If the placed model covers most of the points that belong to the object (ratio λ = 0.8
2037
+ | in our tests) within the distance threshold θdist , then the system confirms that the
2038
+ | model is matched to the object. Note that, compared to the generous threshold in
2039
+ | part-matching in Section 3.4.1, the system now sets a conservative threshold to prune
2040
+ | false-positives.
2041
+ | In the case of a match, the geometry is fixed and the system refines the segmen-
2042
+ | tation, i.e., the part and object boundaries are modified to match the underlying
2043
+ | geometry. The process is iterated until convergence.
2044
+ blank |
2045
+ title | Refining Deformation and Registration
2046
+ blank |
2047
+ text | Our system finds the deformation parameters using the relative location and orien-
2048
+ | tation of parts and the contact plane (e.g., desk top, the ground plane). Given any
2049
+ | pair of parts, or a part and the ground plane, their mutual distance and orientation
2050
+ | are formulated as functions of the deformation parameters that lie along the path between
2051
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 50
2052
+ blank |
2053
+ |
2054
+ |
2055
+ text | [Figure 3.7 panels: input points — models matched — parts assigned — initial objects — refined objects.]
2061
+ blank |
2062
+ text | Figure 3.7: The initial object-level segmentation can be imperfect especially between
2063
+ | distant parts. For example, the top and base of a chair initially appeared to be sep-
2064
+ | arate objects, but were eventually understood as the same object after the segments
2065
+ | were refined based on the geometry of the matched model.
2066
+ blank |
2067
+ text | the two parts. For example, if our system starts from matched part-primitive pair p1
2068
+ | and m3 in Figure 3.6, then the height and the normal of the part can be expressed as
2069
+ | function of the deformation parameters l1 and α3 of the model. The system solves a
2070
+ | set of linear equations given for the observed parts and the contact location to solve
2071
+ | for the deformation parameters. Then, the registration between the scan and the
2072
+ | deformed model is refined by Iterative Closest Point (ICP) [BM92].
2073
+ | Ideally, part p in the scene measurement should be explained by the assigned
2074
+ | part geometry within the distance threshold θdist . The model is matched to the
2075
+ | measurement if the proportion of points within θdist is more than λ. (Note that not
2076
+ | all faces of the part need to be explained by the region measurement as only a subset
2077
+ | of the model is measured by the sensor.) Otherwise, the triplet (p, m, T ) is an invalid
2078
+ | assignment and the algorithm returns false. After initial matching (Section 3.4.1),
2079
+ | multiple parts of an object can match to different primitives of many models. If there
2080
+ | are multiple successful matches for an object, the system retains the assignment with
2081
+ | the largest number of points.
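+ blank |
+ | The λ coverage test can be sketched as a nearest-neighbour query against points sampled
+ | from the posed model surface; scipy's cKDTree is used here only for brevity and is an
+ | assumption, not the thesis implementation.
+ |
+ | import numpy as np
+ | from scipy.spatial import cKDTree
+ |
+ | def is_valid_match(object_points, posed_model_samples, theta_dist, lam=0.8):
+ |     # Accept the (p, m, T) hypothesis if at least a fraction lam of the object's
+ |     # points lie within theta_dist of the posed model geometry.
+ |     dists, _ = cKDTree(posed_model_samples).query(object_points)
+ |     coverage = float(np.mean(dists < theta_dist))
+ |     return coverage >= lam, coverage
+ blank |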
2082
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 51
2083
+ blank |
2084
+ |
2085
+ |
2086
+ title | Refine Segmentation
2087
+ blank |
2088
+ text | After a model is picked and positioned in the configuration, its location is fixed
2089
+ | while the system refines the segmentation based on the underlying model. Recall
2090
+ | that the initial segmentation into parts P merges super-points with similar normals and
2091
+ | objects O group neighboring parts using the distance threshold. Although the initial
2092
+ | segmentations provide a sufficient approximation to roughly locate the models, they
2093
+ | do not necessarily coincide with the actual part and object boundaries without being
2094
+ | compared against the geometry.
2095
+ | First, the system updates the association between super-points and the parts by
2096
+ | finding the closest primitive node of the model for each super-point. The super-points
2097
+ | that belong to the same model node are grouped to the same part (see Figure 3.7).
2098
+ | In contrast, super-points that are farther away than the distance threshold θdist from
2099
+ | any of the primitives are separated to form a new segment with a null assignment.
2100
+ | After the part assignment, the system searches for the missing primitives by merg-
2101
+ | ing neighboring objects (see Figure 3.7). In the initial segmentation, objects which
2102
+ | are close to each other in the scene can lead to multiple objects grouped into a sin-
2103
+ | gle segment. Further, particular viewpoints of an object can cause parts within the
2104
+ | model to appear farther apart, leading to spurious multiple segments. Hence, the
2105
+ | super-points are assigned to an object only after the existence of the object is verified
2106
+ | with the underlying geometry.
2107
+ blank |
2108
+ |
2109
+ title | 3.5 Results
2110
+ text | In this section, we present the performance results obtained from testing our system
2111
+ | on various synthetic and real-world scenes.
2112
+ blank |
2113
+ |
2114
+ title | 3.5.1 Synthetic Scenes
2115
+ text | We tested our framework on synthetic scans of 3D scenes obtained from the Google
2116
+ | 3D Warehouse (see Figure 3.8). We implemented a virtual scanner to generate the
2117
+ | synthetic data: once the user specifies a viewpoint, we read the depth buffer to recover
2118
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 52
2119
+ blank |
2120
+ |
2121
+ |
2122
+ text | 3D range data of the virtual scene from the specified viewpoint. We control the scan
2123
+ | quality using three parameters: (i) scanning density d to control the fraction of points
2124
+ | that are retained, (ii) noise level g to control the zero mean Gaussian noise added to
2125
+ | each point along the current viewing direction, and (iii) the angle noise a to perturb
2126
+ | the position in the local tangent plane using zero mean Gaussian noise. Unless stated,
2127
+ | we used default values of d = 0.4, g = 0.01, and a = 5◦ .
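+ blank |
+ | A sketch of how such a degradation can be applied to an ideal scan; the exact noise
+ | model is an interpretation of the text, and points, per-point viewing directions, and one
+ | unit tangent direction per point are assumed to be available (sensor at the origin).
+ |
+ | import numpy as np
+ |
+ | def degrade_scan(points, view_dirs, tangents, d=0.4, g=0.01, a_deg=5.0, rng=None):
+ |     rng = np.random.default_rng(0) if rng is None else rng
+ |     keep = rng.random(len(points)) < d                        # (i) keep a fraction d
+ |     pts, views, tans = points[keep], view_dirs[keep], tangents[keep]
+ |     pts = pts + views * rng.normal(0.0, g, (len(pts), 1))     # (ii) noise along view ray
+ |     sigma = np.linalg.norm(pts, axis=1, keepdims=True) * np.tan(np.radians(a_deg))
+ |     pts = pts + tans * rng.normal(0.0, 1.0, (len(pts), 1)) * sigma   # (iii) tangent jitter
+ |     return pts
+ blank |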
2128
+ | In Figure 3.8, we present typical recognition results using our framework. The
2129
+ | system learned different models of chairs and placed them with varying deformations
2130
+ | (see Table 3.2). We exaggerated some of the deformation modes, including very
2131
+ | high chairs and severely tilted monitors, but could still reliably detect them all (see
2132
+ | Table 3.3). Beyond recognition, our system reliably recovered both positions and
2133
+ | pose parameters within 5% error margin of the object size. Incomplete data can,
2134
+ | however, result in ambiguities: for example, in synthetic #2 our system correctly
2135
+ | detected a chair, but displayed it in a flipped position, since the scan contained data
2136
+ blank |
2137
+ |
2138
+ |
2139
+ |
2140
+ text | [Figure 3.8 panels: synthetic 1, synthetic 2, synthetic 3.]
2151
+ blank |
2152
+ text | Figure 3.8: Recognition results on synthetic scans of virtual scenes: (left to right) syn-
2153
+ | thetic scenes, virtual scans, and detected scene objects with variations. Unmatched
2154
+ | points are shown in gray.
2155
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 53
2156
+ blank |
2157
+ |
2158
+ |
2159
+ text | only from the chair’s back. While specific volume-based reasoning can be used to
2160
+ | give preference to chairs in an upright position, our system avoided such case-specific
2161
+ | rules in the current implementation.
2162
+ blank |
2163
+ |
2164
+ |
2165
+ |
2166
+ text | [Figure 3.9 panel labels: similar (left pair), different (right pair).]
2167
+ blank |
2168
+ |
2169
+ text | Figure 3.9: Chair models used in synthetic scenes.
2170
+ blank |
2171
+ text | In practice, acquired data sets suffer from varying sampling resolution, noise, and
2172
+ | occlusion. While it is difficult to exactly mimic real-world scenarios, we ran synthetic
2173
+ | tests to assess the stability of our algorithm. We placed two classes of chairs (see
2174
+ | Figure 3.9) on a ground plane, 70-80 chairs of each type, and created scans from
2175
+ | 5 different viewpoints with varying density and noise parameters. For both classes,
2176
+ | we used our recognition framework to measure precision and recall while varying
2177
+ | parameter λ. Note that precision represents how many of the detected objects are
2178
+ | correctly classified out of total number of detections, while recall represents how many
2179
+ | objects were correctly detected out of the total number of placed objects. In other
2180
+ | words, a precision measure of 1 indicates no false positives, while a recall measure of
2181
+ | 1 indicates there are no false negatives.
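+ blank |
+ | For reference, these two measures reduce to:
+ |
+ | def precision_recall(true_positives, detections, objects_placed):
+ |     # The degenerate zero-denominator cases default to 1.0 here, a convention choice.
+ |     precision = true_positives / detections if detections else 1.0
+ |     recall = true_positives / objects_placed if objects_placed else 1.0
+ |     return precision, recall
+ blank |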
2182
+ | Figure 3.10 shows the corresponding precision-recall curves. The first two plots
2183
+ | show precision-recall curves using a similar pair of models, where the chairs have sim-
2184
+ | ilar dimensions, which is expected to result in high false-positive rates (see Figure 3.9,
2185
+ | left). Not surprisingly, recognition improves with a lower noise margin and/or higher
2186
+ | sampling density. Performance, however, is saturated with Gaussian noise lower than
2187
+ | 0.3 and density higher than 0.6 since both our model- and part-based components
2188
+ | are approximations of the true data, resulting in an inherent discrepancy between
2189
+ | measurement and the model, even in absence of noise. Note that as long as the parts
2190
+ | and dimensions are captured, our system still detects objects even under high noise
2191
+ | and sparse sampling.
2192
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 54
2193
+ blank |
2194
+ |
2195
+ |
2196
+ text | [Figure 3.10 plots: three precision-recall panels — "Density (a similar pair)", "Noise (a similar pair)", and "Data type" — with precision on the x-axis and recall on the y-axis; the legends vary the sampling density (0.4-0.8), the Gaussian noise level (0.004-2.0), and the similar vs. different model pair.]
2226
+ blank |
2227
+ text | Figure 3.10: Precision-recall curve with varying parameter λ.
2228
+ blank |
2229
+ text | Our algorithm has higher robustness when the pair of models are sufficiently
2230
+ | different (see Figure 3.10, right). We tested with two pairs of chairs (see Figure 3.9):
2231
+ | the first pair had chairs of similar dimensions as before (in solid lines), while the
2232
+ | second pair had a chair and a sofa with large geometric differences (in dotted lines).
2233
+ | When tested with the different pairs, our system achieved precision higher than 0.98
2234
+ | for recall larger than 0.9. Thus, as long as the geometric space of the objects is sparsely
2235
+ | populated, our algorithm has a high accuracy in quickly acquiring the geometry of
2236
+ | environment without assistance from data-driven or machine-learning techniques.
2237
+ blank |
2238
+ |
2239
+ title | 3.5.2 Real-World Scenes
2240
+ text | The more practical test of our system is its performance on real scanned data since
2241
+ | it is difficult to synthetically recreate all the artifacts encountered during scanning
2242
+ | of an actual physical space. We tested our framework on a range of real-world ex-
2243
+ | amples, each consisting of multiple objects arranged over large spaces (e.g., office
2244
+ | areas, seminar rooms, auditoriums) at a university. For both the learning and the
2245
+ | recognition phases, we acquired the scenes using a Microsoft Kinect scanner with an
2246
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 55
2247
+ blank |
2248
+ |
2249
+ text | scene         model       points per scan   no. of scans   no. of prim.   no. of joints
+ | synthetic1    chair       28445             7              10             4
+ | synthetic1    stool       19944             7              3              2
+ | synthetic1    monitor     60933             7              3              2
+ | synthetic2    chair a     720364            7              9              5
+ | synthetic2    chair b     852072            1              6              0
+ | synthetic3    chair       253548            4              10             2
+ | office        chair       41724             7              8              4
+ | office        monitor     20011             5              3              2
+ | office        trash bin   28348             2              4              0
+ | office        whitebrd.   356231            1              3              0
+ | auditorium    chair       31534             5              4              2
+ | seminar rm.   chair       141301            1              4              0
2266
+ blank |
2267
+ text | Table 3.2: Models obtained from the learning phase (see Figure 3.11).
2268
+ blank |
2269
+ text | open source scanning library [EEH+ 11]. The scenes were challenging, especially due
2270
+ | to the amount of variability in the individual model poses (see our project page for
2271
+ | the input scans and recovered models). Table 3.2 summarizes all the models built
2272
+ | during the learning stage for these scenes ranging from 3-10 primitives with 0-5 joints
2273
+ | extracted from only a few scans (see Figure 3.11). While we evaluated our framework
2274
+ | based on the raw Kinect output rather than on processed data (e.g., [IKH+ 11]), the
2275
+ | performance limits should be similar when calibrated to the data quality and physical
2276
+ | size of the objects.
2277
+ blank |
2278
+ |
2279
+ |
2280
+ |
2281
+ text | Figure 3.11: Various models learned/used in our test (see Table 3.2).
2282
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 56
2283
+ blank |
2284
+ |
2285
+ |
2286
+ text | Our recognition phase was lightweight and fast, taking on average 200ms to com-
2287
+ | pare a point cluster to a model on a 2.4 GHz CPU with 6 GB RAM. For example, in
2288
+ | Figure 3.1, our system detected all 5 chairs present and 4 of the 5 monitors, along with
2289
+ | their poses. Note that objects that were not among the learned models remained un-
2290
+ | detected, including a sofa in the middle of the space and other miscellaneous clutter.
2291
+ | We overlaid the unresolved points on the recognized parts for comparison. Note that
2292
+ | our algorithm had access to only the geometry of objects, not any color or texture
2293
+ | attributes. The complexity of our problem setting can be appreciated by looking at
2294
+ | the input scan, which is difficult even for a human to parse visually. We observed
2295
+ | Kinect data to exhibit highly non-linear noise effects that were not simulated in our
2296
+ | synthetic scans; data also went missing when an object was narrow or specular (e.g.,
2297
+ | monitor), with flying pixels along depth discontinuities, and severe quantization noise
2298
+ | for distant objects.
2299
+ | scene       input points (ave. / min. / max.)   objects present   objects detected*
+ | syn. 1      3227 / 1168 / 9967                  5c 3s 5m          5c 3s 5m
+ | syn. 2      2422 / 1393 / 3427                  4ca 4cb           4ca 4cb
+ | syn. 3      1593 / 948 / 2704                   14 chairs         14 chairs
+ | teaser      6187 / 2575 / 12083                 5c 5m 0t          5c 4m 0t
+ | office 1    3452 / 1129 / 7825                  5c 2m 1t 2w       5c 2m 1t 2w
+ | office 2    3437 / 1355 / 10278                 8c 5m 0t 2w       6c 3m 0t 2w
+ | aud. 1      19033 / 11377 / 29260               26 chairs         26 chairs
+ | aud. 2      9381 / 2832 / 13317                 21 chairs         19 chairs
+ | sem. 1      4326 / 840 / 11829                  13 chairs         11 chairs
+ | sem. 2      6257 / 2056 / 12467                 18 chairs         16 chairs
+ | *c: chair, m: monitor, t: trash bin, w: whiteboard, s: stool
2313
+ | Table 3.3: Statistics for the recognition phase. For each scene, we also indicate the
2314
+ | corresponding scene in Figure 3.8 and Figure 3.12, when applicable.
2315
+ blank |
2316
+ text | Figure 3.12 compiles the results for cluttered office setups, auditoriums, and sem-
2317
+ | inar rooms. Although we tested with different scenes, we present only representative
2318
+ | examples as the performance on all types of scenes was comparable. Our system
2319
+ | detected the chairs, computer monitors, whiteboards, and trash bins across different
2320
+ | rooms, and the rows of auditorium chairs in different configurations. Our system
2321
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 57
2322
+ blank |
2323
+ |
2324
+ |
2325
+ text | missed some of the monitors because the material properties of the screens were proba-
2326
+ | bly not favorable to Kinect capture. The missed monitors (as in Figure 3.1 and office
2327
+ | #2 in Figure 3.12) have big rectangular holes within the screen in the scans. In office
2328
+ | #2, the system also missed two of the chairs that were mostly occluded and beyond
2329
+ | what our framework can handle.
2330
+ | Even under such demanding data quality, our system can recognize the models
2331
+ | and recover poses from data sets an order of magnitude sparser than those required
2332
+ | in the learning phase. Surprisingly, the system could also detect the small tables in
2333
+ | the two auditorium scenes (1 in auditorium #1, and 3 in auditorium #2) and also
2334
+ | identify pose changes in the auditorium seats. Figure 3.13 shows a close-up office
2335
+ | scene to better illustrate the deformation modes that our system captured. All of the
2336
+ | recognized object models have one or more deformation modes, and we can visually
2337
+ | compare the quality of data to the recovered pose and deformation.
2338
+ | The segmentation of real-world scenes is challenging with naturally cluttered
2339
+ | set-ups. The challenge is well demonstrated in the seminar rooms because of closely
2340
+ | spaced chairs or chairs leaning against the wall. In contrast to the auditorium scenes,
2341
+ | where the rows of chairs are detected together making the segmentation trivial, in
2342
+ | the seminar room setting chairs often occlude each other. The quality of data also
2343
+ | deteriorates because of thin metal legs with specular highlights. Nevertheless, our
2344
+ | system correctly recognized most of the chairs along with correct configurations by
2345
+ | first detecting the larger parts. Although only 4-6 chairs were detected in the initial
2346
+ | iteration, our system eventually detected most of the chairs in the seminar rooms by
2347
+ | refining the segmentation based on the learned geometry (in 3-4 iterations).
2348
+ blank |
2349
+ |
2350
+ title | 3.5.3 Comparisons
2351
+ text | In the learning phase, our system requires multiple scans of an object to build a proxy
2352
+ | model along with its deformation modes. Unfortunately, the existing public data sets
2353
+ | do not provide such multiple scans. Instead, we compared our recognition routine
2354
+ | to the algorithm proposed by Koppula et al. [KAJS11] using author-provided code
2355
+ | to recognize objects from a real-time stream of Kinect data after the user manually
2356
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 58
2357
+ blank |
2358
+ |
2359
+ |
2360
+ text | marks the ground plane. We fixed the device location and qualitatively compared
2361
+ | the recognition results of the two algorithms (see Figure 3.14). We observed that
2362
+ | Koppula et al. reliably detect floors, table tops and front-facing chairs, but often fail
2363
+ | to detect chairs facing backwards, or distant ones. They also miss all the monitors,
2364
+ | which usually are very noisy. In contrast, our algorithm, being pose- and variation-
+ | aware, is more stable across multiple frames, even with access to less information (we
2366
+ | do not use color). Note that while our system detected some monitors, their poses are
2367
+ | typically biased toward parts where measurements exist. In summary, for partial and
2368
+ | noisy point-clouds, the probabilistic formulation coupled with geometric reasoning
2369
+ | results in robust semantic labeling of the objects.
2370
+ blank |
2371
+ |
2372
+ title | 3.5.4 Limitations
2373
+ text | While in our tests the recognition results were mostly satisfactory (see Table 3.3),
2374
+ | we observed two main failure modes. First, our system failed to detect objects when
2375
+ | large amounts of data were missing. In real-world scenarios, our object scans could
2376
+ | easily exhibit large holes because of occlusions, specular materials, or thin structures.
2377
+ | Further, scans can be sparse and distorted for distant objects. Second, our system
2378
+ | cannot overcome the limitations of our initial segmentation. For example, if objects
2379
+ | are closer than θdist , our system groups them as a single object; while a single object
2380
+ | can be confused for multiple objects if its measurements are separated by more than
2381
+ | θdist from a particular viewpoint. Although in certain cases the algorithm can recover
2382
+ | segmentations with the help of other visible parts, this recovery becomes difficult
2383
+ | because our system allows objects to deform and hence have variable extent.
2384
+ | However, even with these limitations, our system overall reliably recognized scans
2385
+ | with 1000-3000 points per scan since in the learning phase the system extracted
2386
+ | the important degrees of variation, thus providing a compact, yet powerful, model
2387
+ | (and deformation) abstraction. In a real office setting, the simplicity and speed
2388
+ | of our framework would allow a human operator to immediately notice missed or
2389
+ | misclassified objects and quickly re-scan those areas under more favorable conditions.
2390
+ | We believe that such progressive scanning will become more commonplace
2391
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 59
2392
+ blank |
2393
+ |
2394
+ |
2395
+ text | in future acquisition setups.
2396
+ blank |
2397
+ |
2398
+ title | 3.5.5 Applications
2399
+ text | Our results suggest that our system is also useful for obtaining a high-level under-
2400
+ | standing of recognized objects, e.g., relative position, orientation, frequency of learned
2401
+ | objects. Specifically, as our system progressively scans multiple rooms populated with
2402
+ | the same objects, the system gathers valuable co-occurrence statistics (see Table 3.4).
2403
+ | For example, from the collected data, the system learns that the orientations of audi-
+ | torium chairs are consistent (i.e., they face a single direction), and it observes a pattern in
+ | the relative orientation between a chair and its neighboring monitor. Not surprisingly,
2406
+ | our system found chairs to be more frequent in seminar rooms rather than in offices.
2407
+ | In the future, we plan to incorporate such information to handle cluttered datasets
2408
+ | while scanning similar environments but with differently shaped objects.
2409
+ blank |
2410
+ text | scene    relationship     distance (m) mean / std    angle (°) mean / std
+ | office   chair-chair      1.207 / 0.555              78.7 / 74.4
+ | office   chair-monitor    0.943 / 0.164              152 / 39.4
+ | aud.     chair-chair      0.548 / 0                  0 / 0
+ | sem.     chair-chair      0.859 / 0.292              34.1 / 47.4
2418
+ blank |
2419
+ text | Table 3.4: Statistics between objects learned for each scene category.
2420
+ blank |
2421
+ text | As an exciting possibility, the system can efficiently detect change. By change, we
2422
+ | mean the introduction of a new object not seen in the learning phase, while
2423
+ | factoring out variations due to different spatial arrangements or changes in individual
2424
+ | model poses. For example, in auditorium #2, a previously unobserved chair
2425
+ | is successfully detected (highlighted in yellow). Such a mode is particularly useful
2426
+ | for surveillance and automated investigation of indoor environments, or for disaster
2427
+ | planning in environments that are unsafe for humans to venture.
2428
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 60
2429
+ blank |
2430
+ |
2431
+ |
2432
+ title | 3.6 Conclusions
2433
+ text | We have presented a simple system for recognizing man-made objects in cluttered 3D
2434
+ | indoor environments, while factoring out low-dimensional deformations and pose vari-
2435
+ | ations, on a scale previously not demonstrated. Our pipeline can be easily extended
2436
+ | to more complex environments primarily requiring reliable acquisition of additional
2437
+ | object models and their variability modes.
2438
+ | Several future challenges and opportunities remain: (i) With an increasing number
2439
+ | of object prototypes, the system will need more sophisticated search data structures
2440
+ | in the recognition phase. We hope to benefit from recent advances in shape search.
2441
+ | (ii) We have focused on a severely restricted form of sensor input, namely, poor and
2442
+ | sparse geometry alone. We intentionally left out color and texture, which can be quite
2443
+ | beneficial, especially if appearance variations can be accounted for. (iii) A natural
2444
+ | extension would be to take the recognized models along with their pose and joint
2445
+ | attributes to create data-driven, high-quality interior CAD models for visualization,
2446
+ | or more schematic representations, that may be sufficient for indoor navigation, or
2447
+ | simply for scene understanding (see Figure 3.1, rightmost image, and recent efforts
2448
+ | in scene modeling [NXS12, SXZ+ 12]).
2449
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 61
2450
+ blank |
2451
+ |
2452
+ |
2453
+ text | [Figure 3.12 panels: office 1 (chair, monitor, desk), office 2 (trash bin, whiteboard), auditorium 1, auditorium 2 (open tables, change detection), seminar room 1 (open seat), seminar room 2 (missed chairs).]
+ blank |
2486
+ blank |
2487
+ |
2488
+ |
2489
+ |
2490
+ text | Figure 3.12: Recognition results on various office and auditorium scenes. Since the
2491
+ | input scans have limited viewpoints and thus are too poor to provide a clear represen-
2492
+ | tation of the scene complexity, we include scene images for visualization (these were
2493
+ | unavailable to the algorithm). Note that for the auditorium examples, our system
2494
+ | even detected the small tables attached to the chairs — this was possible since the
2495
+ | system extracted this variation mode in the learning phase.
2496
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 62
2497
+ blank |
2498
+ |
2499
+ |
2500
+ |
2501
+ text | [Figure 3.13 annotations: missed monitor, laptop, monitor, chair, drawer deformations.]
2508
+ blank |
2509
+ text | Figure 3.13: A close-up office scene. All of the recognized objects have one or more
2510
+ | deformation modes. The algorithm inferred the angles of the laptop screen and the
2511
+ | chair back, heights of the chair seat, the arm rests and the monitor. Note that our
2512
+ | system also captured the deformation modes of open drawers.
2513
+ meta | CHAPTER 3. ENVIRONMENTS WITH VARIABILITY AND REPETITION 63
2514
+ blank |
2515
+ |
2516
+ |
2517
+ |
2518
+ text | [Figure 3.14 panels: input scene 1 and input scene 2, each with results from [Koppula et al.] and ours; annotations mark shifted and wrong labels and missed objects; legend: table top, wall, floor, chair base, table leg, monitor, chair back.]
2535
+ blank |
2536
+ text | Figure 3.14: We compared our algorithm and Koppula et al. [KAJS11] using multiple
2537
+ | frames of scans from the same viewpoint. Our recognition results are more stable
2538
+ | across different frames.
2539
+ meta | Chapter 4
2540
+ blank |
2541
+ title | Guided Real-Time Scanning of
2542
+ | Indoor Objects3
2543
+ blank |
2544
+ text | Acquiring 3-D models of indoor environments is a critical component of under-
+ | standing and mapping these environments. For successful 3-D acquisition in indoor
2546
+ | scenes, it is necessary to simultaneously scan the environment, interpret the incom-
2547
+ | ing data stream, and plan subsequent data acquisition, all in a real-time fashion. The
2548
+ | challenge is, however, that individual frames from portable commercial 3-D scanners
2549
+ | (RGB-D cameras) can be of poor quality. Typically, complex scenes can only be
2550
+ | acquired by accumulating multiple scans. Information integration is done in a post-
2551
+ | scanning phase, when such scans are registered and merged, leading eventually to
2552
+ | useful models of the environment. Such a workflow, however, is limited by the fact
2553
+ | that poorly scanned or missing regions are only identified after the scanning process
2554
+ | is finished, when it may be costly to revisit the environment being acquired to per-
2555
+ | form additional scans. In the study presented in this chapter, we focused on real-time
2556
+ | 3D model quality assessment and data understanding, that could provide immediate
2557
+ | feedback for guidance in subsequent acquisition.
2558
+ | Evaluating acquisition quality without having any prior knowledge about an un-
2559
+ | known environment, however, is an ill-posed problem. We observe that although the
2560
+ meta | 3
2561
+ text | The contents of the chapter will be published as Y.M. Kim, N. Mitra, Q. Huang, L. Guibas,
2562
+ | Guided Real-Time Scanning of Indoor Environments, Pacific Graphics 2013.
2563
+ blank |
2564
+ |
2565
+ |
2566
+ meta | 64
2567
+ | CHAPTER 4. GUIDED REAL-TIME SCANNING 65
2568
+ blank |
2569
+ |
2570
+ |
2571
+ text | target scene itself may be unknown, in many cases the scene consists of objects from
2572
+ | a well-prescribed pre-defined set of object categories. Moreover, these categories are
2573
+ | well represented in publicly available 3-D shape repositories (e.g., Trimble 3D Ware-
2574
+ | house). For example, an office setting typically consists of various tables, chairs,
2575
+ | monitors, etc., all of which have thousands of instances in the Trimble 3D Ware-
2576
+ | house. In our approach, instead of attempting to reconstruct detailed 3D geometry
2577
+ | from low-quality inconsistent 3D measurements, we focus on parsing the input scans
2578
+ | into simpler geometric entities, and use existing 3D model repositories like Trimble
2579
+ | 3D warehouse as proxies to assist the process of assessing data quality. Thus, we
2580
+ | defined two key tasks that an effective acquisition method would need to complete:
2581
+ | (i) given a partially scanned object, reliably and efficiently retrieve appropriate proxy
2582
+ blank |
2583
+ |
2584
+ |
2585
+ |
2586
+ text | Figure 4.1: We introduce a real-time guided scanning system. As streaming 3D
2587
+ | data is progressively accumulated (top), the system retrieves the top matching mod-
2588
+ | els (bottom) along with their pose to act as geometric proxies to assess the current
2589
+ | scan quality, and provide guidance for subsequent acquisition frames. Only a few
2590
+ | intermediate frames with corresponding retrieved models are shown in this figure.
2591
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 66
2592
+ blank |
2593
+ |
2594
+ |
2595
+ text | models of it from the database; and (ii) position the retrieved models in the scene
2596
+ | and provide real-time feedback (e.g., missing geometry that still needs to be scanned)
2597
+ | to guide subsequent data gathering.
2598
+ | We introduce a novel partial shape retrieval approach for finding similar shapes
2599
+ | of a query partial scan. In our setting, we used the Microsoft Kinect to acquire
2600
+ | the scans of real objects. The proposed approach, which combines both descriptor-
2601
+ | based retrieval and registration-based verification, is able to search in a database of
2602
+ | thousands of models in real-time. To account for partial similarity between the input
2603
+ | scan and the models in a database, we created simulated scans of each database model
2604
+ | and compared a scan of the real setting to a scan of the simulated setting. This allowed us to
2605
+ | efficiently compare shapes using global descriptors even in the presence of only partial
2606
+ | similarity; and the approach remains robust in the case of occlusions or missing data
2607
+ | about the object being scanned.
2608
+ | Once our system finds a match, to mark out missing parts in the current merged
2609
+ | scan, the system aligns it with the retrieved model and highlights the missing part
2610
+ | or places where the scan density is low. This visual feedback allows the operator
2611
+ | to quickly adjust the scanning device for subsequent scans. In effect, our 3D model
2612
+ | database and matching algorithms make it possible for the operator to assess the
2613
+ | quality of the data being acquired and discover badly scanned or missing areas while
2614
+ | the scan is being performed, thus allowing corrective actions to be taken immediately.
2615
+ | We extensively evaluated the robustness and accuracy of our system using syn-
2616
+ | thetic data sets with available ground truth. Further, we tested our system on physical
2617
+ | environments to achieve real-time scene understanding (see the supplementary video
2618
+ | that includes the actual scanning session recorded). In summary, in this chapter, we
2619
+ | present a novel guided scanning interface and introduce a relation-based light-weight
2620
+ | descriptor for fast and accurate model retrieval and positioning to provide real-time
2621
+ | guidance for scanning.
2622
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 67
2623
+ blank |
2624
+ |
2625
+ |
2626
+ title | 4.1 Related Work
2627
+ blank |
2628
+ title | 4.1.1 Interactive Acquisition
2629
+ text | Fast, accurate, and autonomous model acquisition has long been a primary goal in
2630
+ | robotics, computer graphics, and computer vision. With the introduction of afford-
2631
+ | able, portable, commercial RGBD cameras, there has been a pressing need to simplify
2632
+ | scene acquisition workflows to allow less experienced individuals to acquire scene ge-
2633
+ | ometries. Recent efforts fall into two broad categories: (i) combining individual
2634
+ | frames of low-quality point-cloud data with SLAM algorithms [EEH+ 11, HKH+ 12] to
2635
+ | improve scan quality [IKH+ 11]; (ii) using supervised learning to train classifiers for
2636
+ | scene labeling [RBF12] with applications to robotics [KAJS11]. Previously, [RHHL02]
2637
+ | aggregated scans at interactive rates to provide visual feedback to the user. This work
2638
+ | was recently expanded by [DHR+ 11]. [KDS+ 12] extracted simple planes and recon-
2639
+ | struct floor plans with guidance from a projector pattern. While our goal is also to
2640
+ | provide real-time feedback, our system differs from previous efforts in that it uses
2641
+ | retrieved proxy models to automatically assess the current scan quality, enabling
2642
+ | guided scanning.
2643
+ blank |
2644
+ |
2645
+ title | 4.1.2 Scan Completion
2646
+ text | Various strategies have been proposed to improve noisy scans or plausibly fill in miss-
2647
+ | ing data due to occlusion: researchers have exploited repetition [PMW+ 08], symme-
2648
+ | try [TW05, MPWC12], or used primitives to complete missing parts [SWK07]. Other
2649
+ | approaches have focused on using geometric proxies and abstractions including curves,
2650
+ | skeletons, planar abstractions, etc. In the context of image understanding, indoor
2651
+ | scenes have been abstracted and modeled as a collection of simple cuboids [LGHK10,
2652
+ | ZCC+ 12] to capture a variety of man-made objects.
2653
+ blank |
2654
+ |
2655
+ title | 4.1.3 Part-Based Modeling
2656
+ text | Simple geometric primitives, however, are not always sufficiently expressive for com-
2657
+ | plex shapes. Meanwhile, such objects can still be split into simpler parts that aid
2658
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 68
2659
+ blank |
2660
+ |
2661
+ |
2662
+ text | shape understanding. For example, parts can act as entities for discovering rep-
2663
+ | etitions [TSS10], training classifiers [SFC+ 11, XS12], or facilitating shape synthe-
2664
+ | sis [JTRS12]. Alternately, a database of part-based 3D model templates can be used
2665
+ | to detect shapes from incomplete data [SXZ+ 12, NXS12, KMYG12]. Such methods
2666
+ | often rely on expensive matching, and thus do not lend themselves to low-memory
2667
+ | footprint real-time realizations.
2668
+ blank |
2669
+ |
2670
+ title | 4.1.4 Template-Based Completion
2671
+ text | Our system also uses database of 3D models (e.g., chairs, lamps, tables) to retrieve
2672
+ | shape from 3D scans. However, by defining a novel simple descriptor, our sys-
2673
+ | tem, compared to previous efforts, can reliably handle much larger model databases.
2674
+ | Specifically, instead of geometrically matching templates [HCI+ 11], or using templates
2675
+ | to complete missing parts [PMG+ 05], our system initially searches for consistency in
2676
+ | the distribution of relations among primitive faces.
2677
+ blank |
2678
+ |
2679
+ title | 4.1.5 Shape Descriptors
2680
+ text | In the context of shape retrieval, various descriptors have been investigated for group-
2681
+ | ing, classification, or retrieval of 3D geometry. For example, the method proposed by
2682
+ | [CTSO03] uses light-field descriptors based on silhouettes, the method by [OFCD02]
2683
+ | uses shape distributions to categorize different object classes, etc. The silhouette
2684
+ | method requires an expensive rotational alignment search, limiting its usefulness in
2685
+ | our setting to a small number of models (100-200). Both methods assume access
2686
+ | to nearly complete models to match against. In contrast, for guided scanning, our
2687
+ | approach can support much larger model sets (about 2000 models) and, more impor-
2688
+ | tantly, focus on handling poor and incomplete point sets as inputs to the matcher.
2689
+ blank |
2690
+ |
2691
+ title | 4.2 Overview
2692
+ text | Figure 4.2 illustrates the pipeline of our guided real-time scanning system, which con-
2693
+ | sists of a scanning device (Kinect in our case) and a database of 3D shapes containing
2694
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 69
2695
+ blank |
2696
+ |
2697
+ |
2698
+ |
2699
+ text | [Figure 4.2 diagram: off-line process — database of 3D models → simulated scans → A2h descriptor → similarity measure; on-line process — frames of measurement → segmented, registered pointcloud → A2h descriptor and density voxels → retrieved shape → align shape → retrieved model + pose → provide guidance.]
2736
+ blank |
2737
+ |
2738
+ text | Figure 4.2: Pipeline of the real-time guided scanning framework.
2739
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 70
2740
+ blank |
2741
+ |
2742
+ |
2743
+ text | the categories of the shapes present in the environment. In each iteration, the sys-
2744
+ | tem performs three tasks: (i) scan acquisition from a set of viewpoints specified by a
2745
+ | user (or a planning algorithm); (ii) shape retrieval using the distribution of relations;
+ | and (iii) comparison of the scanned pointset with the best retrieved model. The system
+ | iterates these steps until a sufficiently good match is found (see supplementary video).
+ | The challenge is to maintain a real-time response.
2749
+ blank |
2750
+ |
2751
+ title | 4.2.1 Scan Acquisition
2752
+ text | The input stream of a real-time depth sensor (a Kinect in our case) is collected and
+ | processed using an open-source implementation [EEH+ 11] that calibrates the color and
+ | depth measurements and outputs the pointcloud data. Color features are then extracted
+ | from individual frames and matched across consecutive frames. The corresponding depth
+ | values are used to incrementally register the depth measurements [HKH+ 12]. The
+ | pointcloud that belongs to the object is segmented as the system detects the ground
+ | plane and excludes the points that belong to it. We will refer to the segmented,
+ | registered set of depth measurements as a merged scan. Whenever a new frame is
+ | processed, the system calculates the descriptor and the density voxels from the
+ | pointcloud data of the merged scan.
2762
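The registration and calibration above come from [EEH+ 11] and [HKH+ 12]; the sketch below only illustrates the ground-plane segmentation step, in Python with numpy. It assumes a RANSAC-style plane fit (the chapter does not name the exact estimator), and the names `fit_ground_plane` and `segment_object` are placeholders rather than part of the system; the dominant-plane threshold mirrors the 50% criterion mentioned in Section 4.4.

```python
import numpy as np

def fit_ground_plane(points, n_iters=200, inlier_thresh=0.02, seed=None):
    """Fit a dominant plane n.x + d = 0 with a simple RANSAC loop.

    Returns ((normal, d), inlier_mask) for the best plane found.
    """
    rng = np.random.default_rng(seed)
    best_mask, best_model = None, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:
            continue                                   # degenerate triple, resample
        n /= np.linalg.norm(n)
        d = -np.dot(n, p0)
        mask = np.abs(points @ n + d) < inlier_thresh
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (n, d)
    return best_model, best_mask

def segment_object(points, plane_fraction=0.5):
    """Drop the ground-plane points; keep the rest as the object pointcloud."""
    (n, d), mask = fit_ground_plane(points, seed=0)
    if mask.mean() < plane_fraction:                   # plane must dominate the scene
        raise RuntimeError("no dominant ground plane detected yet")
    return points[~mask], (n, d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    floor = np.c_[rng.uniform(-1, 1, (4000, 2)), rng.normal(0, 0.005, 4000)]
    chair = rng.uniform([-0.2, -0.2, 0.0], [0.2, 0.2, 0.9], (1500, 3))
    obj, (normal, _) = segment_object(np.vstack([floor, chair]))
    print(len(obj), "object points; plane normal ~", np.round(normal, 2))
```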
+ blank |
2763
+ |
2764
+ title | 4.2.2 Shape Retrieval
2765
+ text | Our goal is to find shapes in the database that are similar to the merged scan. Since
2766
+ | the merged scan may contain only partial information about the object being scanned,
2767
+ | our system internally generates simulated views of both the merged scan and the
+ | shapes in the database, and then compares the point clouds associated with these
2769
+ | views. The key observation is that although the merged scan may still have missing
2770
+ | geometry, it is likely that it contains all the visible geometry of the object being
2771
+ | scanned when the object is viewed from a particular point of view (i.e., the self-
2772
+ | occlusions are predictable); it thus becomes comparable to database model views
2773
+ | from the same or nearby viewpoints. Hence, the system measures shape similarity
2774
+ | between such point-cloud views. For shape retrieval, our system first performs a
2775
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 71
2776
+ blank |
2777
+ |
2778
+ |
2779
+ text | descriptor-based similarity search against the entire database to obtain a candidate
2780
+ | set of similar models. Finally, the system performs registration of each model with
2781
+ | the merged scan and returns the model with the best alignment score.
2782
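The two-stage structure (a cheap descriptor filter, then alignment-based verification of a small candidate set) can be made concrete with a short sketch; this is not the authors' code. The descriptor distance anticipates the CDF-based EMD of Section 4.3.3, and `align_score` stands in for the registration step of Section 4.3.4, supplied here by the caller.

```python
import numpy as np

def retrieve(query_desc, db_descs, align_score, top_k=25):
    """Two-stage retrieval: descriptor pre-filter, then verification.

    query_desc : (150,) A2h descriptor of the merged scan
    db_descs   : (V, 150) descriptors of all simulated database views
    align_score: callable(view_index) -> higher-is-better alignment score
    """
    # Stage 1: per-height-bin EMD as the L1 distance between CDFs (Sec. 4.3.3);
    # keep only the top_k closest views as candidates.
    q = np.cumsum(query_desc.reshape(3, 50), axis=1)
    db = np.cumsum(db_descs.reshape(-1, 3, 50), axis=2)
    dist = np.abs(db - q).sum(axis=(1, 2))
    candidates = np.argsort(dist)[:top_k]
    # Stage 2: verify each candidate by registration and return the best one.
    scores = np.array([align_score(i) for i in candidates])
    return candidates, int(candidates[np.argmax(scores)])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    db_descs = rng.random((5000, 150))                 # stand-in descriptors
    query = db_descs[1234] + rng.normal(0, 0.01, 150)  # noisy copy of view 1234
    cands, best = retrieve(query, db_descs, align_score=lambda i: -abs(i - 1234))
    print("best match:", best)                         # expected: 1234
```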
+ | We note here that past research on global shape descriptors has mostly focused on
2783
+ | broad differentiation of shape classes, e.g., separating shapes of vehicles from those
+ | of furniture or of people. In our case, since the system is looking for potentially
+ | modest amounts of missing geometry in the scans, we aim instead for fine differentiation
+ | of variability within a particular object class, such as chairs. We have therefore
2787
+ | developed and exploited a novel histogram descriptor based on the angles between
2788
+ | the shape normals for this task (see Section 4.3.2).
2789
+ blank |
2790
+ |
2791
+ title | 4.2.3 Scan Evaluation
2792
+ text | Once the best matching model is retrieved, the proxy is displayed to the user.
+ | The system also highlights voxels with missing data when compared with the best
+ | matching model, and finishes when the retrieved best match is close enough to
+ | the current measurement (when the missing voxels amount to less than 1% of the total
+ | number of voxels). In Section 4.3.4, we elaborate on this guided scanning interface.
2797
+ blank |
2798
+ |
2799
+ title | 4.3 Partial Shape Retrieval
2800
+ text | Our goal is to quickly assess the quality of the current scan and guide the user in
2801
+ | subsequent scans. This is challenging on the following counts: (i) the system has
2802
+ | to assess model quality without necessarily knowing which model is being scanned;
2803
+ | (ii) the scans are potentially incomplete, with large parts of data missing; and (iii) the
2804
+ | system should respond in real-time.
2805
+ | We observe that existing database models such as Trimble 3D Warehouse models
2806
+ | can be used as proxies for evaluating scan quality of similar objects being scanned,
2807
+ | thus addressing the first challenge. Hence, for any merged query scan (i.e., point-
2808
+ | cloud) S, the system looks for a match among similar models in the database M =
2809
+ | {M1 , · · · MN }. For simplicity, we assume that the up-right orientation of each model
2810
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 72
2811
+ blank |
2812
+ |
2813
+ |
2814
+ text | in the model database is available.
2815
+ | To handle the second challenge, we note that missing data, even in large chunks,
2816
+ | are mostly the result of self occlusion, and hence are predictable. To address this
2817
+ | problem, our system synthetically scans the models Mi from different viewpoints to
2818
+ | simulate such self occlusions. This greatly simplifies the problem by allowing us to
2819
+ | directly compare S to the simulated scans of Mi , thus automatically accounting for
2820
+ | missing data in S.
2821
+ | Finally, to achieve real-time performance, we propose a simple, robust, yet effective
2822
+ | descriptor to match S to view-dependent scans of Mi . Subsequently, the system
2823
+ | performs registration to verify the match between each matched simulated scan and
2824
+ | the query scan, and returns the most similar simulated scan and the corresponding
2825
+ | model Mi. The following subsections provide further details of each step of
+ | partial shape retrieval.
2827
+ blank |
2828
+ |
2829
+ title | 4.3.1 View-Dependent Simulated Scans
2830
+ text | For each model Mi , the system generates simulated scans S k (Mi ) from multiple cam-
2831
+ | era positions. Let dup denote the up-right orientation of model Mi. Our system takes
+ | dup as the z-axis and fixes an arbitrary orthogonal direction di (i.e., di · dup = 0) as
2833
+ | the x-axis. The system also translates the centroid of Mi to the origin.
2834
+ | The system then virtually positions the cameras at the surface of a view-sphere
2835
+ | around the origin. Specifically, the camera is placed at
2836
+ blank |
2837
+ text | ci := (2d cos θ sin φ, 2d sin θ sin φ, 2d cos φ)
2838
+ blank |
2839
+ text | where d denotes the length of the diagonal of the bounding box of Mi , and φ denotes
2840
+ | the camera altitude. The camera up-vector is defined as
2841
+ blank |
2842
+ text | u_i := \frac{d_{up} - \langle d_{up}, \hat{c}_i \rangle\, \hat{c}_i}{\lVert d_{up} - \langle d_{up}, \hat{c}_i \rangle\, \hat{c}_i \rVert}, \qquad \text{with } \hat{c}_i = c_i / \lVert c_i \rVert
2845
+ blank |
2846
+ text | and the gaze point is defined as the origin. The field of view is set to π/2 in both
+ | the vertical and horizontal directions.
2848
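As a concrete reading of the camera placement above, the following numpy sketch enumerates the camera centers and up-vectors for K azimuth samples at the two altitudes used later in this subsection. The function name and the returned layout are illustrative, not taken from the system.

```python
import numpy as np

def view_sphere_cameras(diag, K=6, altitudes=(np.pi / 6, np.pi / 3)):
    """Camera centers c_i and up-vectors u_i on a sphere of radius 2*diag.

    diag is the bounding-box diagonal length d; the gaze point is the origin
    and the model's up-right direction d_up is taken as the z-axis.
    """
    d_up = np.array([0.0, 0.0, 1.0])
    cams = []
    for phi in altitudes:                          # camera altitude
        for k in range(K):                         # theta = 2*k*pi/K, k in [0, K)
            theta = 2.0 * k * np.pi / K
            c = 2.0 * diag * np.array([np.cos(theta) * np.sin(phi),
                                       np.sin(theta) * np.sin(phi),
                                       np.cos(phi)])
            c_hat = c / np.linalg.norm(c)
            u = d_up - np.dot(d_up, c_hat) * c_hat     # remove the viewing component
            u /= np.linalg.norm(u)                     # as in the formula above
            cams.append((c, u))
    return cams

if __name__ == "__main__":
    for c, u in view_sphere_cameras(diag=1.0)[:3]:
        print(np.round(c, 3), np.round(u, 3))
```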
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 73
2849
+ blank |
2850
+ |
2851
+ |
2852
+ text | For each such camera location, our system obtains a synthetic scan using the z-
2853
+ | buffer with a grid setting of 200 × 200. Such a grid results in vertices where the grid
2854
+ | rays intersect the model. The system generates the simulated scan by computing one
2855
+ | surfel (pf , nf , df ) (i.e., a point, normal, and density, respectively) from each quad
2856
+ | face f = (qf 1 , qf 2 , qf 3 , qf 4 ), as follows,
2857
+ blank |
2858
+ text | p_f := \sum_{i=1}^{4} q_{fi} / 4, \qquad n_f := \sum_{ijk \in \{123,234,341,412\}} n_{ijk} / 4 \tag{4.1}
+ | d_f := 1 \Big/ \sum_{ijk \in \{123,234,341,412\}} \operatorname{area}(q_{fi}, q_{fj}, q_{fk}) \tag{4.2}
2865
+ blank |
2866
+ |
2867
+ text | where nijk denotes the normal of the triangular face (qf i , qf j , qf k ), and nf is
+ | subsequently normalized, nf ← nf /‖nf ‖. Thus the simulated scan simply collects
+ | surfels generated from all the quad faces of the sampling grid.
+ | Our system places K samples of θ, i.e., θ = 2kπ/K where k ∈ [0, K), and φ ∈
+ | {π/6, π/3} to obtain view-dependent simulated scans for each model Mi . Empirically,
2872
+ | we set K = 6 to balance between efficiency and quality when comparing simulated
2873
+ | scans and the merged scan S.
2874
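The surfel construction of Eqs. (4.1)-(4.2) can be sketched as below, assuming the 200 × 200 grid of ray-surface intersection points has already been produced by the z-buffer render (the rendering itself is not shown). The array layout, the NaN convention for missed rays, and the name `surfels_from_grid` are assumptions made for this illustration.

```python
import numpy as np

def surfels_from_grid(P):
    """One surfel (point, normal, density) per quad face of the scan grid.

    P: (H, W, 3) grid of intersection points, NaN where the ray missed.
    """
    q1, q2 = P[:-1, :-1], P[:-1, 1:]
    q3, q4 = P[1:, 1:], P[1:, :-1]
    quads = np.stack([q1, q2, q3, q4], axis=2).reshape(-1, 4, 3)
    quads = quads[~np.isnan(quads).any(axis=(1, 2))]       # drop incomplete quads

    p = quads.mean(axis=1)                                  # Eq. (4.1): point
    n = np.zeros_like(p)
    area = np.zeros(len(quads))
    for i, j, k in [(0, 1, 2), (1, 2, 3), (2, 3, 0), (3, 0, 1)]:
        cross = np.cross(quads[:, j] - quads[:, i], quads[:, k] - quads[:, i])
        norm = np.maximum(np.linalg.norm(cross, axis=1, keepdims=True), 1e-12)
        n += cross / norm                                   # sum of triangle normals
        area += 0.5 * norm[:, 0]                            # triangle areas
    n /= np.maximum(np.linalg.norm(n, axis=1, keepdims=True), 1e-12)  # n_f <- n_f/||n_f||
    d = 1.0 / np.maximum(area, 1e-12)                       # Eq. (4.2): density
    return p, n, d

if __name__ == "__main__":
    u, v = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(-1, 1, 200))
    P = np.dstack([u, v, 0.1 * (u ** 2 + v ** 2)])          # synthetic curved surface
    p, n, d = surfels_from_grid(P)
    print(p.shape, n.shape, d.shape)
```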
+ blank |
2875
+ |
2876
+ title | 4.3.2 A2h Scan Descriptor
2877
+ text | Our goal is to design a descriptor that (i) is efficient to compute, (ii) is robust to
2878
+ | noise and outliers, and (iii) has a low-memory footprint. We draw inspiration from
2879
+ | shape distributions [OFCD02], which compute statistics about geometric quantities
2880
+ | that are invariant to global transforms, e.g., distances between pairs of points on
2881
+ | the models. Shape distribution descriptors, however, were designed to be resilient to
2882
+ | local geometric changes. Hence, they are ineffective in our setting, where shapes are
2883
+ | distinguished by subtle local features. Instead, our system computes the distributions
2884
+ | of angles between point normals, which better capture the local geometric features.
2885
+ | Further, since the system knows the upright direction of each shape, this information
2886
+ | is incorporated into the design of the descriptor.
2887
+ | Specifically, for each scan S (real or simulated), our system first allocates the
2888
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 74
2889
+ blank |
2890
+ |
2891
+ |
2892
+ text | points into three bins based on their height along the z-axis, i.e., the up-right direction.
2893
+ | Then, among the points within each bin, the system computes the distribution of
2894
+ | angles between normals of all pairs of points. The angle space is discretized into 50
+ | bins over [0, π], i.e., each bin counts the frequency of normal angles falling within
+ | its range. We call this the A2h scan descriptor, which for each point cloud is a 50 × 3 = 150
2897
+ | dimensional vector; this collects the angle distribution within each height bin.
2898
+ | In practice, for pointclouds belonging to any merged scan, our system randomly
2899
+ | samples 10,000 pairs of points within each height bin to speed up the computation. In
2900
+ | our extensive tests, we found this simple descriptor to perform better than distance-
2901
+ | only histograms in distinguishing fine variability within a broad shape class (see
2902
+ | Figure 4.3).
2903
+ blank |
2904
+ |
2905
+ title | 4.3.3 Descriptor-Based Shape Matching
2906
+ text | A straightforward way to compare two descriptor vectors f1 and f2 is to take the Lp
+ | norm of their difference vector f1 − f2 . However, the Lp norm can be sensitive to
+ | noise and does not account for the similarity of distributions whose curves are similar
+ | but slightly shifted.
2909
+ | Instead, our system uses the Earth Mover’s distance (EMD) to compare a pair of
2910
+ | distributions [RTG98]. Intuitively, given two distributions, one distribution can be
2911
+ | seen as a mass of earth properly spread in space, the other distribution as a collection
2912
+ | of holes that need to be filled with that earth. Then, the EMD measures the least
2913
+ | amount of work needed to fill the holes with earth. Here, a unit of work corresponds to
2914
+ | transporting a unit of earth by a unit of ground distance. The costs of “moving earth”
2915
+ | reflect the notion of nearness between bins; therefore the distortion due to noise is
+ | minimized. In a 1D setting, EMD with an L1 ground distance is equivalent to calculating
+ | the L1 norm between the cumulative distribution functions (CDFs) of the two
+ | distributions [Vil03]. Hence,
2918
+ | our system achieves robustness to noise at the same time complexity as calculating
2919
+ | an L1 norm between the A2h distributions. For all of the results presented below, our
2920
+ | system used EMD with L1 norms of the CDFs computed from the A2h distributions.
2921
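The 1-D EMD/CDF equivalence above takes only a few lines. The sketch below compares two A2h descriptors by summing the per-height-bin EMDs, which is one reasonable reading of the text; the exact aggregation over the three height bins is not spelled out there, and the function names are illustrative.

```python
import numpy as np

def emd_1d(p, q):
    """EMD between two 1-D histograms with an L1 ground distance.

    Equals the L1 distance between their CDFs (up to the bin width).
    """
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

def a2h_distance(f1, f2, height_bins=3, angle_bins=50):
    """Sum of per-height-bin 1-D EMDs between two A2h descriptors."""
    f1 = np.reshape(f1, (height_bins, angle_bins))
    f2 = np.reshape(f2, (height_bins, angle_bins))
    return sum(emd_1d(a, b) for a, b in zip(f1, f2))

if __name__ == "__main__":
    h1 = np.zeros(50); h1[10] = 1.0      # all mass in bin 10
    h2 = np.zeros(50); h2[13] = 1.0      # all mass in bin 13
    print(emd_1d(h1, h2))                 # -> 3.0 (move one unit across three bins)
```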
+ | Because there are 2K view-dependent pointclouds associated with each model Mi ,
2922
+ | the system matches the query S with each such pointcloud S k (Mi ) (k = 1, 2, ..., 2K)
2923
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 75
2924
+ blank |
2925
+ |
2926
+ |
2927
+ text | and records the best matching score. In the end, the system returns the top 25
2928
+ | matches across the models in M.
2929
+ blank |
2930
+ |
2931
+ title | 4.3.4 Scan Registration
2932
+ text | Our system overlays the retrieved model Mi over the merged scan S as follows: the system
2933
+ | first aligns the centroid of the simulated scan S k (Mi ) to match the centroid of S (note
2934
+ | that we do not force the model Mi to touch the ground), while scaling model Mi to
2935
+ | match the data. To fix the remaining 1DOF rotational ambiguity, the angle space is
2936
+ | discretized into 10◦ intervals, and the system picks the angle for which the rotated
2937
+ | model best matches the scan S. In practice, we found this refinement step necessary
2938
+ | since our view-dependent scans have coarse angular resolution (K = 6).
2939
+ | Finally, the system uses the positioned proxy model Mi to assess the quality of the
2940
+ | current scan. Specifically, the bounding box of Mi is discretized into 9 × 9 × 9 voxels
+ | and the density of points falling within each voxel is calculated. Voxels where the
+ | matched model has a high density of points (more than the average) but the scan S
+ | contributes insufficient points are highlighted, thus providing guidance for subsequent
+ | acquisitions. The process terminates when there are fewer than 10 such highlighted
+ | voxels, and the best matching model is simply displayed.
2947
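A sketch of the voxel-based guidance just described, assuming the retrieved model and the merged scan are already aligned and scaled. Interpreting "more than the average" as the mean count over the model's occupied voxels, and using `min_scan_pts` to encode "insufficient points", are both assumptions, as are the function names.

```python
import numpy as np

def missing_voxels(model_pts, scan_pts, res=9, min_scan_pts=1):
    """Flag voxels that the model fills densely but the current scan does not.

    The model's bounding box is split into res^3 voxels; a voxel is flagged
    when the model's point count there exceeds the average occupied-voxel
    count while the scan contributes fewer than min_scan_pts points.
    """
    lo, hi = model_pts.min(axis=0), model_pts.max(axis=0)
    scale = (hi - lo) + 1e-9

    def counts(pts):
        ijk = np.clip(((pts - lo) / scale * res).astype(int), 0, res - 1)
        c = np.zeros((res, res, res), dtype=int)
        np.add.at(c, tuple(ijk.T), 1)
        return c

    cm, cs = counts(model_pts), counts(scan_pts)
    thresh = cm[cm > 0].mean()                  # "more than the average"
    return (cm > thresh) & (cs < min_scan_pts)  # scanning stops when < 10 remain

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = rng.uniform(0, 1, (20000, 3))
    scan = model[model[:, 0] < 0.6]             # right-hand side not yet scanned
    flagged = missing_voxels(model, scan)
    print(int(flagged.sum()), "voxels still need coverage")
```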
+ blank |
2948
+ |
2949
+ title | 4.4 Interface Design
2950
+ text | The real-time system guides the user to scan an object and retrieve the closest match.
2951
+ | In our study, we used the Kinect scanner for the acquisition and the retrieval process
2952
+ | took 5-10 seconds/iteration on our unoptimized implementation. The user scans an
2953
+ | object from an operating distance of about 1-3m. The sensor's real-time streams of
+ | the depth pointcloud and color images are visible to the user at all times (see
2955
+ | Figure 4.4).
2956
+ | The user starts scanning by pointing the sensor to the ground plane. The ground
2957
+ | plane is detected if the sensor captures a dominant plane that covers more than 50% of
2958
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 76
2959
+ blank |
2960
+ |
2961
+ |
2962
+ text | the scene. Our system uses this plane to extract the upright direction of the captured
2963
+ | scene. When the ground plane is successfully detected, the user receives an indication
2964
+ | on the screen (Figure 4.4, top-right).
2965
+ | In a separate window, the pointcloud data corresponding to the object being cap-
2966
+ | tured is continuously displayed. The system registers the points using image features
2967
+ | and segments the object by extracting the ground plane. The displayed pointcloud
+ | data is also used to calculate the descriptor and the voxel density. At the end of
+ | the retrieval stage (see Section 4.3), the system retains the match between the
+ | closest model and the current pointcloud data. The pointcloud is overlaid with
+ | two additional cues: (i) missing data in voxels as compared with the closest
+ | matched model, and (ii) the 3D model of the closest match to the object. Based on
2973
+ | this guidance, the user can then acquire the next scan. The system automatically
2974
+ | stops when the matched model is similar to the captured pointcloud.
2975
+ blank |
2976
+ |
2977
+ title | 4.5 Evaluation
2978
+ text | We tested the robustness of the proposed A2h descriptor on synthetically generated
2979
+ | data against available groundtruth. Further, we let novice users use our system
2980
+ | to scan different indoor environments. The real-time guidance allowed the users to
2981
+ | effectively capture the indoor scenes (see supplementary video).
2982
+ blank |
2983
+ text | dataset   # models   average # points/scan
+ | chair        2138          45068
+ | couch        1765         129310
+ | lamp         1805          11600
+ | table        5239          61649
2988
+ blank |
2989
+ text | Table 4.1: Database and scan statistics.
2990
+ blank |
2991
+ |
2992
+ |
2993
+ title | 4.5.1 Model Database
2994
+ text | We considered four categories of objects (i.e., chairs, couches, lamps, tables) in our
2995
+ | implementation. For each category, we downloaded a large number of models from
2996
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 77
2997
+ blank |
2998
+ |
2999
+ |
3000
+ text | the Trimble 3D Warehouse (see Table 4.1) to act as proxy geometry in the online
3001
+ | scanning phase. The models were pre-scaled and moved to the origin. We syntheti-
3002
+ | cally scanned each such model from 12 different viewpoints and computed the A2h
3003
+ | descriptor for each such scan. Note that we placed the camera only above the objects
3004
+ | (altitudes of π/6 and π/3) as the input scans rarely capture the underside of the ob-
3005
+ | jects. We used the Kinect scanner to gather streaming data and used an open source
3006
+ | library [EEH+ 11] to accumulate the input data to produce merged scans.
3007
+ blank |
3008
+ |
3009
+ title | 4.5.2 Retrieval Results with Simulated Data
3010
+ text | The proposed A2h descriptor is effective in retrieving similar shapes in fractions of
3011
+ | seconds. Figures 4.5, 4.6, 4.7, and 4.8 show typical retrieval results. In our tests, we
+ | found the retrieval results to be useful for chairs and couches, which have a wider
+ | variation of angles compared to lamps or tables, whose shapes are almost always
+ | very symmetric.
3015
+ blank |
3016
+ title | Effect of Viewpoints
3017
+ blank |
3018
+ text | The scanned data often have significant parts missing, mainly due to self-occlusion.
3019
+ | We simulated this effect on the A2h descriptor-based retrieval and compared the
3020
+ | performance against retrieval with merged (simulated) scans (Figure 4.9). We found
3021
+ | the retrieval results to be robust and the models sufficiently representative to be used
3022
+ | as proxies for subsequent model assessment.
3023
+ blank |
3024
+ title | Comparison with Other Descriptors
3025
+ blank |
3026
+ text | We also tested existing shape descriptors: the silhouette-based light field descriptor [CTSO03],
+ | local spin images [Joh97], and the D2 descriptor [OFCD02]. In all cases, we found
+ | our A2h descriptor to be more effective in quickly resolving local geometric changes,
+ | particularly for low-quality partial pointclouds. In contrast, we found the light field
+ | descriptor to be more susceptible to noise, local spin images more expensive to com-
3031
+ | pute, and the D2 descriptor less able to distinguish between local variations than our
3032
+ | A2h descriptor (see Figure 4.3).
3033
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 78
3034
+ blank |
3035
+ |
3036
+ |
3037
+ text | We next evaluated the degradation in the retrieval results under perturbations in
3038
+ | sampling density and noise.
3039
+ blank |
3040
+ title | Effect of Density
3041
+ blank |
3042
+ text | During scanning, points are sampled uniformly on the sensor grid, instead of uniformly
3043
+ | on the model surface. This uniform sampling on the sensor grid results in varying
3044
+ | densities of scanned points depending on the viewpoint. Our system compensates for
3045
+ | this effect by assigning sampling probabilities that are inversely proportional to the
+ | local density of the sample points.
3047
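For simulated scans the density d_f of Eq. (4.2) is available per surfel; for a raw merged scan, one possible reading, sketched below, estimates local density from k-nearest-neighbour distances and draws point pairs with probability inversely proportional to it. The k-NN estimate and the names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def density_weights(points, k=8):
    """Per-point sampling weights inversely proportional to local density.

    Density is estimated from the mean distance to the k nearest neighbours
    (brute force O(N^2), fine for a sketch on a few thousand points).
    """
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)   # skip the zero self-distance
    w = knn ** 3                                         # volume ~ r^3, so 1/density ~ r^3
    return w / w.sum()

def sample_pairs(points, n_pairs=10_000, seed=None):
    """Draw point-pair indices with density-compensated probabilities."""
    rng = np.random.default_rng(seed)
    p = density_weights(points)
    return (rng.choice(len(points), n_pairs, p=p),
            rng.choice(len(points), n_pairs, p=p))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dense = rng.normal(0.0, 0.05, (800, 3))     # densely scanned patch
    sparse = rng.normal(1.0, 0.30, (200, 3))    # sparsely scanned patch
    i, _ = sample_pairs(np.vstack([dense, sparse]), 2000)
    print("fraction of draws from the sparse patch:", float(np.mean(i >= 800)))
```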
+ | Figure 4.10 shows the effect of density compensation on the histogram distribu-
3048
+ | tions. We tested two different combinations of viewpoints and compared the distributions
+ | obtained with uniform sampling and with sampling inversely proportional to the
+ | density. Density-aware sampling is indicated by dotted lines. The overall shapes
+ | of the graphs are similar for uniform and density-aware sampling. However, the
+ | absolute peak heights agree across the two view combinations only when density-aware
+ | sampling is used. Hence, our system uses density-aware sampling to achieve robustness to
3054
+ | sampling variations.
3055
+ blank |
3056
+ title | Effect of Noise
3057
+ blank |
3058
+ text | In Figure 4.11, we show the robustness of A2h histograms under noise. Generally, the
3059
+ | histograms become smoother under increasing noise as subtle orientation variations
3060
+ | get masked. For reference, Kinect measurements from a distance range of 1-2m
+ | have noise perturbations comparable to a noise level of 0.005 in the simulated data.
+ | We therefore added synthetic Gaussian noise to the simulated data before calculating
+ | the A2h descriptors, so that the histogram shapes better match those of real scans.
3064
+ blank |
3065
+ |
3066
+ title | 4.5.3 Retrieval Results with Real Data
3067
+ text | Figure 4.12 shows retrieval results on a range of objects (i.e., chairs, couches, lamps,
3068
+ | and tables). Overall we found the guided interface to work well in practice. The
3069
+ | performance was better for chairs and couches, while for lamps and tables, the thin
3070
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 79
3071
+ blank |
3072
+ |
3073
+ |
3074
+ text | structures led to some failure cases. In all cases, the system successfully handled
+ | missing data amounting to as much as 40-60% of the object surface (i.e., roughly half
+ | of the object surface invisible) and responded at interactive rates. Note that for
3077
+ | testing purposes we manually pruned the input database models to leave out models
3078
+ | (if any) that looked very similar to the target objects to be scanned. Please refer to
3079
+ | the supplementary video for the system in action.
3080
+ blank |
3081
+ |
3082
+ title | 4.6 Conclusions
3083
+ text | We have presented a real-time guided scanning setup for online quality assessment of
3084
+ | streaming RGBD data obtained while acquiring indoor environments. The proposed
3085
+ | approach is motivated by three key observations: (i) indoor scenes largely consist of
3086
+ | a few different types of objects, each of which can be reasonably approximated by
3087
+ | commonly available 3D model sets; (ii) data is often missing due to self-occlusions,
3088
+ | and hence such missing regions can be predicted by comparisons against synthetically
3089
+ | scanned database models from multiple viewpoints; and (iii) streaming scan data can
3090
+ | be robustly and effectively compared against simulated scans by a direct comparison
3091
+ | of the distribution of relative local orientations in the two types of scans. The best
3092
+ | retrieved model is then used as a proxy to evaluate the quality of the current scan and
3093
+ | guide subsequent acquisition frames. We have demonstrated the real-time system on
3094
+ | a large number of synthetic and real-world examples, using databases of 3D models
+ | that often number in the thousands.
3096
+ | In the future, we would like to extend our guided system to create online recon-
3097
+ | structions while specifically focusing on generating semantically valid scene models.
3098
+ | Using context information in the form of co-occurrence cues (e.g., a keyboard and
3099
+ | mouse are usually near each other) can prove to be effective. Finally, we plan to use
3100
+ | GPU-optimized code to handle additional categories of 3D models.
3101
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 80
3102
+ blank |
3103
+ |
3104
+ |
3105
+ |
3106
+ text | [Figure 4.3 panels: two query examples, each with rows labeled D2, A2h, and query / aligned model]
3132
+ blank |
3133
+ text | Figure 4.3: Representative shape retrieval results using the D2 descriptor ([OFCD02],
+ | first row), the A2h descriptor introduced in this chapter (Section 4.3.2, second row),
+ | and the aligned models after scan registration (Section 4.3.4, third row) on the top 25
+ | matches from A2h. For each method, we only show the top 4 matches. The D2 and
+ | A2h descriptors (first two rows) are compared by histogram distributions, which is
+ | quick and efficient. Empirically, we observed the A2h descriptor to better capture
+ | local geometric features compared to the D2 descriptor, with local registration further
+ | improving the retrieval quality. The comparison based on 3D alignment (third row)
+ | is more accurate, but requires more computation time and cannot be performed in
+ | real time given the size of our database of models.
3143
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 81
3144
+ blank |
3145
+ |
3146
+ |
3147
+ text | [Figure 4.4 panels: scanning setup; detected ground plane; scanning guidance; current scan; current scan + retrieved model]
3167
+ blank |
3168
+ text | Figure 4.4: The proposed guided real-time scanning setup is simple to use. The
3169
+ | user starts by scanning using a Microsoft Kinect (top-left). The system first detects
3170
+ | the ground plane and the user is notified (top-right). The current pointcloud corre-
3171
+ | sponding to the target object is displayed in the 3D view window, the best matching
3172
+ | database model is retrieved (overlaid in transparent white), and the predicted missing
3173
+ | voxels are highlighted as yellow voxels (middle-right). Based on the provided guid-
3174
+ | ance, the user acquires the next frame of data, and the process continues. Our method
+ | stops when the retrieved shape explains the captured pointcloud well. Finally, the
3176
+ | overlaid 3D shape is highlighted in white (bottom-right). Note that the accumulated
3177
+ | scans have significant parts missing in most scanning steps.
3178
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 82
3179
+ blank |
3180
+ |
3181
+ |
3182
+ |
3183
+ text | Figure 4.5: Retrieval results with simulated data using a chair data set. Given the
3184
+ | model in the first column, the database of 2138 models is searched using the A2h
3185
+ | descriptor, and the top 5 matches are shown.
3186
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 83
3187
+ blank |
3188
+ |
3189
+ |
3190
+ |
3191
+ text | Figure 4.6: Retrieval results with simulated data using a couch data set. Given the
3192
+ | model in the first column, the database of 1765 models is searched using the A2h
3193
+ | descriptor, and the top 5 matches are shown.
3194
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 84
3195
+ blank |
3196
+ |
3197
+ |
3198
+ |
3199
+ text | Figure 4.7: Retrieval results with simulated data using a lamp data set. Given the
3200
+ | model in the first column, the database of 1805 models is searched using the A2h
3201
+ | descriptor, and the top 5 matches are shown.
3202
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 85
3203
+ blank |
3204
+ |
3205
+ |
3206
+ |
3207
+ text | Figure 4.8: Retrieval results with simulated data using a table data set. Given the
3208
+ | model in the first column, the database of 5239 models is searched using the A2h
3209
+ | descriptor, and the top 5 matches are shown.
3210
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 86
3211
+ blank |
3212
+ |
3213
+ |
3214
+ |
3215
+ text | [Figure 4.9 panels: for each query object, the merged scan and its view-dependent scans]
3239
+ blank |
3240
+ text | Figure 4.9: Comparison between retrieval with view-dependent and merged scans.
+ | The models are sorted by matching scores, with lower scores denoting better matches.
+ | The leftmost images show the query scans. Note that the view-dependent scan-based
+ | retrieval is robust even with significant missing regions (∼30-50%). The numbers
+ | in parentheses denote the view index.
3245
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 87
3246
+ blank |
3247
+ |
3248
+ |
3249
+ |
3250
+ text | Figure 4.10: Effect of density-aware sampling on two different combinations of views
+ | (comb1 and comb2). The sampling that considers the density of points is denoted
+ | comb1d and comb2d , respectively.
3253
+ blank |
3254
+ |
3255
+ |
3256
+ |
3257
+ text | Figure 4.11: Effect of noise. The shape of the histogram becomes smoother as the
+ | noise level increases.
3259
+ meta | CHAPTER 4. GUIDED REAL-TIME SCANNING 88
3260
+ blank |
3261
+ |
3262
+ |
3263
+ text | [Figure 4.12 panels: image, accumulated scan, and retrieved proxy model for rows of chairs, couches, lamps, and tables]
3286
+ blank |
3287
+ |
3288
+ text | Figure 4.12: Real-time retrieval results on various datasets. For each set, we show
3289
+ | the image of the object being scanned, the accumulated pointcloud, and the closest
+ | retrieved model, along with the top 25 candidates that are picked from the
3291
+ | database of thousands of models using the proposed A2h descriptor.
3292
+ meta | Chapter 5
3293
+ blank |
3294
+ title | Conclusions
3295
+ blank |
3296
+ text | 3-D reconstruction of indoor environments is a challenging problem because of the
+ | complexity and variety of the objects present, and the frequent changes in the positions
+ | of objects made by the people who inhabit the space. Building on recent technology,
+ | the work presented in this dissertation frames the reconstruction of indoor environments
+ | as a set of light-weight systems.
3301
+ | RGB-D cameras (e.g., Microsoft Kinect) are a new type of sensor and the standard
3302
+ | for utilizing the data is not yet fully established. Still, the sensor is revolutionary
3303
+ | because it is an affordable technology that can capture the 3-D data of everyday
3304
+ | environments at video frame rate. This dissertation covers fast pipelines that allow
+ | real-time interaction between the user and the system. However, such data
3306
+ | comes at the price of complex noise characteristics.
3307
+ | To reconstruct the challenging indoor structures with limited data, we imposed
3308
+ | different geometric priors depending on the target applications and aimed for high-
3309
+ | level understanding. In Chapter 2, we presented a pipeline to acquire floor plans using
+ | large planes as a geometric prior. We followed the well-known Manhattan-world
+ | assumption and utilized user feedback to overcome ambiguous situations and specify
+ | the important planes to be included in the model. Chapter 3 described our use
+ | of simple models of repeating objects with deformation modes. Public places with
+ | many repeating objects can be reconstructed by recovering the low-dimensional
3315
+ | deformation and placement information. Chapter 4 showed how we retrieve complex
3316
+ blank |
3317
+ |
3318
+ meta | 89
3319
+ | CHAPTER 5. CONCLUSIONS 90
3320
+ blank |
3321
+ |
3322
+ |
3323
+ text | shape of objects with the help of a large database of 3-D models, as we developed a
+ | descriptor that can be computed and searched efficiently and allows online quality
+ | assessment to be presented to the user.
3326
+ | Each of the pipelines presented in these chapters targets a specific application
3327
+ | and has been evaluated accordingly. The work of the dissertation can be extended
3328
+ | into other possible real-life applications that can connect actual environments with
3329
+ | the virtual world. The depth data from RGB-D cameras is easy to acquire, but we
3330
+ | still do not know how to make full use of the massive amount of information produced.
3331
+ | The potential applications can benefit from better understanding and handling of the
3332
+ | data. As one extension, we are interested in scaling up the database of models and data,
+ | with special attention paid to the underlying data structures. The research community
+ | and others would also benefit from advances in the use of reliable depth and color
+ | features in the new type of data obtained from RGB-D sensors, in addition to the
+ | descriptor presented here.
3337
+ meta | Bibliography
3338
+ blank |
3339
+ ref | [BAD10] Soonmin Bae, Aseem Agarwala, and Fredo Durand. Computational
3340
+ | rephotography. ACM Trans. Graph., 29(5), 2010.
3341
+ blank |
3342
+ ref | [BM92] Paul J. Besl and Neil D. McKay. A method for registration of 3-D
3343
+ | shapes. IEEE PAMI, 14(2):239–256, 1992.
3344
+ blank |
3345
+ ref | [CTSO03] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On
3346
+ | visual similarity based 3d model retrieval. CGF, 22(3):223–232, 2003.
3347
+ blank |
3348
+ ref | [CY99] James M. Coughlan and A. L. Yuille. Manhattan world: Compass
3349
+ | direction from a single image by bayesian inference. In ICCV, pages
3350
+ | 941–947, 1999.
3351
+ blank |
3352
+ ref | [CZ11] Will Chang and Matthias Zwicker. Global registration of dynamic range
3353
+ | scans for articulated model reconstruction. ACM TOG, 30(3):26:1–
3354
+ | 26:15, 2011.
3355
+ blank |
3356
+ ref | [Dey07] T. K. Dey. Curve and Surface Reconstruction : Algorithms with Math-
3357
+ | ematical Analysis. Cambridge University Press, 2007.
3358
+ blank |
3359
+ ref | [DHR+ 11] Hao Du, Peter Henry, Xiaofeng Ren, Marvin Cheng, Dan B. Goldman,
3360
+ | Steven M. Seitz, and Dieter Fox. Interactive 3d modeling of indoor
3361
+ | environments with a consumer depth camera. In Proc. Ubiquitous com-
3362
+ | puting, pages 75–84, 2011.
3363
+ blank |
3364
+ ref | [EEH+ 11] Nikolas Engelhard, Felix Endres, Jürgen Hess, Jürgen Sturm, and Wol-
3365
+ | fram Burgard. Real-time 3D visual SLAM with a hand-held RGB-D
3366
+ blank |
3367
+ meta | 91
3368
+ | BIBLIOGRAPHY 92
3369
+ blank |
3370
+ |
3371
+ |
3372
+ ref | camera. In Proc. of the RGB-D Workshop on 3D Perception in Robotics
3373
+ | at the European Robotics Forum, 2011.
3374
+ blank |
3375
+ ref | [FB81] Martin A. Fischler and Robert C. Bolles. Random sample consensus:
3376
+ | a paradigm for model fitting with applications to image analysis and
3377
+ | automated cartography. Commun. ACM, 24(6):381–395, June 1981.
3378
+ blank |
3379
+ ref | [FCSS09] Y. Furukawa, B. Curless, S.M. Seitz, and R. Szeliski. Reconstructing
3380
+ | building interiors from images. In ICCV, pages 80–87, 2009.
3381
+ blank |
3382
+ ref | [FSH11] Matthew Fisher, Manolis Savva, and Pat Hanrahan. Characterizing
3383
+ | structural relationships in scenes using graph kernels. ACM TOG,
3384
+ | 30(4):34:1–34:11, 2011.
3385
+ blank |
3386
+ ref | [GCCMC08] Andrew P. Gee, Denis Chekhlov, Andrew Calway, and Walterio Mayol-
3387
+ | Cuevas. Discovering higher level structure in visual slam. IEEE Trans-
3388
+ | actions on Robotics, 24(5):980–990, October 2008.
3389
+ blank |
3390
+ ref | [GEH10] Abhinav Gupta, Alexei A. Efros, and Martial Hebert. Blocks world re-
3391
+ | visited: Image understanding using qualitative geometry and mechan-
3392
+ | ics. In ECCV, pages 482–496, 2010.
3393
+ blank |
3394
+ ref | [HCI+ 11] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab,
3395
+ | and V. Lepetit. Multimodal templates for real-time detection of texture-
3396
+ | less objects in heavily cluttered scenes. ICCV, 2011.
3397
+ blank |
3398
+ ref | [HKG11] Qixing Huang, Vladlen Koltun, and Leonidas Guibas. Joint-shape seg-
3399
+ | mentation with linear programming. ACM TOG (SIGGRAPH Asia),
3400
+ | 30(6):125:1–125:11, 2011.
3401
+ blank |
3402
+ ref | [HKH+ 12] Peter Henry, Michael Krainin, Evan Herbst, Xiaofeng Ren, and Dieter
3403
+ | Fox. RGBD mapping: Using kinect-style depth cameras for dense 3D
3404
+ | modeling of indoor environments. I. J. Robotic Res., 31(5):647–663,
3405
+ | 2012.
3406
+ meta | BIBLIOGRAPHY 93
3407
+ blank |
3408
+ |
3409
+ |
3410
+ ref | [IKH+ 11] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard
3411
+ | Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Free-
3412
+ | man, Andrew Davison, and Andrew Fitzgibbon. Kinectfusion: real-time
3413
+ | 3D reconstruction and interaction using a moving depth camera. In
3414
+ | Proc. UIST, pages 559–568, 2011.
3415
+ blank |
3416
+ ref | [Joh97] Andrew Johnson. Spin-Images: A Representation for 3-D Surface
3417
+ | Matching. PhD thesis, Robotics Institute, CMU, 1997.
3418
+ blank |
3419
+ ref | [JTRS12] Arjun Jain, Thorsten Thormahlen, Tobias Ritschel, and Hans-Peter Sei-
3420
+ | del. Exploring shape variations by 3d-model decomposition and part-
3421
+ | based recombination. CGF (EUROGRAPHICS), 31(2):631–640, 2012.
3422
+ blank |
3423
+ ref | [KAJS11] H.S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic la-
3424
+ | beling of 3D point clouds for indoor scenes. In NIPS, pages 244–252,
3425
+ | 2011.
3426
+ blank |
3427
+ ref | [KDS+ 12] Young Min Kim, Jennifer Dolson, Michael Sokolsky, Vladlen Koltun,
3428
+ | and Sebastian Thrun. Interactive acquisition of residential floor plans.
3429
+ | In ICRA, pages 3055–3062, 2012.
3430
+ blank |
3431
+ ref | [KMYG12] Young Min Kim, Niloy J. Mitra, Dong-Ming Yan, and Leonidas Guibas.
3432
+ | Acquiring 3d indoor environments with variability and repetition. ACM
3433
+ | TOG, 31(6), 2012.
3434
+ blank |
3435
+ ref | [LAGP09] Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. Robust
3436
+ | single-view geometry and motion reconstruction. ACM TOG (SIG-
3437
+ | GRAPH), 28(5):175:1–175:10, 2009.
3438
+ blank |
3439
+ ref | [LGHK10] David Changsoo Lee, Abhinav Gupta, Martial Hebert, and Takeo
3440
+ | Kanade. Estimating spatial layout of rooms using volumetric reasoning
3441
+ | about objects and surfaces. In NIPS, pages 1288–1296, 2010.
3442
+ blank |
3443
+ ref | [LH05] Marius Leordeanu and Martial Hebert. A spectral technique for cor-
3444
+ | respondence problems using pairwise constraints. In ICCV, volume 2,
3445
+ | pages 1482–1489, 2005.
3446
+ meta | BIBLIOGRAPHY 94
3447
+ blank |
3448
+ |
3449
+ |
3450
+ ref | [MFO+ 07] Niloy J. Mitra, Simon Flory, Maks Ovsjanikov, Natasha Gelfand,
3451
+ | Leonidas Guibas, and Helmut Pottmann. Dynamic geometry registra-
3452
+ | tion. In Symp. on Geometry Proc., pages 173–182, 2007.
3453
+ blank |
3454
+ ref | [Mic10] Microsoft. Kinect for Xbox 360. http://www.xbox.com/en-US/kinect,
3455
+ | November 2010.
3456
+ blank |
3457
+ ref | [MM09] Pranav Mistry and Pattie Maes. Sixthsense: a wearable gestural in-
3458
+ | terface. In SIGGRAPH ASIA Art Gallery & Emerging Technologies,
3459
+ | page 85, 2009.
3460
+ blank |
3461
+ ref | [MPWC12] Niloy J. Mitra, Mark Pauly, Michael Wand, and Duygu Ceylan. Symme-
3462
+ | try in 3d geometry: Extraction and applications. In EUROGRAPHICS
3463
+ | State-of-the-art Report, 2012.
3464
+ blank |
3465
+ ref | [MYY+ 10] N. Mitra, Y.-L. Yang, D.-M. Yan, W. Li, and M. Agrawala. Illus-
3466
+ | trating how mechanical assemblies work. ACM TOG (SIGGRAPH),
3467
+ | 29(4):58:1–58:12, 2010.
3468
+ blank |
3469
+ ref | [MZL+ 09] Ravish Mehra, Qingnan Zhou, Jeremy Long, Alla Sheffer, Amy Gooch,
3470
+ | and Niloy J. Mitra. Abstraction of man-made shapes. ACM TOG
3471
+ | (SIGGRAPH Asia), 28(5):#137, 1–10, 2009.
3472
+ blank |
3473
+ ref | [ND10] Richard A. Newcombe and Andrew J. Davison. Live dense reconstruc-
3474
+ | tion with a single moving camera. In CVPR, 2010.
3475
+ blank |
3476
+ ref | [NXS12] Liangliang Nan, Ke Xie, and Andrei Sharf. A search-classify approach
3477
+ | for cluttered indoor scene understanding. ACM TOG (SIGGRAPH
3478
+ | Asia), 31(6), 2012.
3479
+ blank |
3480
+ ref | [OFCD02] Robert Osada, Thomas Funkhouser, Bernard Chazelle, and David
3481
+ | Dobkin. Shape distributions. ACM Transactions on Graphics,
3482
+ | 21(4):807–832, October 2002.
3483
+ meta | BIBLIOGRAPHY 95
3484
+ blank |
3485
+ |
3486
+ |
3487
+ ref | [OLGM11] Maks Ovsjanikov, Wilmot Li, Leonidas Guibas, and Niloy J. Mitra.
3488
+ | Exploration of continuous variability in collections of 3D shapes. ACM
3489
+ | TOG (SIGGRAPH), 30(4):33:1–33:10, 2011.
3490
+ blank |
3491
+ ref | [PMG+ 05] Mark Pauly, Niloy J. Mitra, Joachim Giesen, Markus Gross, and
3492
+ | Leonidas J. Guibas. Example-based 3D scan completion. In Symp.
3493
+ | on Geometry Proc., pages 23–32, 2005.
3494
+ blank |
3495
+ ref | [PMW+ 08] M. Pauly, N. J. Mitra, J. Wallner, H. Pottmann, and L. Guibas. Discov-
3496
+ | ering structural regularity in 3D geometry. ACM TOG (SIGGRAPH),
3497
+ | 27(3):43:1–43:11, 2008.
3498
+ blank |
3499
+ ref | [RBF12] Xiaofeng Ren, Liefeng Bo, and D. Fox. RGB-D scene labeling: Features
3500
+ | and algorithms. In CVPR, pages 2759 – 2766, 2012.
3501
+ blank |
3502
+ ref | [RHHL02] Szymon Rusinkiewicz, Olaf Hall-Holt, and Marc Levoy. Real-time 3D
3503
+ | model acquisition. ACM TOG (SIGGRAPH), 21(3):438–446, 2002.
3504
+ blank |
3505
+ ref | [RL01] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the icp
3506
+ | algorithm. In Proc. 3DIM, 2001.
3507
+ blank |
3508
+ ref | [RTG98] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for
3509
+ | distributions with applications to image databases. In ICCV, pages
3510
+ | 59–66, 1998.
3511
+ blank |
3512
+ ref | [SFC+ 11] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark
3513
+ | Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-
3514
+ | time human pose recognition in parts from a single depth image. In
3515
+ | CVPR, pages 1297–1304, 2011.
3516
+ blank |
3517
+ ref | [SvKK+ 11] Oana Sidi, Oliver van Kaick, Yanir Kleiman, Hao Zhang, and Daniel
3518
+ | Cohen-Or. Unsupervised co-segmentation of a set of shapes via
3519
+ | descriptor-space spectral clustering. ACM TOG (SIGGRAPH Asia),
3520
+ | 30(6):126:1–126:10, 2011.
3521
+ meta | BIBLIOGRAPHY 96
3522
+ blank |
3523
+ |
3524
+ |
3525
+ ref | [SWK07] Ruwen Schnabel, Roland Wahl, and Reinhard Klein. Efficient RANSAC
3526
+ | for point-cloud shape detection. CGF (EUROGRAPHICS), 26(2):214–
3527
+ | 226, 2007.
3528
+ blank |
3529
+ ref | [SWWK08] Ruwen Schnabel, Raoul Wessel, Roland Wahl, and Reinhard Klein.
3530
+ | Shape recognition in 3D point-clouds. In Proc. WSCG, pages 65–72,
3531
+ | 2008.
3532
+ blank |
3533
+ ref | [SXZ+ 12] Tianjia Shao, Weiwei Xu, Kun Zhou, Jingdong Wang, Dongping Li, and
3534
+ | Baining Guo. An interactive approach to semantic modeling of indoor
3535
+ | scenes with an RGBD camera. ACM TOG (SIGGRAPH Asia), 31(6),
3536
+ | 2012.
3537
+ blank |
3538
+ ref | [Thr02] S. Thrun. Robotic mapping: A survey. In G. Lakemeyer and B. Nebel,
3539
+ | editors, Exploring Artificial Intelligence in the New Millenium. Morgan
3540
+ | Kaufmann, 2002.
3541
+ blank |
3542
+ ref | [TMHF00] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W.
3543
+ | Fitzgibbon. Bundle adjustment - a modern synthesis. In Proceedings of
3544
+ | the International Workshop on Vision Algorithms: Theory and Practice,
3545
+ | ICCV ’99. Springer-Verlag, 2000.
3546
+ blank |
3547
+ ref | [TSS10] R. Triebel, J. Shin, and R. Siegwart. Segmentation and unsupervised
3548
+ | part-based discovery of repetitive objects. In Proceedings of Robotics:
3549
+ | Science and Systems, 2010.
3550
+ blank |
3551
+ ref | [TW05] Sebastian Thrun and Ben Wegbreit. Shape from symmetry. In ICCV,
3552
+ | pages 1824–1831, 2005.
3553
+ blank |
3554
+ ref | [VAB10] Carlos A. Vanegas, Daniel G. Aliaga, and Bedrich Benes. Building
3555
+ | reconstruction using manhattan-world grammars. In CVPR, pages 358–
3556
+ | 365, 2010.
3557
+ blank |
3558
+ ref | [Vil03] C. Villani. Topics in Optimal Transportation. Graduate Studies in
3559
+ | Mathematics. American Mathematical Society, 2003.
3560
+ meta | BIBLIOGRAPHY 97
3561
+ blank |
3562
+ |
3563
+ |
3564
+ ref | [XLZ+ 10] Kai Xu, Honghua Li, Hao Zhang, Daniel Cohen-Or, Yueshan Xiong,
3565
+ | and Zhiquan Cheng. Style-content separation by anisotropic part scales.
3566
+ | ACM TOG (SIGGRAPH Asia), 29(5):184:1–184:10, 2010.
3567
+ blank |
3568
+ ref | [XS12] Yu Xiang and Silvio Savarese. Estimating the aspect layout of object
3569
+ | categories. In CVPR, pages 3410–3417, 2012.
3570
+ blank |
3571
+ ref | [XZZ+ 11] Kai Xu, Hanlin Zheng, Hao Zhang, Daniel Cohen-Or, Ligang Liu, and
3572
+ | Yueshan Xiong. Photo-inspired model-driven 3D object modeling. ACM
3573
+ | TOG (SIGGRAPH), 30(4):80:1–80:10, 2011.
3574
+ blank |
3575
+ ref | [ZCC+ 12] Youyi Zheng, Xiang Chen, Ming-Ming Cheng, Kun Zhou, Shi-Min Hu,
3576
+ | and Niloy J. Mitra. Interactive images: Cuboid proxies for smart image
3577
+ | manipulation. ACM TOG (SIGGRAPH), 31(4):99:1–99:11, 2012.
3578
+ blank |